Concepts¶
This page explains the core ideas behind personal_agent_eval — what it measures, why it is designed the way it is, and how the pieces fit together.
What is being evaluated?¶
Every test case asks a model or agent to do something, then measures:
- Did it do the right thing? (task completion)
- Did it do it the right way? (process quality)
- How much did it help beyond the basics? (autonomy, closeness, efficiency, spark)
These are captured as six scoring dimensions on a 0–10 scale.
The six dimensions¶
| Dimension | What it measures |
|---|---|
task |
Whether the output fulfills the stated goal — correct content, correct files, correct format |
process |
Whether the agent followed a sound approach — used the right tools, respected constraints, didn't hallucinate steps |
autonomy |
Whether the agent operated independently — made sensible decisions without over-asking or hand-holding |
closeness |
Whether the output matched what a good human response would look like — tone, framing, completeness |
efficiency |
Whether the agent achieved the goal with reasonable resource use — no unnecessary tool calls, no verbose noise |
spark |
Whether the response showed something noteworthy — useful insight, elegant shortcut, thoughtful initiative |
These dimensions are scored by the judge, with deterministic checks preserved as supporting evidence.
Final score
The final_score (0–10) comes from the judge's overall assessment (judge_overall.score), not from a weighted average of the six dimensions. The dimensions are there for diagnosis — to understand where a model is strong or weak. The overall score is the judge's holistic verdict after reviewing all evidence.
Two evaluation layers¶
Deterministic layer¶
Runs directly against the stored RunArtifact — no LLM required. Examples:
- Did the final response exist? (
final_response_present) - Was the right file created? (
file_contains) - Was the tool called the expected number of times? (
tool_call_count)
These checks are fast, stable, and free. They produce hard pass/fail signals that help the judge and help you debug what happened in a run.
Judge layer¶
An LLM judge (configured in the evaluation profile) reads a compact, structured view of the run and scores each dimension. The judge sees:
- the original task (messages + expectations)
- the final response
- the tool activity summary
- key generated artifacts (for OpenClaw runs)
- the deterministic check summary
The judge does not see internal infrastructure details (run IDs, container image names, file paths, raw provider metadata). The prompt is designed to surface only what matters for evaluation.
The exact prompt sent to the judge is persisted as judge_1.prompt.debug.md next to the evaluation results, so you can always audit what was evaluated and how.
Judge-first evaluation¶
The evaluation flow has two layers, but only one scorer:
- Deterministic checks create stable evidence and summaries.
- The judge reads that evidence, plus the task and run outputs, and assigns the six dimension scores and the overall
final_score.
In other words: deterministic checks inform the evaluation, but the judge decides the score.
→ Hybrid evaluation — artifact-level reference
Runner types¶
llm_probe¶
llm_probe sends the test case input directly to a model via OpenRouter and orchestrates a tool-use loop. It is the simplest runner: one API call (or a few for multi-turn tool use), one RunArtifact.
Tools available: exec_shell, read_file, write_file, web_search (and others configured in input.context.llm_probe.tools).
openclaw¶
openclaw runs a full autonomous agent inside a pinned Docker container. The framework:
- Copies a workspace template from
configs/agents/<agent_id>/workspace/ - Generates a
openclaw.jsonconfig in the workspace - Invokes the container:
docker run <image> openclaw agent run ... - Captures the workspace diff, logs, key output files, and session trace as evidence
- Evaluates the evidence through the same deterministic + judge pipeline
This lets you benchmark real agent behavior — including multi-step planning, file creation, error recovery — in a reproducible, sandboxed environment.
Incremental campaigns¶
A campaign is the combination of a suite, a run profile, and the models in that suite. Campaigns are stored under outputs/ with paths derived from the suite ID and a fingerprint of the run profile.
When you run the same campaign twice, the framework checks the stored fingerprint for each (model, case, repetition) combination. If it matches, that result is reused. If it does not, only that combination is re-executed.
This means you can:
- Add a new case to a suite without re-running existing cases
- Increase
run_repetitionsand only the new repetitions are run - Change
temperatureand get a new fingerprint → new directory → all cases re-run (but old results preserved)
Repetitions¶
Both runs and evaluations support repetitions:
run_repetitions(inrun_profile.yaml) — how many times to execute each case. Useful for measuring model consistency. Each repetition is stored as a separaterun_N.json.judge repetitions(inevaluation_profile.yaml) — how many times to call the judge per case. Scores are aggregated (typically median) across successful iterations.
When multiple run repetitions exist, the workflow takes the mean final_score across them before reporting.
Cost tracking¶
Every run and evaluation records token usage and estimated USD cost. The CLI reports:
RUN_COST— tokens spent by the subject modelEVAL_COST— tokens spent by the judgeTOTAL_COST— sum of both
The structured JSON output also includes per-model and per-suite totals. This makes it easy to compare models not just on quality but on quality-per-dollar.