Getting Started¶
This guide walks you from a fresh checkout to a running benchmark in about five minutes.
Prerequisites¶
- Python 3.12+ with uv installed
- An OpenRouter API key (used for both the subject model and the judge)
- Docker — only required for
openclawruns; not needed forllm_probe
Install¶
# clone the repo
git clone <repo-url>
cd benchmark-openclaw-llm
# install all dependencies (including dev tools)
uv sync --group dev
Set your API key¶
The CLI loads .env from the repository root automatically:
Or export it directly:
Run the llm_probe example¶
The repository ships a ready-to-run llm_probe campaign:
uv run pae run-eval \
--suite llm_probe_examples \
--run-profile llm_probe_examples \
--evaluation-profile judge_gpt54_mini
This single command:
- Runs two test cases against
minimax/minimax-m2.7via OpenRouter - Evaluates each result with a GPT-5.4-mini judge
- Reports scores, latency, and cost to the terminal
- Writes a score-vs-cost chart PNG to
outputs/charts/
The terminal output looks like:
Suite: llm_probe_examples
Run profile: llm_probe_examples
Evaluation: judge_gpt54_mini
Run cost: $0.0003
Evaluation cost: $0.0018
Total cost: $0.0021
MODEL CASE RUN EVAL SCORE LATENCY_S TOTAL_COST
───────────── ──────────────────────────── ──── ──── ───── ───────── ──────────
minimax_m27 llm_probe_tool_example exec exec 8.50 12.3 $0.001
minimax_m27 llm_probe_browser_example exec exec 7.00 9.1 $0.001
Understand the output files¶
After the run, four kinds of files appear under outputs/:
outputs/
├── charts/
│ └── judge_gpt54/
│ └── score_cost.png ← score vs cost bubble chart
├── runs/
│ └── suit_llm_probe_examples/
│ └── run_profile_<fp6>/
│ └── minimax_m27/
│ └── llm_probe_tool_example/
│ ├── run_1.json ← raw run trace + token usage
│ └── run_1.fingerprint_input.json ← what was hashed
└── evaluations/
└── suit_llm_probe_examples/
└── evaluation_profile_<fp6>/
└── eval_profile_judge_gpt54_<fp6>/
└── minimax_m27/
└── llm_probe_tool_example/
├── evaluation_result_summary_1.md ← start here
├── judge_1.prompt.debug.md ← exact prompt shown to the judge
└── raw_outputs/
├── final_result_1.json ← hybrid score breakdown
├── judge_1.json ← raw judge response
└── judge_1.prompt.user.json ← structured judge payload
Start reading from evaluation_result_summary_1.md — it is a human-readable Markdown file with the score, the judge's evidence, and the dimension breakdown.
To see exactly what the judge saw, open judge_1.prompt.debug.md.
Run the OpenClaw example¶
OpenClaw evaluates a full autonomous agent. It requires Docker and a container image pull:
uv run pae run-eval \
--suite openclaw_examples \
--run-profile openclaw_examples \
--evaluation-profile judge_gpt54_mini
The framework:
- Materializes an ephemeral workspace from
configs/agents/basic_agent/ - Generates a per-run
openclaw.jsonconfig file in that workspace - Invokes
docker run ghcr.io/openclaw/openclaw:2026.4.15 openclaw ... - Captures the workspace diff, logs, and key outputs as evidence
- Evaluates them through the same deterministic + judge pipeline
→ See Minimal OpenClaw example for the full layout.
What happens on the second run?¶
Run the same command again:
uv run pae run-eval \
--suite llm_probe_examples \
--run-profile llm_probe_examples \
--evaluation-profile judge_gpt54_mini
The RUN and EVAL columns now show reuse instead of exec. No tokens were spent. The framework compared the stored fingerprints with the current config and found exact matches.
This is the fingerprint reuse system. → Read how fingerprints work
Core mental model¶
Four YAML files define a complete benchmark campaign:
| File | Answers |
|---|---|
configs/cases/<id>/test.yaml or configs/cases/<group>/<id>/test.yaml |
What to test |
configs/suites/<id>.yaml |
Which cases and which models |
configs/run_profiles/<id>.yaml |
How to execute (temperature, retries, repetitions…) |
configs/evaluation_profiles/<id>.yaml |
How to judge and aggregate scores |
You can mix and match: the same case can appear in multiple suites, the same suite can be evaluated with different judge profiles, and the same run profile can be reused across suites.
→ Config model — visual diagram of how they fit together
Next steps¶
- Concepts — scoring dimensions, judge-first evaluation, and reuse
- Configuration reference — complete YAML field reference
- CLI reference — all
paecommands and flags - Fingerprints & reuse — how to force a re-run, what changes a fingerprint
- Runnable examples — walk through the shipped configs