Runnable Examples¶

This repository ships two small example campaigns that are designed to be run as-is and used as starting points for your own benchmarks.

Both are intentionally minimal: one model, a small set of cases, one judge call per case.

For documentation purposes, regenerated artifacts for both campaigns are also committed under outputs/. Treat them as example outputs that show the expected layout and reporting format of a real run.

llm_probe campaign¶

Tests a raw LLM with tool access.

Files¶

| File | Path |
| --- | --- |
| Tool case | `configs/cases/llm_probe_tool_example/test.yaml` |
| Browser case | `configs/cases/llm_probe_browser_example/test.yaml` |
| Suite | `configs/suites/llm_probe_examples.yaml` |
| Run profile | `configs/run_profiles/llm_probe_examples.yaml` |
| Evaluation profile | `configs/evaluation_profiles/judge_gpt54_mini.yaml` |

Command¶

uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

What it costs (approximate)¶

Run model: minimax/minimax-m2.7 — very cheap
Judge: openai/gpt-5.4-mini — 1 judge call per case
Total: roughly $0.002–$0.005 for both cases

OpenClaw campaign¶

Tests a full autonomous agent running in Docker.

Files¶

| File | Path |
| --- | --- |
| Reusable agent | `configs/agents/basic_agent/` |
| Tool case | `configs/cases/openclaw_tool_example/test.yaml` |
| Browser case | `configs/cases/openclaw_browser_example/test.yaml` |
| Multiturn case | `configs/cases/openclaw_multiturn_example/test.yaml` |
| Suite | `configs/suites/openclaw_examples.yaml` |
| Run profile | `configs/run_profiles/openclaw_examples.yaml` |
| Evaluation profile | `configs/evaluation_profiles/judge_gpt54_mini.yaml` |

Command¶

uv run pae run-eval \
  --suite openclaw_examples \
  --run-profile openclaw_examples \
  --evaluation-profile judge_gpt54_mini

Prerequisites¶

Docker with network access (to pull ghcr.io/openclaw/openclaw:2026.4.15)
OPENROUTER_API_KEY set on the host (forwarded into the container)
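
Before launching the OpenClaw campaign, it can help to confirm both prerequisites from the host. The following is a minimal preflight sketch and is not part of the framework: the function name check_prerequisites and its messages are hypothetical. It only checks that a docker executable is on PATH and that OPENROUTER_API_KEY is non-empty. It does not verify network access or that the image can actually be pulled.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper for illustration; not provided by the framework.
    """
    env = os.environ if env is None else env
    problems = []
    # The OpenClaw campaign runs the agent in Docker, so the CLI must be on PATH.
    if which("docker") is None:
        problems.append("docker executable not found on PATH")
    # The key is read on the host and forwarded into the container.
    if not env.get("OPENROUTER_API_KEY"):
        problems.append("OPENROUTER_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(f"missing prerequisite: {problem}")
```

The env and which parameters exist only so the check can be exercised without touching the real environment.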

What gets written¶

After running either campaign, the outputs/ directory contains:

outputs/
├── charts/
│   └── judge_gpt54/
│       └── score_cost.png                       ← quality vs cost bubble chart
├── runs/
│   └── suit_<suite_id>/
│       └── run_profile_<fp6>/
│           └── <model_id>/
│               └── <case_id>/
│                   ├── run_1.json               ← raw run trace
│                   ├── run_1.artifacts/         ← OpenClaw evidence files (openclaw only)
│                   └── run_1.fingerprint_input.json
└── evaluations/
    └── suit_<suite_id>/
        └── evaluation_profile_<fp6>/
            └── eval_profile_judge_gpt54_<fp6>/
                └── <model_id>/
                    └── <case_id>/
                        ├── evaluation_result_summary_1.md   ← start here
                        ├── judge_1.prompt.debug.md          ← exact judge prompt
                        └── raw_outputs/
                            ├── final_result_1.json          ← structured final evaluation result
                            ├── judge_1.json                 ← raw judge response
                            └── judge_1.prompt.user.json     ← structured subject view

The checked-in example artifacts in this repository live under the following paths:

outputs/runs/suit_llm_probe_examples/
outputs/evaluations/suit_llm_probe_examples/
outputs/runs/suit_openclaw_examples/
outputs/evaluations/suit_openclaw_examples/
outputs/charts/judge_gpt54/
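
To jump straight to the files marked "start here" in the tree above, a short script can list every evaluation summary under outputs/evaluations/. This is a sketch that relies only on the file name pattern shown in the layout; the helper name find_summaries is hypothetical and not part of the framework.

```python
from pathlib import Path


def find_summaries(outputs_dir="outputs"):
    """Return all evaluation_result_summary_*.md files, sorted by path.

    Hypothetical helper; assumes the directory layout shown in the tree above.
    """
    evaluations = Path(outputs_dir) / "evaluations"
    if not evaluations.is_dir():
        # Nothing has been evaluated yet, or the path is wrong.
        return []
    # Search recursively so the suite, profile, model, and case levels
    # do not need to be known in advance.
    return sorted(evaluations.rglob("evaluation_result_summary_*.md"))


if __name__ == "__main__":
    for summary in find_summaries():
        print(summary)
```

Run from the repository root, this prints one path per summary file, including the checked-in example artifacts.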

Second run: reuse in action¶

Run the same command again:

uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

The terminal output now shows reuse in both the RUN and EVAL columns. No tokens are spent. The framework computes the expected fingerprint for each (model, case, repetition), finds a matching artifact in outputs/, and loads it directly.
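
The reuse decision described above can be pictured with a small sketch. Everything in it is an assumption made for illustration: the framework's real fingerprint inputs, hash function, and storage are not documented here, so the sketch uses SHA-256 over canonical JSON, truncates it to six hex characters, and stands in for outputs/ with an in-memory dict. The names fingerprint and run_or_reuse and the "run" status label are hypothetical.

```python
import hashlib
import json


def fingerprint(inputs, length=6):
    """Hash a JSON-serializable dict into a short hex fingerprint.

    Assumption: SHA-256 over canonical JSON; the real scheme may differ.
    """
    # Sorting keys makes the hash independent of dict insertion order.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


def run_or_reuse(inputs, cache, execute):
    """Return (result, status); call execute only when nothing matches."""
    key = fingerprint(inputs)
    if key in cache:
        # A stored artifact matches the expected fingerprint: load it.
        return cache[key], "reuse"
    cache[key] = execute(inputs)
    return cache[key], "run"


if __name__ == "__main__":
    cache = {}  # stands in for artifacts stored under outputs/
    inputs = {
        "model": "minimax/minimax-m2.7",
        "case": "llm_probe_tool_example",
        "repetition": 1,
    }
    print(run_or_reuse(inputs, cache, lambda i: "trace")[1])  # executes
    print(run_or_reuse(inputs, cache, lambda i: "trace")[1])  # reuses
```

The point of the sketch is the control flow: identical inputs produce an identical key, so the second call never invokes the expensive step.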

→ Fingerprints & reuse

Extending the examples¶

Add a new case¶

Create configs/cases/my_case/test.yaml or configs/cases/<group>/my_case/test.yaml
Add my_case to case_selection.include_case_ids in the suite

The next run computes only the new case. Existing results are reused.

Add a new model¶

Add an entry to models: in the suite YAML
Re-run — only the new model's cases are executed

Increase repetitions¶

Set run_repetitions: 3 in the run profile. This changes the run profile fingerprint, so results are written to a new directory and all cases are re-run with 3 repetitions. The report averages scores across repetitions.
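
As a rough illustration of averaging across repetitions, the sketch below takes per-repetition scores for each case and reports their arithmetic mean. It assumes a plain unweighted mean, which the report may or may not use; the function name, case ids, and scores are all made up for the example.

```python
from statistics import mean


def average_scores(scores_by_case):
    """Map each case id to the mean of its per-repetition scores.

    Assumption: a plain arithmetic mean; the real report may aggregate differently.
    """
    return {case_id: mean(scores) for case_id, scores in scores_by_case.items()}


if __name__ == "__main__":
    # Made-up scores for three repetitions of two hypothetical cases.
    example = {"case_a": [0.5, 1.0, 0.75], "case_b": [1.0, 1.0, 1.0]}
    print(average_scores(example))
```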

Change the judge model¶

Edit model: in the evaluation profile. This changes the evaluation fingerprint, so a new eval_profile_<fp6> directory is created and all evaluations are re-run with the new judge. Run artifacts are reused unchanged.
