Config Model¶
personal_agent_eval is driven by five YAML configuration surfaces. Each one answers a different question, and together they define a complete benchmark campaign.
| Config type | File path | Answers |
|---|---|---|
| Test case | configs/cases/<case_id>/test.yaml or grouped as configs/cases/<group>/<case_id>/test.yaml |
What to test |
| Suite | configs/suites/<suite_id>.yaml |
Which cases and models |
| Run profile | configs/run_profiles/<profile_id>.yaml |
How to execute |
| Evaluation profile | configs/evaluation_profiles/<profile_id>.yaml |
How to judge |
| OpenClaw agent | configs/agents/<agent_id>/agent.yaml + workspace/ |
Which reusable agent workspace |
How the four core configs relate¶
flowchart TD
T["<b>test.yaml</b><br/>runner · input · expectations · checks · rubric"]
S["<b>suite.yaml</b><br/>cases × models"]
R["<b>run_profile.yaml</b><br/>temperature · max_tokens · repetitions"]
E["<b>evaluation_profile.yaml</b><br/>judge · aggregation · dimensions"]
W["<b>pae run-eval</b><br/>--suite · --run-profile · --evaluation-profile"]
A["<b>RunArtifact</b><br/>per model × case × run_N"]
F["<b>FinalEvaluationResult</b><br/>deterministic + judge score"]
P["<b>Report</b><br/>pae report"]
T -->|"selected by id / tag"| S
S -->|"--suite"| W
R -->|"--run-profile"| W
E -->|"--evaluation-profile"| W
W -->|"llm_probe or openclaw runner"| A
A -->|"deterministic + judge"| F
F --> P
classDef cfg fill:#E8EAF6,stroke:#5C6BC0,color:#1A237E
classDef orch fill:#DCE3FF,stroke:#5C6BC0,color:#1A237E
classDef art fill:#E8F5E9,stroke:#43A047,color:#1B5E20
class T,S,R,E cfg
class W orch
class A,F,P art
OpenClaw execution flow¶
When a case uses runner.type: openclaw, the agent definition and the openclaw: block in the run profile come into play. A suite can also override the default agent per case with openclaw.agent_assignments.
flowchart TD
TC["<b>test.yaml</b><br/>runner.type: openclaw<br/>input.messages or input.turns<br/>expectations · checks"]
RP["<b>run_profile.yaml</b><br/>default openclaw.agent_id<br/>openclaw.image · timeout"]
SA["<b>suite.yaml</b><br/>optional openclaw.agent_assignments"]
AC["<b>configs/agents/<agent_id>/</b><br/>agent.yaml<br/>workspace/ template"]
FP["<b>Fingerprint check</b><br/>SHA-256 of all inputs<br/>+ agent + workspace"]
WS["<b>Ephemeral workspace</b><br/>temp dir · workspace copied<br/>openclaw.json generated"]
DC["<b>docker run <image></b><br/>openclaw CLI inside container<br/>OPENROUTER_API_KEY forwarded"]
RA["<b>RunArtifact</b><br/>runner_metadata.openclaw<br/>workspace diff · logs · key outputs"]
DE["<b>Deterministic checks</b><br/>openclaw_workspace_file_present<br/>file_contains · etc."]
JU["<b>Judge</b><br/>compact subject view<br/>task + response + evidence"]
FE["<b>FinalEvaluationResult</b><br/>hybrid score 0–10"]
TC --> FP
RP --> FP
SA --> AC
AC --> FP
FP -->|"cache miss → execute"| WS
WS -->|"docker run"| DC
DC -->|"captures"| RA
RA --> DE
RA --> JU
DE --> FE
JU --> FE
classDef cfg fill:#E8EAF6,stroke:#5C6BC0,color:#1A237E
classDef orch fill:#DCE3FF,stroke:#5C6BC0,color:#1A237E
classDef art fill:#E8F5E9,stroke:#43A047,color:#1B5E20
classDef exec fill:#FFF8E1,stroke:#F9A825,color:#5D4037
class TC,RP,AC cfg
class FP,WS,DC exec
class RA,DE,JU,FE art
For single-turn OpenClaw cases, input.messages is rendered into one openclaw agent --message call. For multiturn cases, define input.turns; the runner invokes OpenClaw once per turn while reusing the same generated openclaw.json, workspace, OPENCLAW_STATE_DIR, and explicit --session-id.
llm_probe execution flow¶
For runner.type: llm_probe, the runner calls OpenRouter directly and manages a tool-use loop until the model produces a final response or reaches max_turns.
flowchart TD
TC["<b>test.yaml</b><br/>runner.type: llm_probe<br/>input.messages<br/>input.context.llm_probe.tools"]
RP["<b>run_profile.yaml</b><br/>temperature · max_tokens<br/>max_turns · retries"]
OR["<b>OpenRouter API</b><br/>model call with tool definitions"]
TL["<b>Tool execution loop</b><br/>exec_shell · read_file · write_file<br/>web_search · …"]
RA["<b>RunArtifact</b><br/>message trace · tool calls<br/>token usage · final output"]
DE["<b>Deterministic checks</b><br/>final_response_present<br/>file_contains · tool_call_count · etc."]
JU["<b>Judge</b><br/>compact subject view<br/>task + response + tool activity"]
FE["<b>FinalEvaluationResult</b><br/>hybrid score 0–10"]
TC --> OR
RP --> OR
OR -->|"tool_call → tool_result loop"| TL
TL --> OR
OR -->|"finish"| RA
RA --> DE
RA --> JU
DE --> FE
JU --> FE
classDef cfg fill:#E8EAF6,stroke:#5C6BC0,color:#1A237E
classDef exec fill:#FFF8E1,stroke:#F9A825,color:#5D4037
classDef art fill:#E8F5E9,stroke:#43A047,color:#1B5E20
class TC,RP cfg
class OR,TL exec
class RA,DE,JU,FE art
What each config controls¶
test.yaml — the atomic test case¶
Defines one scenario in full isolation. The same case can be included in multiple suites and run against multiple models without modification.
schema_version: 1
case_id: llm_probe_tool_example
title: "llm_probe tool example"
runner:
type: llm_probe # or: openclaw
input:
messages:
- role: user
content: |
Use real tools to create a file...
context:
llm_probe:
tools:
- exec_shell
- write_file
- read_file
expectations:
hard_expectations:
- text: Uses tools to obtain the content instead of inventing it.
soft_expectations:
- text: Response is brief and clearly confirms the final file content.
rubric:
version: 1
scale:
min: 0
max: 10
anchors:
"10": All required steps completed; artifacts present; confirmation clear.
"0": No attempt or irrelevant output.
criteria:
- name: Tool-grounded correctness
what_good_looks_like: Uses required tools and reports observed results.
what_bad_looks_like: Invents results or skips required steps.
deterministic_checks:
- check_id: final-response-present
dimensions: [task]
declarative:
kind: final_response_present
- check_id: file-written
dimensions: [process]
declarative:
kind: file_contains
path: /tmp/expected_output.txt
text: expected-marker
tags:
- example
- llm_probe
OpenClaw follow-up scenarios use turns:
runner:
type: openclaw
input:
messages:
- role: system
content: Keep context across turns.
turns:
- role: user
content: Create draft.md.
- role: user
content: Revise it and save report.md.
suite.yaml — the campaign scope¶
Lists which cases and which models form the benchmark.
schema_version: 1
suite_id: llm_probe_examples
title: "llm_probe runnable examples"
models:
- model_id: minimax_m27
requested_model: minimax/minimax-m2.7
label: minimax/minimax-m2.7
case_selection:
include_case_ids:
- llm_probe_tool_example
- llm_probe_browser_example
You can select cases by tag instead of (or in addition to) explicit IDs:
run_profile.yaml — execution policy¶
Controls how the runner calls the model. A fingerprint of the effective execution settings scopes campaign directories.
schema_version: 1
run_profile_id: llm_probe_examples
runner_defaults:
temperature: 0
timeout_seconds: 90
max_tokens: 768
max_turns: 6
retries: 0
execution_policy:
max_concurrency: 1
run_repetitions: 1
fail_fast: true
stop_on_runner_error: true
For OpenClaw, add the openclaw: block:
schema_version: 1
run_profile_id: openclaw_examples
openclaw:
agent_id: basic_agent
image: ghcr.io/openclaw/openclaw:2026.4.15
timeout_seconds: 300
execution_policy:
max_concurrency: 1
run_repetitions: 1
fail_fast: true
openclaw.agent_id is the default agent for OpenClaw cases. To use multiple agents in one suite, add suite-level assignments:
openclaw:
agent_assignments:
- agent_id: agent_1
case_selection:
include_case_ids: [case_a, case_b, case_c, case_d]
- agent_id: agent_2
case_selection:
include_case_ids: [case_e, case_f]
- agent_id: agent_3
case_selection:
include_case_ids: [case_g, case_h, case_i, case_j, case_k]
evaluation_profile.yaml — judge policy¶
Defines one or more LLM judges, how repeated judge runs aggregate, and security controls.
schema_version: 1
evaluation_profile_id: judge_gpt54
judges:
- judge_id: gpt54_judge
type: llm_probe
model: openai/gpt-5.4-mini
judge_runs:
- judge_run_id: gpt54_single
judge_id: gpt54_judge
repetitions: 1
aggregation:
method: median
security_policy:
allow_local_python_hooks: false
redact_secrets: true
final_score
final_score is the judge's holistic overall.score (0–10). Deterministic checks are preserved as supporting evidence for the judge and for debugging, but they do not compute the top-level score.
configs/agents/<agent_id>/ — reusable OpenClaw agent¶
configs/agents/basic_agent/
agent.yaml ← agent identity, model defaults, sandbox settings
workspace/
AGENTS.md ← workspace template (copied to every run)
SOUL.md
Campaign storage layout¶
outputs/
├── charts/
│ └── {evaluation_profile_id}/
│ └── score_cost.png
├── runs/
│ └── suit_{suite_id}/
│ └── run_profile_{fp6}/
│ └── {model_id}/
│ └── {case_id}/
│ ├── run_1.json
│ ├── run_1.fingerprint_input.json
│ └── run_2.json ← when run_repetitions > 1
└── evaluations/
└── suit_{suite_id}/
└── evaluation_profile_{fp6}/
└── eval_profile_{eval_id}_{fp6}/
└── {model_id}/
└── {case_id}/
├── evaluation_result_summary_1.md
├── judge_1.prompt.debug.md
└── raw_outputs/
├── final_result_1.json
├── judge_1.json
└── judge_1.prompt.user.json
fp6 is the first 6 characters of the SHA-256 fingerprint. → Fingerprints & reuse