Configuration Reference¶

Every YAML file must declare `schema_version: 1` at the top. The loader raises an explicit `ConfigError` on any schema violation; required fields have no silent defaults.

Quick reference¶

Config type	Default path	ID field
Test case	`configs/cases/<case_id>/test.yaml` or grouped as `configs/cases/<group>/<case_id>/test.yaml`	`case_id`
Suite	`configs/suites/<suite_id>.yaml`	`suite_id`
Run profile	`configs/run_profiles/<profile_id>.yaml`	`run_profile_id`
Evaluation profile	`configs/evaluation_profiles/<profile_id>.yaml`	`evaluation_profile_id`
OpenClaw agent	`configs/agents/<agent_id>/agent.yaml` + `workspace/`	`agent_id`

The CLI flags `--suite`, `--run-profile`, and `--evaluation-profile` accept either an explicit YAML path or a bare ID. A bare ID is resolved under the default path listed in the table above.

Test case (`test.yaml`)¶

Defines one atomic evaluation scenario. The same case can be included in multiple suites and evaluated against multiple models without modification.

Top-level fields¶

Field	Type	Required	Description
`schema_version`	`1`	yes	Must be `1`
`case_id`	slug	yes	Unique identifier; used as directory name in `outputs/`
`title`	string	yes	Human-readable label
`runner`	`RunnerConfig`	yes	Runner type and optional case-level overrides
`input`	`TestInput`	yes	Messages, attachments, and runner context
`expectations`	`Expectations`	no	Hard/soft expectation list for the judge
`rubric`	`Rubric`	no	Scored anchors and criteria shown to the judge
`deterministic_checks`	list	no	Checks run against the `RunArtifact` without an LLM
`tags`	list of strings	no	Free tags; used by `case_selection.include_tags` / `exclude_tags`
`metadata`	mapping	no	Arbitrary key/value; not used by the framework

`runner`¶

Field	Type	Required	Description
`type`	`"llm_probe"` or `"openclaw"`	yes	Selects the runner implementation

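Any extra fields (e.g. `temperature: 0`) are treated as case-level runner overrides. They take precedence over both `run_profile.runner_defaults` and `run_profile.model_overrides`, consistent with the merge order given for run profiles (`runner_defaults` → `model_overrides[model_id]` → case-level `runner:` fields, later wins).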
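As an illustration, a case that pins sampling settings might declare its overrides next to `type`. This is a minimal sketch: the values are arbitrary, and `temperature` and `max_turns` are borrowed from the recognized `llm_probe` fields listed under the run profile.

```yaml
# test.yaml (fragment)
runner:
  type: llm_probe
  temperature: 0   # case-level override; wins over run_profile.runner_defaults
  max_turns: 4     # illustrative value
```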

`input`¶

Field	Type	Default	Description
`messages`	list of `Message`	`[]`	Ordered message sequence sent to the model; for OpenClaw multiturn cases, optional initial context
`turns`	list of `Message`	`[]`	OpenClaw-only user turns executed as separate agent invocations in one session
`attachments`	list of paths	`[]`	Local files injected as extra user messages
`context`	mapping	`{}`	Runner-specific context: tools, openclaw hints, etc.

All relative paths in `input` resolve relative to the directory that contains `test.yaml`.

Message¶

Field	Type	Required	Description
`role`	`"system"` / `"user"` / `"assistant"` / `"tool"`	yes	Message role
`content`	string	one of `content` / `source`	Inline message text
`source`	`MessageSource`	one of `content` / `source`	External file reference
`name`	string	no	Optional name annotation

content and source are mutually exclusive.
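A short sketch of the two forms side by side, one message with inline `content` and one loaded through `source`. The file path reuses the one from the `MessageSource` sample; the message text is made up for illustration.

```yaml
input:
  messages:
    - role: system
      content: You are a careful assistant.   # inline text
    - role: user
      source:                                 # external file instead of content
        path: messages/user_prompt.yaml
```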

MessageSource¶

source:
  path: messages/user_prompt.yaml   # relative to test.yaml
  format: yaml                      # optional; auto-detected from extension

The referenced file may resolve to a single message object or a list of messages.
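The on-disk shape of the referenced file is not spelled out here, so the following is only a sketch under an assumption: that the file uses the same `role` / `content` fields as inline messages. In list form it might look like this (file name and text are hypothetical):

```yaml
# messages/user_prompt.yaml (hypothetical contents, list form)
- role: user
  content: Summarize the attached report.
- role: user
  content: Keep the summary under 100 words.
```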

OpenClaw multiturn input¶

For runner.type: openclaw, input.messages preserves the existing single-turn behavior: the messages are rendered into one OpenClaw --message invocation. To test follow-up user messages, use input.turns. Each turn is sent through openclaw agent --session-id <run-session> using the same ephemeral workspace and OPENCLAW_STATE_DIR.

input:
  messages:
    - role: system
      content: Keep context across user turns.
  turns:
    - role: user
      content: Create draft.md.
    - role: user
      content: Revise draft.md and create report.md.
  context:
    openclaw:
      expected_artifact: report.md

When turns is present, messages is treated as initial context and included with the first turn only. Final workspace checks run after the last successful turn, and the raw session trace records all turn payloads.

Attachments¶

Files in input.attachments are injected as additional user messages just before the first user message:

Attached context file: <filename>

--- BEGIN ATTACHMENT <filename> ---
<file content>
--- END ATTACHMENT <filename> ---

Attachments are content-addressed in the fingerprint (SHA-256 + byte size), not path-addressed.
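For orientation, a case with a single attachment could be declared as below. The file name is hypothetical; because the fingerprint is content-addressed, moving the file to another path without changing its bytes should leave the fingerprint unchanged.

```yaml
input:
  messages:
    - role: user
      content: Summarize the attached notes.
  attachments:
    - fixtures/notes.md   # hypothetical file, resolved relative to test.yaml
```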

`input.context.llm_probe`¶

Field	Type	Description
`tools`	list of strings	Tools to expose to the model: `exec_shell`, `write_file`, `read_file`, `web_search`, etc.
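A minimal sketch of an `llm_probe` case that exposes two of the tools named above. The prompt text is illustrative only.

```yaml
input:
  messages:
    - role: user
      content: Write hello to out.txt, then read it back.
  context:
    llm_probe:
      tools: [write_file, read_file]
```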

`input.context.openclaw`¶

Field	Type	Description
`expected_artifact`	string	Hint for the harness: the filename of the expected output in the workspace

`expectations`¶

Shown to the judge as part of the evaluation target. Hard expectations are treated as critical requirements; soft ones are scored with partial credit.

Field	Type	Description
`hard_expectations`	list of `Expectation`	Requirements the judge must treat as critical
`soft_expectations`	list of `Expectation`	Requirements scored with partial credit

Expectation¶

Field	Type	Required	Description
`text`	string	yes	Natural-language expectation statement
`weight`	float	no (default `1.0`)	Relative weight within the group
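The sketch below combines both groups and shows `weight` in use. The expectation texts are invented for illustration; only the field names come from the tables above.

```yaml
expectations:
  hard_expectations:
    - text: Creates report.md in the workspace.
  soft_expectations:
    - text: Keeps the confirmation to a few lines.
      weight: 2.0   # relative weight within soft_expectations
    - text: Mentions which tools were used.
      # weight omitted, so the default 1.0 applies
```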

`rubric`¶

An optional structured rubric shown to the judge to calibrate scoring. It replaces free-form judge guessing with explicit anchors and criteria.

rubric:
  version: 1
  scale:
    min: 0
    max: 10
    anchors:
      "10": All required steps completed correctly and concisely.
      "7": Mostly correct with minor clarity issues.
      "4": Partial completion; missing a required step.
      "0": No attempt, irrelevant output, or empty.
  criteria:
    - name: Tool-grounded correctness
      what_good_looks_like: Uses required tools and reports observed results.
      what_bad_looks_like: Invents results or skips required steps.
    - name: Concise confirmation
      what_good_looks_like: Confirms actions in 2–4 lines.
      what_bad_looks_like: Overly verbose or unclear.
  scoring_instructions: >
    Use this rubric to set overall.score. Cap at ≤ 4 if a hard
    expectation or deterministic check fails.

`deterministic_checks`¶

Each entry is a DeterministicCheck:

Field	Type	Required	Description
`check_id`	slug	yes	Unique within the case
`dimensions`	list of dimension names	no	Dimensions this check affects in aggregation
`declarative`	`DeclarativeCheck`	one of `declarative` / `python_hook`	Built-in check specification
`python_hook`	`PythonHook`	one of `declarative` / `python_hook`	Custom Python callable

declarative and python_hook are mutually exclusive.

Valid dimension names: task, process, autonomy, closeness, efficiency, spark.

Declarative check kinds¶

`kind`	Extra fields	Description
`final_response_present`	—	Non-empty final output in the trace (also checks last assistant message or workspace output for openclaw)
`tool_call_count`	`count` (int)	Exact tool call count
`file_exists`	`path`	File exists on host filesystem
`file_contains`	`path`, `text`	File exists and contains substring
`path_exists`	`path`	Path exists (file or directory)
`status_is`	`status`	Terminal run status matches
`output_artifact_present`	`artifact_type`?	Run artifact records a matching output artifact
`openclaw_workspace_file_present`	`relative_path`, `contains`?, `contains_all`?, `contains_any`?	Workspace diff contains the file (OpenClaw only); `contains_all` / `contains_any` use normalized case/accent-insensitive matching

Paths resolve relative to test.yaml.

For openclaw_workspace_file_present, use either legacy contains or the normalized contains_all / contains_any fields. The loader rejects checks that mix both styles.
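As a sketch of how the extra fields might be written, the example below assumes they sit next to `kind` inside `declarative` (as `kind` does in the minimal `test.yaml`) and that `contains_all` takes a list of strings. Check IDs and search terms are hypothetical.

```yaml
deterministic_checks:
  - check_id: exactly-two-tool-calls
    dimensions: [process]
    declarative:
      kind: tool_call_count
      count: 2
  - check_id: report-written
    dimensions: [task]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md
      contains_all: [summary, conclusion]   # normalized matching; do not combine with contains
```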

PythonHook¶

python_hook:
  path: hooks/my_check.py        # relative to test.yaml (mutually exclusive with import_path)
  # import_path: my_pkg.checks   # dotted import path (mutually exclusive with path)
  callable_name: check_output

Warning

Python hooks are disabled by default. Enable them with security_policy.allow_local_python_hooks: true in the evaluation profile.
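In the evaluation profile, opting in looks roughly like the fragment below. It uses only the `security_policy` fields documented for evaluation profiles; `network_access` is shown at its default.

```yaml
# evaluation profile (fragment)
security_policy:
  allow_local_python_hooks: true   # default is false
  network_access: deny             # default; applies to hook execution
```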

Minimal `test.yaml`¶

schema_version: 1
case_id: my_case
title: My case
runner:
  type: llm_probe
input:
  messages:
    - role: user
      content: What is 2 + 2?
expectations:
  hard_expectations:
    - text: Answers with 4.
deterministic_checks:
  - check_id: response-present
    dimensions: [task]
    declarative:
      kind: final_response_present

Suite (`suite.yaml`)¶

Groups models with a case selection policy to form a benchmark campaign.

Top-level fields¶

Field	Type	Required	Description
`schema_version`	`1`	yes	Must be `1`
`suite_id`	slug	yes	Unique identifier
`title`	string	yes	Human-readable label
`models`	list of `ModelConfig`	no	Models to run against the selected cases
`case_selection`	`CaseSelection`	no	Filters determining which cases are included
`openclaw`	`SuiteOpenClawConfig`	no	Optional per-case OpenClaw agent assignments
`metadata`	mapping	no	Arbitrary annotation bag

`models` — ModelConfig¶

Field	Type	Required	Description
`model_id`	slug	yes	Local name used in artifact paths and reports
`label`	string	no	Human-readable display name
`requested_model`	string	no	OpenRouter model string (e.g. `openai/gpt-4o-mini`)

Any extra fields are passed through to the runner. The llm_probe runner resolves the model in this priority order: requested_model → openrouter_model → provider + model_name → model_id.

For OpenClaw campaigns, extra model fields can also carry OpenRouter model parameters. The current benchmark uses primary_params.reasoning.effort to control GPT-5 reasoning on the primary model, for example:

models:
  - model_id: gpt55
    requested_model: openai/gpt-5.5
    primary_params:
      reasoning:
        effort: medium

`case_selection`¶

Field	Type	Description
`include_case_ids`	list of strings	Explicit case IDs to include
`exclude_case_ids`	list of strings	Explicit case IDs to exclude
`include_tags`	list of strings	Include cases that have at least one of these tags
`exclude_tags`	list of strings	Exclude cases that have at least one of these tags

Precedence: include_case_ids > tag filters > exclude_case_ids. Unknown case IDs in include_case_ids are a hard error.

`openclaw.agent_assignments`¶

Use this when one suite should run different OpenClaw cases with different reusable agents. Each assignment selects cases by ID and/or tag and points them at configs/agents/<agent_id>/. Cases that match no assignment use run_profile.openclaw.agent_id as the default. A case that matches more than one assignment is rejected before execution.

openclaw:
  agent_assignments:
    - agent_id: agent_1
      case_selection:
        include_case_ids: [case_a, case_b, case_c, case_d]
    - agent_id: agent_2
      case_selection:
        include_case_ids: [case_e, case_f]
    - agent_id: agent_3
      case_selection:
        include_tags: [long_context]
        exclude_case_ids: [case_f]

Example¶

schema_version: 1
suite_id: my_suite
title: My benchmark suite
models:
  - model_id: gpt4o_mini
    requested_model: openai/gpt-4o-mini
    label: GPT-4o mini
  - model_id: minimax_m27
    requested_model: minimax/minimax-m2.7
case_selection:
  include_tags: [smoke]
  exclude_case_ids: [known_flaky_case]
openclaw:
  agent_assignments:
    - agent_id: support_agent
      case_selection:
        include_tags: [support_agent]

Run profile (`run_profile.yaml`)¶

Controls execution behavior. A SHA-256 fingerprint of the effective settings scopes campaign storage directories — changing any execution parameter produces a new fingerprint and a new directory.

Top-level fields¶

Field	Type	Required	Description
`schema_version`	`1`	yes	Must be `1`
`run_profile_id`	slug	yes	Unique identifier
`title`	string	yes	Human-readable label
`runner_defaults`	mapping	no	Default runner settings applied to every case
`model_overrides`	mapping	no	Per-model setting overrides (key = `model_id`)
`execution_policy`	`ExecutionPolicy`	no	Concurrency and error-handling controls
`openclaw`	`OpenClawRunProfile`	no	OpenClaw runtime block (agent, image, timeout)

`runner_defaults` / `model_overrides`¶

Merge order (later wins): runner_defaults → model_overrides[model_id] → case-level runner: fields.
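A worked example of that merge order, with illustrative values. The first document is a run profile fragment, the second a test case fragment; the trailing comments list the settings that result from applying the three layers in order.

```yaml
# run profile (fragment)
runner_defaults:
  temperature: 0
  max_tokens: 1024
model_overrides:
  big_model:
    max_tokens: 4096
---
# test.yaml (fragment)
runner:
  type: llm_probe
  temperature: 0.7
# Effective settings for model_id big_model on this case:
#   temperature: 0.7   (case-level field overrides runner_defaults)
#   max_tokens: 4096   (model_overrides overrides runner_defaults)
# Effective settings for any other model on this case:
#   temperature: 0.7
#   max_tokens: 1024
```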

Recognized llm_probe fields:

Field	Type	Default	Description
`temperature`	float 0–2	provider default	Sampling temperature
`top_p`	float 0–1	provider default	Nucleus sampling
`max_tokens`	int	provider default	Max tokens to generate
`seed`	int	null	Reproducibility seed
`timeout_seconds`	int	`30`	Per-request wall-clock timeout
`retries`	int	`5`	Retry attempts on transient failures
`max_turns`	int	`8`	Max tool-use turns before forcing a final response

`execution_policy`¶

Field	Type	Default	Description
`max_concurrency`	int ≥ 1	`1`	Parallel case executions
`run_repetitions`	int ≥ 1	`1`	Runs per `(model, case)`; each gets a distinct fingerprint
`fail_fast`	bool	`false`	Stop after the first case failure
`stop_on_runner_error`	bool	`true`	Stop after the first unrecoverable runner error

`openclaw` block¶

Field	Type	Required	Description
`agent_id`	slug	yes	Resolves `configs/agents/<agent_id>/`
`image`	string	yes	Pinned OCI image for the `openclaw` CLI
`timeout_seconds`	int	yes	Wall-clock timeout for the container run
`docker_cli`	string	no (default `docker`)	OCI runtime CLI (e.g. `podman`)
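A sketch of the block inside a run profile. The image reference and timeout are placeholders, not values shipped with the framework; substitute a pinned image of your own.

```yaml
openclaw:
  agent_id: support_agent                  # resolves configs/agents/support_agent/
  image: example.registry/openclaw:1.0.0   # hypothetical image reference
  timeout_seconds: 600                     # illustrative value
  docker_cli: podman                       # optional; defaults to docker
```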

OpenClaw model routing

The suite model is mapped to an OpenRouter ref (openrouter/<provider>/<model>) and injected as agents.defaults.model.primary in the generated openclaw.json. Fallbacks are rejected: a benchmark run must execute against exactly one model.

Example¶

schema_version: 1
run_profile_id: standard_run
title: Standard run profile
runner_defaults:
  temperature: 0
  max_tokens: 1024
  timeout_seconds: 60
  retries: 2
  max_turns: 8
model_overrides:
  big_model:
    max_tokens: 4096
    timeout_seconds: 120
execution_policy:
  max_concurrency: 2
  run_repetitions: 3
  fail_fast: false
  stop_on_runner_error: true

Evaluation profile (`evaluation_profile.yaml`)¶

Defines the judges, their repetition plans, aggregation policies, and security controls.

Top-level fields¶

Field	Type	Required	Description
`schema_version`	`1`	yes	Must be `1`
`evaluation_profile_id`	slug	yes	Unique identifier
`title`	string	yes	Human-readable label
`judges`	list of `JudgeConfig`	no	Named judge definitions
`judge_runs`	list of `JudgeRunConfig`	no	Execution plans referencing a judge
`aggregation`	`JudgeAggregationConfig`	no	How to aggregate judge iterations
`anchors`	`AnchorsConfig`	no	Calibration anchors for the judge prompt
`security_policy`	`SecurityPolicy`	no	Execution security controls
`judge_system_prompt_path`	path	no	Path to system prompt file, relative to this YAML
`judge_system_prompt`	string	no	Inline system prompt text

judge_system_prompt_path and judge_system_prompt are mutually exclusive.

`judges` — JudgeConfig¶

Field	Type	Required	Description
`judge_id`	slug	yes	Logical name referenced by `judge_runs`
`type`	string	yes	Judge backend (`"llm_probe"`)
`model`	string	no	Model to call (e.g. `"openai/gpt-5.4-mini"`)

Any extra judge fields are passed through as OpenRouter request_options. Use this for request-level controls such as temperature or reasoning.effort:

judges:
  - judge_id: gpt54_fast_judge
    type: llm_probe
    model: openai/gpt-5.4
    request_options:
      temperature: 0.0
      reasoning:
        effort: none

`judge_runs` — JudgeRunConfig¶

Field	Type	Default	Description
`judge_run_id`	slug	—	Unique run identifier
`judge_id`	slug	—	Must match a declared judge
`repetitions`	int ≥ 1	`1`	How many times to call the judge per case
`sample_size`	int or null	`null`	Subset of repetitions used for aggregation
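For illustration, a run plan that calls one judge five times per case might read as follows. The IDs and numbers are hypothetical.

```yaml
judge_runs:
  - judge_run_id: main_run
    judge_id: main_judge   # must match a judge_id declared under judges
    repetitions: 5
    sample_size: 3         # illustrative; null (the default) is also valid
```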

`aggregation` — how iterations combine¶

Field	Type	Default	Description
`method`	`"median"` / `"mean"` / `"majority_vote"` / `"all_pass"`	`"median"`	Aggregation across successful iterations
`pass_threshold`	float or null	null	Score threshold below which a dimension is considered failed
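A small sketch with the arithmetic spelled out in comments. The threshold and the scores are made-up numbers, chosen only to show how `median` and `mean` can differ on the same iterations.

```yaml
aggregation:
  method: median
  pass_threshold: 6.0   # illustrative value
# With three successful iterations scoring 6, 8 and 9, the median is 8.
# A mean over the same scores would be (6 + 8 + 9) / 3 ≈ 7.67.
```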

final_score

final_score comes from judge_overall.score. Deterministic checks are informative evidence for the judge and for debugging, but there is no separate weighting policy for the final score.

`anchors` — calibration examples¶

anchors:
  enabled: true
  references:
    - anchor_id: perfect_tool_use
      label: "Perfect tool-use chain"
      text: "Used all three tools in order, file contents exact, confirmation clear."

When enabled: true, anchor texts are injected into the judge prompt to help calibrate scoring.

`security_policy`¶

Field	Type	Default	Description
`allow_local_python_hooks`	bool	`false`	Allow Python hook files in test cases to execute
`network_access`	`"deny"` / `"allow"`	`"deny"`	Network access for hook execution
`redact_secrets`	bool	`true`	Strip known secret patterns from artifact payloads

Full example¶

schema_version: 1
evaluation_profile_id: judge_gpt4o_mini
title: Judge with GPT-4o mini (3 repetitions)
judge_system_prompt_path: prompts/judge_system_default.md
judges:
  - judge_id: main_judge
    type: llm_probe
    model: openai/gpt-4o-mini
judge_runs:
  - judge_run_id: main_run
    judge_id: main_judge
    repetitions: 3
aggregation:
  method: median
anchors:
  enabled: false
security_policy:
  allow_local_python_hooks: false
  network_access: deny
  redact_secrets: true

OpenClaw agent (`configs/agents/<agent_id>/`)¶

A directory-based config surface for OpenClaw benchmarks:

configs/agents/<agent_id>/
  agent.yaml
  workspace/
    AGENTS.md
    SOUL.md
    ...           ← any workspace template files

`agent.yaml` top-level fields¶

Field	Type	Required	Description
`schema_version`	`1`	yes	Must be `1`
`agent_id`	slug	yes	Must match the directory name
`title`	string	yes	Human-readable label
`description`	string	no	Optional summary
`tags`	list	no	Free tags
`openclaw`	`OpenClawFragments`	no	Fragments merged into generated `openclaw.json`

`openclaw` fragments¶

Field	Description
`identity`	Not written to `openclaw.json` (fails strict validation); keep persona in workspace files instead
`agents_defaults`	Merged into `agents.defaults`
`agent`	Used to build `agents.list[0]` (`id`, `prompt` → `systemPromptOverride`)
`model_defaults`	`aliases` mapped to `agents.defaults.models[<primary>].alias`; `fallbacks` are rejected

openclaw.agent.id is used as the agent ID passed to openclaw agent --agent. It can differ from agent_id (e.g. agent_id: support_agent with openclaw.agent.id: support-agent).
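Putting the pieces together, a small `agent.yaml` could look like the sketch below. It reuses the `support_agent` / `support-agent` pair from the text; the title and prompt are invented for illustration.

```yaml
schema_version: 1
agent_id: support_agent   # must match the directory name configs/agents/support_agent/
title: Support agent
openclaw:
  agent:
    id: support-agent     # passed to openclaw agent --agent
    prompt: Answer support questions politely.   # mapped to systemPromptOverride
```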

Workspace contract¶

- `workspace/` must exist beside `agent.yaml`.
- The harness copies it into an ephemeral temp directory before each run.
- Missing standard files (`AGENTS.md`, `IDENTITY.md`, `SOUL.md`, `TOOLS.md`, `USER.md`) are filled with deterministic placeholder content.
- The agent fingerprint covers the SHA-256 of every file in `workspace/`; changing any workspace file invalidates all stored runs for that agent.
