Configuration Reference

All YAML files must declare schema_version: 1 at the top. The loader raises an explicit ConfigError on any schema violation — there are no silent defaults for required fields.
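The fail-loudly rule above can be sketched as follows, assuming the YAML has already been parsed into a plain dict. ConfigError is the name used in this reference; the helper name validate_top_level and its required argument are hypothetical and only illustrate the idea of rejecting missing fields rather than defaulting them.

```python
class ConfigError(Exception):
    """Raised on any schema violation (name taken from this reference)."""


def validate_top_level(doc: dict, required: tuple) -> dict:
    """Hypothetical helper: reject missing fields instead of silently defaulting them."""
    if doc.get("schema_version") != 1:
        raise ConfigError("schema_version must be 1")
    missing = [name for name in required if name not in doc]
    if missing:
        raise ConfigError("missing required field(s): " + ", ".join(missing))
    return doc
```

A caller would pass the required field names for the config type at hand, for example ("case_id", "title", "runner", "input") for a test case.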


Quick reference

| Config type | Default path | ID field |
| --- | --- | --- |
| Test case | configs/cases/<case_id>/test.yaml, or grouped as configs/cases/<group>/<case_id>/test.yaml | case_id |
| Suite | configs/suites/<suite_id>.yaml | suite_id |
| Run profile | configs/run_profiles/<profile_id>.yaml | run_profile_id |
| Evaluation profile | configs/evaluation_profiles/<profile_id>.yaml | evaluation_profile_id |
| OpenClaw agent | configs/agents/<agent_id>/agent.yaml + workspace/ | agent_id |

CLI flags (--suite, --run-profile, --evaluation-profile) accept either an explicit YAML path or just the plain ID, which is resolved automatically under the conventional directory.
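A minimal sketch of the ID-or-path resolution, using the conventional directories from the quick reference. How the real CLI tells a path from an ID is not stated here; this sketch assumes a value is a path if it ends in .yaml/.yml or contains a slash. The function name resolve_config_arg is hypothetical.

```python
from pathlib import Path

# Conventional directories, taken from the quick reference above.
CONVENTIONAL_DIRS = {
    "suite": Path("configs/suites"),
    "run-profile": Path("configs/run_profiles"),
    "evaluation-profile": Path("configs/evaluation_profiles"),
}


def resolve_config_arg(kind: str, value: str) -> Path:
    """Hypothetical resolver: explicit YAML path, or plain ID under the conventional dir."""
    # Assumption: anything with a YAML extension or a path separator is an explicit path.
    if value.endswith((".yaml", ".yml")) or "/" in value:
        return Path(value)
    return CONVENTIONAL_DIRS[kind] / f"{value}.yaml"
```

Under these assumptions, --suite my_suite and --suite configs/suites/my_suite.yaml would point at the same file.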


Test case (test.yaml)

Defines one atomic evaluation scenario. The same case can be included in multiple suites and evaluated against multiple models without modification.

Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_version | 1 | yes | Must be 1 |
| case_id | slug | yes | Unique identifier; used as directory name in outputs/ |
| title | string | yes | Human-readable label |
| runner | RunnerConfig | yes | Runner type and optional case-level overrides |
| input | TestInput | yes | Messages, attachments, and runner context |
| expectations | Expectations | no | Hard/soft expectation list for the judge |
| rubric | Rubric | no | Scored anchors and criteria shown to the judge |
| deterministic_checks | list | no | Checks run against the RunArtifact without an LLM |
| tags | list of strings | no | Free tags; used by case_selection.include_tags / exclude_tags |
| metadata | mapping | no | Arbitrary key/value; not used by the framework |

runner

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| type | "llm_probe" or "openclaw" | yes | Selects the runner implementation |

Any extra fields (e.g. temperature: 0) are treated as case-level runner overrides and take precedence over run_profile.runner_defaults but are overridden by run_profile.model_overrides.


input

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| messages | list of Message | [] | Ordered message sequence sent to the model; for OpenClaw multiturn cases, optional initial context |
| turns | list of Message | [] | OpenClaw-only user turns executed as separate agent invocations in one session |
| attachments | list of paths | [] | Local files injected as extra user messages |
| context | mapping | {} | Runner-specific context: tools, openclaw hints, etc. |

All relative paths in input resolve relative to the test.yaml file.

Message

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| role | "system" / "user" / "assistant" / "tool" | yes | Message role |
| content | string | one of content / source | Inline message text |
| source | MessageSource | one of content / source | External file reference |
| name | string | no | Optional name annotation |

content and source are mutually exclusive.

MessageSource

source:
  path: messages/user_prompt.yaml   # relative to test.yaml
  format: yaml                      # optional; auto-detected from extension

The referenced file may resolve to a single message object or a list of messages.

OpenClaw multiturn input

For runner.type: openclaw, input.messages preserves the existing single-turn behavior: the messages are rendered into one OpenClaw --message invocation. To test follow-up user messages, use input.turns. Each turn is sent through openclaw agent --session-id <run-session> using the same ephemeral workspace and OPENCLAW_STATE_DIR.

input:
  messages:
    - role: system
      content: Keep context across user turns.
  turns:
    - role: user
      content: Create draft.md.
    - role: user
      content: Revise draft.md and create report.md.
  context:
    openclaw:
      expected_artifact: report.md

When turns is present, messages is treated as initial context and included with the first turn only. Final workspace checks run after the last successful turn, and the raw session trace records all turn payloads.
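The rule that messages accompany only the first turn can be sketched as a small planner. This is an illustration under assumptions: each payload is represented as a list of message dicts, and the way the harness actually renders a payload into an openclaw invocation is not shown. The name plan_turn_payloads is hypothetical.

```python
def plan_turn_payloads(messages: list, turns: list) -> list:
    """Hypothetical planner: one payload (list of message dicts) per agent invocation."""
    if not turns:
        # Single-turn behavior: all messages go into one invocation.
        return [list(messages)] if messages else []
    payloads = []
    for index, turn in enumerate(turns):
        # Initial context is attached to the first turn only.
        context = list(messages) if index == 0 else []
        payloads.append(context + [turn])
    return payloads
```

For the YAML example above, this yields two payloads: the system message plus "Create draft.md." for the first invocation, and only the revision request for the second.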

Attachments

Files in input.attachments are injected as additional user messages just before the first user message:

Attached context file: <filename>

--- BEGIN ATTACHMENT <filename> ---
<file content>
--- END ATTACHMENT <filename> ---

Attachments are content-addressed in the fingerprint (SHA-256 + byte size), not path-addressed.
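A minimal sketch of both ideas, assuming the template is reproduced with the exact line breaks shown above and that the content address is simply the SHA-256 hex digest paired with the byte length. Both function names are hypothetical.

```python
import hashlib


def render_attachment(filename: str, content: str) -> str:
    """Build the injected user-message text from the template shown above."""
    return (
        f"Attached context file: {filename}\n\n"
        f"--- BEGIN ATTACHMENT {filename} ---\n"
        f"{content}\n"
        f"--- END ATTACHMENT {filename} ---"
    )


def attachment_address(data: bytes) -> tuple:
    """Content address: (SHA-256 hex digest, byte size). The file path plays no part."""
    return hashlib.sha256(data).hexdigest(), len(data)
```

Because only the bytes are hashed, renaming or moving an attachment leaves its address unchanged, while editing a single byte changes it.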

input.context.llm_probe

| Field | Type | Description |
| --- | --- | --- |
| tools | list of strings | Tools to expose to the model: exec_shell, write_file, read_file, web_search, etc. |

input.context.openclaw

| Field | Type | Description |
| --- | --- | --- |
| expected_artifact | string | Hint for the harness: the filename of the expected output in the workspace |

expectations

Shown to the judge as part of the evaluation target. Hard expectations are treated as critical requirements; soft ones are scored with partial credit.

| Field | Type | Description |
| --- | --- | --- |
| hard_expectations | list of Expectation | Requirements the judge must treat as critical |
| soft_expectations | list of Expectation | Requirements scored with partial credit |

Expectation

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | yes | Natural-language expectation statement |
| weight | float | no (default 1.0) | Relative weight within the group |

rubric

An optional structured rubric shown to the judge to calibrate scoring. It replaces free-form judge guessing with explicit anchors and criteria.

rubric:
  version: 1
  scale:
    min: 0
    max: 10
    anchors:
      "10": All required steps completed correctly and concisely.
      "7": Mostly correct with minor clarity issues.
      "4": Partial completion; missing a required step.
      "0": No attempt, irrelevant output, or empty.
  criteria:
    - name: Tool-grounded correctness
      what_good_looks_like: Uses required tools and reports observed results.
      what_bad_looks_like: Invents results or skips required steps.
    - name: Concise confirmation
      what_good_looks_like: Confirms actions in 2–4 lines.
      what_bad_looks_like: Overly verbose or unclear.
  scoring_instructions: >
    Use this rubric to set overall.score. Cap at ≤ 4 if a hard
    expectation or deterministic check fails.

deterministic_checks

Each entry is a DeterministicCheck:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| check_id | slug | yes | Unique within the case |
| dimensions | list of dimension names | no | Dimensions this check affects in aggregation |
| declarative | DeclarativeCheck | one of declarative / python_hook | Built-in check specification |
| python_hook | PythonHook | one of declarative / python_hook | Custom Python callable |

declarative and python_hook are mutually exclusive.

Valid dimension names: task, process, autonomy, closeness, efficiency, spark.

Declarative check kinds

| kind | Extra fields | Description |
| --- | --- | --- |
| final_response_present | none | Non-empty final output in the trace (also checks last assistant message or workspace output for openclaw) |
| tool_call_count | count (int) | Exact tool call count |
| file_exists | path | File exists on host filesystem |
| file_contains | path, text | File exists and contains substring |
| path_exists | path | Path exists (file or directory) |
| status_is | status | Terminal run status matches |
| output_artifact_present | artifact_type? | Run artifact records a matching output artifact |
| openclaw_workspace_file_present | relative_path, contains?, contains_all?, contains_any? | Workspace diff contains the file (OpenClaw only); contains_all / contains_any use normalized case/accent-insensitive matching |

Paths resolve relative to test.yaml.

For openclaw_workspace_file_present, use either legacy contains or the normalized contains_all / contains_any fields. The loader rejects checks that mix both styles.
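One way to picture the two matching styles, as a sketch under assumptions: normalization is taken to mean NFKD decomposition, dropping combining marks, and casefolding; legacy contains is taken to be an exact substring test; and ValueError stands in for the loader's rejection. The real normalization rules may differ. Both function names are hypothetical.

```python
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: strip accents (NFKD, drop combining marks) and casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).casefold()


def file_matches(content: str, check: dict) -> bool:
    """Hypothetical matcher for the contains / contains_all / contains_any fields."""
    legacy = "contains" in check
    normalized = "contains_all" in check or "contains_any" in check
    if legacy and normalized:
        # Stand-in for the loader rejecting a check that mixes both styles.
        raise ValueError("use either contains or contains_all / contains_any, not both")
    if legacy:
        return check["contains"] in content  # assumed: exact substring
    haystack = normalize(content)
    all_ok = all(normalize(term) in haystack for term in check.get("contains_all", []))
    any_terms = check.get("contains_any", [])
    any_ok = not any_terms or any(normalize(term) in haystack for term in any_terms)
    return all_ok and any_ok
```

With this normalization, a check for "termine" would match a file containing "TERMINÉ", which the legacy substring test would not.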

PythonHook

python_hook:
  path: hooks/my_check.py        # relative to test.yaml (mutually exclusive with import_path)
  # import_path: my_pkg.checks   # dotted import path (mutually exclusive with path)
  callable_name: check_output

Warning

Python hooks are disabled by default. Enable them with security_policy.allow_local_python_hooks: true in the evaluation profile.


Minimal test.yaml

schema_version: 1
case_id: my_case
title: My case
runner:
  type: llm_probe
input:
  messages:
    - role: user
      content: What is 2 + 2?
expectations:
  hard_expectations:
    - text: Answers with 4.
deterministic_checks:
  - check_id: response-present
    dimensions: [task]
    declarative:
      kind: final_response_present

Suite (suite.yaml)

Groups models with a case selection policy to form a benchmark campaign.

Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_version | 1 | yes | Must be 1 |
| suite_id | slug | yes | Unique identifier |
| title | string | yes | Human-readable label |
| models | list of ModelConfig | no | Models to run against the selected cases |
| case_selection | CaseSelection | no | Filters determining which cases are included |
| openclaw | SuiteOpenClawConfig | no | Optional per-case OpenClaw agent assignments |
| metadata | mapping | no | Arbitrary annotation bag |

models — ModelConfig

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| model_id | slug | yes | Local name used in artifact paths and reports |
| label | string | no | Human-readable display name |
| requested_model | string | no | OpenRouter model string (e.g. openai/gpt-4o-mini) |

Any extra fields are passed through to the runner. The llm_probe runner resolves the model in this priority order: requested_model → openrouter_model → provider + model_name → model_id.
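The priority order can be sketched as a simple fall-through. How provider and model_name are combined is not specified here; this sketch assumes they are joined with a slash, and the function name resolve_requested_model is hypothetical.

```python
def resolve_requested_model(model: dict) -> str:
    """Hypothetical resolver following the stated priority order."""
    if model.get("requested_model"):
        return model["requested_model"]
    if model.get("openrouter_model"):
        return model["openrouter_model"]
    if model.get("provider") and model.get("model_name"):
        # Assumption: provider and model_name are joined as "<provider>/<model_name>".
        return f"{model['provider']}/{model['model_name']}"
    return model["model_id"]
```

In this sketch model_id is only a last resort, so a suite entry that sets nothing but model_id would send that local name to the provider as-is.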

For OpenClaw campaigns, extra model fields can also carry OpenRouter model parameters. The current benchmark uses primary_params.reasoning.effort to control GPT-5 reasoning on the primary model, for example:

models:
  - model_id: gpt55
    requested_model: openai/gpt-5.5
    primary_params:
      reasoning:
        effort: medium

case_selection

| Field | Type | Description |
| --- | --- | --- |
| include_case_ids | list of strings | Explicit case IDs to include |
| exclude_case_ids | list of strings | Explicit case IDs to exclude |
| include_tags | list of strings | Include cases that have at least one of these tags |
| exclude_tags | list of strings | Exclude cases that have at least one of these tags |

Precedence: include_case_ids > tag filters > exclude_case_ids. Unknown case IDs in include_case_ids are a hard error.

openclaw.agent_assignments

Use this when one suite should run different OpenClaw cases with different reusable agents. Each assignment selects cases by ID and/or tag and points them at configs/agents/<agent_id>/. Cases that match no assignment use run_profile.openclaw.agent_id as the default. A case that matches more than one assignment is rejected before execution.

openclaw:
  agent_assignments:
    - agent_id: agent_1
      case_selection:
        include_case_ids: [case_a, case_b, case_c, case_d]
    - agent_id: agent_2
      case_selection:
        include_case_ids: [case_e, case_f]
    - agent_id: agent_3
      case_selection:
        include_tags: [long_context]
        exclude_case_ids: [case_f]
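The assignment rules (no match falls back to the default, more than one match is rejected) can be sketched as below. The matching logic inside an assignment is a simplifying assumption: a case matches if it is included by ID or by tag and is not excluded by ID or by tag. ValueError stands in for the pre-execution rejection, and both function names are hypothetical.

```python
def assignment_matches(case_id: str, tags: set, selection: dict) -> bool:
    """Assumed matching: included by ID or tag, and not excluded by ID or tag."""
    if case_id in selection.get("exclude_case_ids", []):
        return False
    if tags & set(selection.get("exclude_tags", [])):
        return False
    by_id = case_id in selection.get("include_case_ids", [])
    by_tag = bool(tags & set(selection.get("include_tags", [])))
    return by_id or by_tag


def resolve_agent(case_id: str, tags: set, assignments: list, default_agent_id: str) -> str:
    """Return the agent for a case; fall back to the run-profile default."""
    matched = [
        a["agent_id"]
        for a in assignments
        if assignment_matches(case_id, tags, a.get("case_selection", {}))
    ]
    if len(matched) > 1:
        raise ValueError(f"case {case_id!r} matches several assignments: {matched}")
    return matched[0] if matched else default_agent_id
```

Under these assumptions, the exclude_case_ids entry on agent_3 in the example above is what keeps a long_context-tagged case_f from matching both agent_2 and agent_3.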

Example

schema_version: 1
suite_id: my_suite
title: My benchmark suite
models:
  - model_id: gpt4o_mini
    requested_model: openai/gpt-4o-mini
    label: GPT-4o mini
  - model_id: minimax_m27
    requested_model: minimax/minimax-m2.7
case_selection:
  include_tags: [smoke]
  exclude_case_ids: [known_flaky_case]
openclaw:
  agent_assignments:
    - agent_id: support_agent
      case_selection:
        include_tags: [support_agent]

Run profile (run_profile.yaml)

Controls execution behavior. A SHA-256 fingerprint of the effective settings scopes campaign storage directories — changing any execution parameter produces a new fingerprint and a new directory.
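A minimal sketch of such a fingerprint, assuming the effective settings are serialized as canonical JSON (sorted keys, compact separators) before hashing. The actual canonicalization used by the framework is not described here, and settings_fingerprint is a hypothetical name.

```python
import hashlib
import json


def settings_fingerprint(effective_settings: dict) -> str:
    """SHA-256 over an assumed canonical JSON form of the effective settings."""
    canonical = json.dumps(effective_settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys makes the digest independent of the order in which settings appear in the YAML, while any changed value yields a different digest and therefore a different storage directory.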

Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_version | 1 | yes | Must be 1 |
| run_profile_id | slug | yes | Unique identifier |
| title | string | yes | Human-readable label |
| runner_defaults | mapping | no | Default runner settings applied to every case |
| model_overrides | mapping | no | Per-model setting overrides (key = model_id) |
| execution_policy | ExecutionPolicy | no | Concurrency and error-handling controls |
| openclaw | OpenClawRunProfile | no | OpenClaw runtime block (agent, image, timeout) |

runner_defaults / model_overrides

Merge order (later wins): runner_defaults → model_overrides[model_id] → case-level runner: fields.
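A later-wins merge can be sketched as a layered dictionary update. This assumes a shallow merge (nested mappings are replaced whole, not merged key by key); merge_layers is a hypothetical helper, and the layers are passed in the order listed in the line above.

```python
def merge_layers(*layers: dict) -> dict:
    """Shallow merge; keys in later layers replace keys in earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


runner_defaults = {"temperature": 0, "max_tokens": 1024}
model_overrides = {"big_model": {"max_tokens": 4096}}
case_runner = {"temperature": 0.7}  # extra fields from the case's runner: block

effective = merge_layers(
    runner_defaults,
    model_overrides.get("big_model", {}),
    case_runner,
)
```

Since precedence is encoded purely by argument order, a different precedence would only require reordering the arguments.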

Recognized llm_probe fields:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| temperature | float 0–2 | provider default | Sampling temperature |
| top_p | float 0–1 | provider default | Nucleus sampling |
| max_tokens | int | provider default | Max tokens to generate |
| seed | int | null | Reproducibility seed |
| timeout_seconds | int | 30 | Per-request wall-clock timeout |
| retries | int | 5 | Retry attempts on transient failures |
| max_turns | int | 8 | Max tool-use turns before forcing a final response |

execution_policy

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_concurrency | int ≥ 1 | 1 | Parallel case executions |
| run_repetitions | int ≥ 1 | 1 | Runs per (model, case); each gets a distinct fingerprint |
| fail_fast | bool | false | Stop after the first case failure |
| stop_on_runner_error | bool | true | Stop after the first unrecoverable runner error |

openclaw block

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agent_id | slug | yes | Resolves configs/agents/<agent_id>/ |
| image | string | yes | Pinned OCI image for the openclaw CLI |
| timeout_seconds | int | yes | Wall-clock timeout for the container run |
| docker_cli | string | no (default docker) | OCI runtime CLI (e.g. podman) |

OpenClaw model routing

The suite model is mapped to an OpenRouter ref (openrouter/<provider>/<model>) and injected as agents.defaults.model.primary in the generated openclaw.json. Fallbacks are rejected: a benchmark run must execute against exactly one model.
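The mapping can be pictured as follows, assuming the suite's model string is already in <provider>/<model> form and that the generated fragment nests exactly as the dotted path suggests. The function name openclaw_model_fragment is hypothetical, and ValueError stands in for the rejection of fallbacks.

```python
from typing import Optional


def openclaw_model_fragment(requested_model: str, fallbacks: Optional[list] = None) -> dict:
    """Hypothetical builder for the agents.defaults.model.primary fragment."""
    if fallbacks:
        # A benchmark run must execute against exactly one model.
        raise ValueError("fallbacks are rejected")
    # Assumption: requested_model is "<provider>/<model>", e.g. "openai/gpt-4o-mini".
    ref = f"openrouter/{requested_model}"
    return {"agents": {"defaults": {"model": {"primary": ref}}}}
```

Rejecting fallbacks up front keeps every recorded result attributable to a single model.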

Example

schema_version: 1
run_profile_id: standard_run
title: Standard run profile
runner_defaults:
  temperature: 0
  max_tokens: 1024
  timeout_seconds: 60
  retries: 2
  max_turns: 8
model_overrides:
  big_model:
    max_tokens: 4096
    timeout_seconds: 120
execution_policy:
  max_concurrency: 2
  run_repetitions: 3
  fail_fast: false
  stop_on_runner_error: true

Evaluation profile (evaluation_profile.yaml)

Defines the judges, their repetition plans, aggregation policies, and security controls.

Top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_version | 1 | yes | Must be 1 |
| evaluation_profile_id | slug | yes | Unique identifier |
| title | string | yes | Human-readable label |
| judges | list of JudgeConfig | no | Named judge definitions |
| judge_runs | list of JudgeRunConfig | no | Execution plans referencing a judge |
| aggregation | JudgeAggregationConfig | no | How to aggregate judge iterations |
| anchors | AnchorsConfig | no | Calibration anchors for the judge prompt |
| security_policy | SecurityPolicy | no | Execution security controls |
| judge_system_prompt_path | path | no | Path to system prompt file, relative to this YAML |
| judge_system_prompt | string | no | Inline system prompt text |

judge_system_prompt_path and judge_system_prompt are mutually exclusive.

judges — JudgeConfig

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| judge_id | slug | yes | Logical name referenced by judge_runs |
| type | string | yes | Judge backend ("llm_probe") |
| model | string | no | Model to call (e.g. "openai/gpt-5.4-mini") |

Any extra judge fields are passed through as OpenRouter request_options. Use this for request-level controls such as temperature or reasoning.effort:

judges:
  - judge_id: gpt54_fast_judge
    type: llm_probe
    model: openai/gpt-5.4
    request_options:
      temperature: 0.0
      reasoning:
        effort: none

judge_runs — JudgeRunConfig

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| judge_run_id | slug | | Unique run identifier |
| judge_id | slug | | Must match a declared judge |
| repetitions | int ≥ 1 | 1 | How many times to call the judge per case |
| sample_size | int or null | null | Subset of repetitions used for aggregation |

aggregation — how iterations combine

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| method | "median" / "mean" / "majority_vote" / "all_pass" | "median" | Aggregation across successful iterations |
| pass_threshold | float or null | null | Score threshold below which a dimension is considered failed |
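A rough sketch of how the four methods could behave. The semantics of majority_vote and all_pass are assumptions: here each iteration passes when its score is at or above pass_threshold, and the result is a boolean. The framework's actual definitions may differ, and aggregate is a hypothetical name.

```python
import statistics


def aggregate(scores: list, method: str = "median", pass_threshold=None):
    """Combine scores from successful judge iterations (assumed semantics)."""
    if not scores:
        raise ValueError("no successful iterations to aggregate")
    if method == "median":
        return statistics.median(scores)
    if method == "mean":
        return statistics.fmean(scores)
    if method not in ("majority_vote", "all_pass"):
        raise ValueError(f"unknown aggregation method: {method}")
    if pass_threshold is None:
        raise ValueError(f"{method} needs pass_threshold in this sketch")
    # Assumption: a score below the threshold counts as a failed iteration.
    passed = [score >= pass_threshold for score in scores]
    if method == "majority_vote":
        return sum(passed) > len(passed) / 2
    return all(passed)
```

The median default is less sensitive to a single outlier iteration than the mean, which matters when repetitions is small.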

final_score

final_score comes from judge_overall.score. Deterministic checks are informative evidence for the judge and for debugging, but there is no separate weighting policy for the final score.

anchors — calibration examples

anchors:
  enabled: true
  references:
    - anchor_id: perfect_tool_use
      label: "Perfect tool-use chain"
      text: "Used all three tools in order, file contents exact, confirmation clear."

When enabled: true, anchor texts are injected into the judge prompt to help calibrate scoring.

security_policy

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| allow_local_python_hooks | bool | false | Allow Python hook files in test cases to execute |
| network_access | "deny" / "allow" | "deny" | Network access for hook execution |
| redact_secrets | bool | true | Strip known secret patterns from artifact payloads |

Full example

schema_version: 1
evaluation_profile_id: judge_gpt4o_mini
title: Judge with GPT-4o mini (3 repetitions)
judge_system_prompt_path: prompts/judge_system_default.md
judges:
  - judge_id: main_judge
    type: llm_probe
    model: openai/gpt-4o-mini
judge_runs:
  - judge_run_id: main_run
    judge_id: main_judge
    repetitions: 3
aggregation:
  method: median
anchors:
  enabled: false
security_policy:
  allow_local_python_hooks: false
  network_access: deny
  redact_secrets: true

OpenClaw agent (configs/agents/<agent_id>/)

A directory-based config surface for OpenClaw benchmarks:

configs/agents/<agent_id>/
  agent.yaml
  workspace/
    AGENTS.md
    SOUL.md
    ...           ← any workspace template files

agent.yaml top-level fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| schema_version | 1 | yes | Must be 1 |
| agent_id | slug | yes | Must match the directory name |
| title | string | yes | Human-readable label |
| description | string | no | Optional summary |
| tags | list | no | Free tags |
| openclaw | OpenClawFragments | no | Fragments merged into generated openclaw.json |

openclaw fragments

| Field | Description |
| --- | --- |
| identity | Not written to openclaw.json (fails strict validation); keep persona in workspace files instead |
| agents_defaults | Merged into agents.defaults |
| agent | Used to build agents.list[0] (id, prompt → systemPromptOverride) |
| model_defaults | aliases mapped to agents.defaults.models[<primary>].alias; fallbacks are rejected |

openclaw.agent.id is used as the agent ID passed to openclaw agent --agent. It can differ from agent_id (e.g. agent_id: support_agent with openclaw.agent.id: support-agent).

Workspace contract

  • workspace/ must exist beside agent.yaml
  • The harness copies it into an ephemeral temp directory before each run
  • Missing standard files (AGENTS.md, IDENTITY.md, SOUL.md, TOOLS.md, USER.md) are filled with deterministic placeholder content
  • The agent fingerprint covers the SHA-256 of every file in workspace/ — changing any workspace file invalidates all stored runs for that agent
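The workspace part of the fingerprint can be sketched as a digest over per-file digests. The combination scheme is an assumption: files are visited in sorted order and each contributes its relative path and SHA-256 to an outer hash. The function name workspace_fingerprint is hypothetical.

```python
import hashlib
from pathlib import Path


def workspace_fingerprint(workspace: Path) -> str:
    """Hash every file under workspace/ into one digest (assumed combination scheme)."""
    outer = hashlib.sha256()
    # Sorted traversal keeps the result independent of filesystem listing order.
    for path in sorted(p for p in workspace.rglob("*") if p.is_file()):
        rel = path.relative_to(workspace).as_posix()
        file_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        outer.update(f"{rel}\0{file_digest}\n".encode("utf-8"))
    return outer.hexdigest()
```

Any added, removed, renamed, or edited file changes the result in this sketch, which mirrors why touching a workspace file invalidates stored runs for that agent.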