Deterministic Checks

Deterministic checks run directly against the stored RunArtifact, with no LLM involved. They are fast, stable, and free, and they produce hard pass/fail signals that the judge can use as grounded evidence.

Each check declares one or more dimensions, telling the system which parts of the evaluation the signal is relevant to.

Built-in declarative checks

Every declarative check is selected by setting declarative.kind on the check entry in the test case.

`final_response_present`

Passes if the run produced a non-empty final response.

For openclaw runs, the check also passes if a readable key workspace output exists even when there is no explicit final output field.

deterministic_checks:
  - check_id: response-exists
    dimensions: [task]
    declarative:
      kind: final_response_present

`tool_call_count`

Passes if the run recorded exactly the expected number of tool calls.

deterministic_checks:
  - check_id: used-three-tools
    dimensions: [process, efficiency]
    declarative:
      kind: tool_call_count
      expected: 3

`file_exists`

Passes if a filesystem path exists and is a regular file.

deterministic_checks:
  - check_id: output-created
    dimensions: [task]
    declarative:
      kind: file_exists
      path: /tmp/output.txt

`file_contains`

Passes if a file exists and contains a required substring.

deterministic_checks:
  - check_id: marker-present
    dimensions: [task]
    declarative:
      kind: file_contains
      path: /tmp/output.txt
      text: expected-marker-string

`path_exists`

Passes if a filesystem path exists, whether it is a file or a directory.

deterministic_checks:
  - check_id: dir-created
    dimensions: [task]
    declarative:
      kind: path_exists
      path: /tmp/my_output_dir
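
The difference between file_exists, file_contains, and path_exists can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not the framework's implementation; the function names mirror the check kinds for readability, and UTF-8 decoding is an assumption.

```python
from pathlib import Path


def file_exists(path: str) -> bool:
    # True only for regular files; a directory at this path does not count.
    return Path(path).is_file()


def file_contains(path: str, text: str) -> bool:
    # The file must exist and contain the substring.
    # ASSUMPTION: the file is decoded as UTF-8; undecodable bytes are replaced.
    p = Path(path)
    return p.is_file() and text in p.read_text(encoding="utf-8", errors="replace")


def path_exists(path: str) -> bool:
    # True for files and directories alike.
    return Path(path).exists()
```

Under this reading, a directory such as /tmp/my_output_dir satisfies path_exists but not file_exists.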

`status_is`

Passes if the run's terminal status matches the expected value.

Valid status values: success, failed, timed_out, invalid, provider_error.

deterministic_checks:
  - check_id: run-succeeded
    dimensions: [process]
    declarative:
      kind: status_is
      expected: success
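
As a rough illustration, the status comparison can be modelled with an enum that holds the five values listed above. The member names and the status_is helper are hypothetical; only the string values come from this page.

```python
from enum import Enum


class RunStatus(str, Enum):
    # String values are taken from the list above; member names are assumed.
    SUCCESS = "success"
    FAILED = "failed"
    TIMED_OUT = "timed_out"
    INVALID = "invalid"
    PROVIDER_ERROR = "provider_error"


def status_is(actual: str, expected: str) -> bool:
    # Converting through the enum rejects unknown values with ValueError
    # instead of letting a typo in the test case fail silently.
    return RunStatus(actual) is RunStatus(expected)
```

In this sketch a misspelled expected value raises an error rather than producing a check that can never pass.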

`output_artifact_present`

Passes if the run artifact records at least one output artifact reference matching the given artifact_type.

deterministic_checks:
  - check_id: key-output-present
    dimensions: [task]
    declarative:
      kind: output_artifact_present
      artifact_type: openclaw_key_output

`openclaw_workspace_file_present`

For openclaw runs only. Passes if a recorded output artifact resolves to a workspace file whose path ends with the given relative_path.

Content checks have two modes:

contains: legacy exact substring match.
contains_all / contains_any: normalized matching. Text is case-folded and accents are stripped, so Sebastián matches sebastian.

Do not mix contains with contains_all or contains_any in the same check.

deterministic_checks:
  - check_id: report-md-created
    dimensions: [task]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md

  - check_id: report-md-has-marker
    dimensions: [process]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md
      contains: openclaw-tool-example

  - check_id: report-md-has-normalized-evidence
    dimensions: [task, process]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md
      contains_all: [sebastian, feedback]
      contains_any: [ignorar, ignor]
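
The two content modes can be sketched as follows. This is a minimal approximation under stated assumptions, not the actual matcher: the helper names are hypothetical, the search terms are assumed to be normalized the same way as the file content, and when both contains_all and contains_any are given the sketch requires both to hold.

```python
import unicodedata


def normalize(text: str) -> str:
    # Decompose accented characters, drop the combining marks, then case-fold.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def content_matches(content, contains=None, contains_all=None, contains_any=None):
    # Mixing the legacy mode with the normalized modes is rejected.
    if contains is not None and (contains_all or contains_any):
        raise ValueError("contains cannot be combined with contains_all/contains_any")
    if contains is not None:
        return contains in content  # legacy mode: exact substring match
    haystack = normalize(content)
    # ASSUMPTION: every contains_all term and at least one contains_any term must match.
    all_ok = all(normalize(term) in haystack for term in (contains_all or []))
    any_ok = not contains_any or any(normalize(term) in haystack for term in contains_any)
    return all_ok and any_ok
```

With this sketch, a report containing "Sebastián" satisfies contains_all: [sebastian], while the legacy contains: sebastian does not match it.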

Summary table

| `kind` | Typical dimensions | Notes |
| --- | --- | --- |
| `final_response_present` | `task`, `process` | Works for both `llm_probe` and `openclaw` |
| `tool_call_count` | `process`, `efficiency` | Exact count check |
| `file_exists` | `task` | Host filesystem path |
| `file_contains` | `task` | Host filesystem path + substring |
| `path_exists` | `task` | File or directory |
| `status_is` | `process` | Matches `RunStatus` values |
| `output_artifact_present` | `task` | Checks `RunArtifact.output_artifacts` |
| `openclaw_workspace_file_present` | `task`, `process` | OpenClaw runs only; checks workspace diff |

Python hook checks

When a check cannot be expressed declaratively, you can implement it as a Python callable. This is the escape hatch for custom logic.

deterministic_checks:
  - check_id: custom-check
    dimensions: [task]
    hook:
      path: checks/my_check.py       # relative to test.yaml
      callable_name: check_output

Or using an importable module:

deterministic_checks:
  - check_id: custom-check
    dimensions: [task]
    hook:
      import_path: mypackage.checks.output_check
      callable_name: check_output

import_path and path are mutually exclusive.
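
A hook module such as checks/my_check.py might look like the sketch below. This page does not specify the hook signature, so everything about the interface here is an assumption: the single run_artifact argument, its mapping shape, the "final_response" key, and the boolean return value are all hypothetical. Only the callable name check_output comes from the configuration above.

```python
from typing import Any, Mapping


def check_output(run_artifact: Mapping[str, Any]) -> bool:
    """Hypothetical hook: pass when the final response mentions a required marker."""
    # ASSUMPTION: the artifact arrives as a mapping with a "final_response" key.
    response = run_artifact.get("final_response") or ""
    # ASSUMPTION: returning True means the check passed.
    return "expected-marker-string" in response
```

Consult the hook interface of your installed version before adapting this, since the real callable may receive a different object or need to return a richer result.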

Security policy

Python hook checks are disabled by default. To enable them, set security_policy.allow_local_python_hooks: true in the evaluation profile. This setting exists because hooks execute arbitrary code during evaluation.

How dimensions feed evaluation

The dimensions list on each check tells the evaluation context shown to the judge which dimension scores that check's outcome can inform. For example:

final_response_present mapped to task: a missing response is strong evidence of task failure.
tool_call_count mapped to process and efficiency: an unexpected number of tool calls is relevant process and efficiency evidence.
openclaw_workspace_file_present mapped to task: a missing artifact is strong evidence that the task was not completed.

The final dimension scores still come from the judge. Deterministic outcomes are surfaced alongside the run evidence so the judge and the human reviewer can see them clearly.
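
One way to picture this is to group check outcomes by the dimensions they declare before presenting them to the judge. The sketch below is purely illustrative; the outcome dictionaries and their keys (check_id, dimensions, passed) are hypothetical and not the framework's data model.

```python
from collections import defaultdict


def group_by_dimension(outcomes):
    # outcomes: iterable of dicts with hypothetical keys
    # "check_id", "dimensions", and "passed".
    grouped = defaultdict(list)
    for outcome in outcomes:
        # A check that declares several dimensions is listed under each of them.
        for dimension in outcome["dimensions"]:
            grouped[dimension].append((outcome["check_id"], outcome["passed"]))
    return dict(grouped)


outcomes = [
    {"check_id": "response-exists", "dimensions": ["task"], "passed": True},
    {"check_id": "used-three-tools", "dimensions": ["process", "efficiency"], "passed": False},
]
evidence = group_by_dimension(outcomes)
```

Here the failed used-three-tools check appears as evidence under both process and efficiency, while the judge remains responsible for the scores themselves.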

→ Hybrid evaluation
