
Deterministic Checks

Deterministic checks run directly against the stored RunArtifact, with no LLM involved. They are fast, stable, and free of inference cost, and they produce hard pass/fail signals that the judge can use as grounded evidence.

Each check declares one or more dimensions, telling the system which parts of the evaluation the signal is relevant to.


Built-in declarative checks

Each declarative check is selected by setting declarative.kind on the check entry in the test case.

final_response_present

Passes if the run produced a non-empty final response.

For openclaw runs, the check also passes if a readable key workspace output exists even when there is no explicit final output field.

deterministic_checks:
  - check_id: response-exists
    dimensions: [task]
    declarative:
      kind: final_response_present
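
The pass condition, including the openclaw fallback, can be sketched in Python. This is a minimal illustration rather than the tool's implementation: the function name, both parameters, and the choice to treat whitespace-only text as empty are assumptions made for the example.

```python
from typing import Optional


def final_response_present(
    final_response: Optional[str],
    has_readable_key_output: bool = False,
) -> bool:
    """Hypothetical sketch of the final_response_present pass condition."""
    # Assumption: whitespace-only text is treated as an empty response.
    if final_response is not None and final_response.strip():
        return True
    # openclaw fallback: a readable key workspace output also satisfies the check.
    return has_readable_key_output
```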

tool_call_count

Passes if the run recorded exactly the expected number of tool calls.

deterministic_checks:
  - check_id: used-three-tools
    dimensions: [process, efficiency]
    declarative:
      kind: tool_call_count
      expected: 3
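
Because the comparison is exact, a run that makes more calls than expected fails just as a run that makes fewer does. A minimal sketch, assuming the recorded tool calls are available as a sequence (the function and parameter names are hypothetical):

```python
from typing import Sequence


def tool_call_count_passes(tool_calls: Sequence[object], expected: int) -> bool:
    """Hypothetical sketch: pass only when the count matches exactly."""
    # Equality, not a minimum or maximum: 4 calls fail when 3 are expected.
    return len(tool_calls) == expected
```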

file_exists

Passes if a filesystem path exists and is a regular file.

deterministic_checks:
  - check_id: output-created
    dimensions: [task]
    declarative:
      kind: file_exists
      path: /tmp/output.txt

file_contains

Passes if a file exists and contains a required substring.

deterministic_checks:
  - check_id: marker-present
    dimensions: [task]
    declarative:
      kind: file_contains
      path: /tmp/output.txt
      text: expected-marker-string

path_exists

Passes if a filesystem path exists, whether it is a file or a directory.

deterministic_checks:
  - check_id: dir-created
    dimensions: [task]
    declarative:
      kind: path_exists
      path: /tmp/my_output_dir
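
The difference between file_exists, file_contains, and path_exists can be illustrated with Python's pathlib. This is a sketch of the semantics described above, not the tool's code; the function names mirror the check kinds for readability, and reading the file as UTF-8 text is an assumption.

```python
import tempfile
from pathlib import Path


def file_exists(path: str) -> bool:
    # True only for a regular file; a directory at this path fails.
    return Path(path).is_file()


def file_contains(path: str, text: str) -> bool:
    # The file must exist and contain the required substring.
    p = Path(path)
    # Assumption: the file is decoded as UTF-8 text before matching.
    return p.is_file() and text in p.read_text(encoding="utf-8")


def path_exists(path: str) -> bool:
    # True for either a file or a directory.
    return Path(path).exists()


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "output.txt"
    out_file.write_text("header\nexpected-marker-string\n", encoding="utf-8")

    print(file_exists(str(out_file)))  # True
    print(file_exists(tmp))            # False: a directory is not a regular file
    print(path_exists(tmp))            # True: directories count for path_exists
    print(file_contains(str(out_file), "expected-marker-string"))  # True
```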

status_is

Passes if the run's terminal status matches the expected value.

Valid status values: success, failed, timed_out, invalid, provider_error.

deterministic_checks:
  - check_id: run-succeeded
    dimensions: [process]
    declarative:
      kind: status_is
      expected: success
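
A sketch of the comparison, using the status values listed above. Rejecting an unknown expected value with an error, instead of letting the check fail silently, is an assumption of this example, as are the function and constant names.

```python
VALID_STATUSES = frozenset(
    {"success", "failed", "timed_out", "invalid", "provider_error"}
)


def status_is(run_status: str, expected: str) -> bool:
    """Hypothetical sketch of the status_is pass condition."""
    # Assumption: a misspelled expected value is reported as a config error.
    if expected not in VALID_STATUSES:
        raise ValueError(f"unknown status {expected!r}")
    return run_status == expected
```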

output_artifact_present

Passes if the run artifact records at least one output artifact reference matching the given artifact_type.

deterministic_checks:
  - check_id: key-output-present
    dimensions: [task]
    declarative:
      kind: output_artifact_present
      artifact_type: openclaw_key_output
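
The "at least one matching reference" rule can be sketched as below. The shape of an artifact reference (a mapping with an artifact_type key) is an assumption made for illustration; the source only states that RunArtifact.output_artifacts is what gets checked.

```python
from typing import Iterable, Mapping


def output_artifact_present(
    output_artifacts: Iterable[Mapping[str, str]],
    artifact_type: str,
) -> bool:
    """Hypothetical sketch: pass if any reference has the requested type."""
    return any(
        ref.get("artifact_type") == artifact_type for ref in output_artifacts
    )
```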

openclaw_workspace_file_present

For openclaw runs only. Passes if a recorded output artifact resolves to a workspace file whose path ends with the given relative_path.

Content checks have two modes:

  • contains: legacy exact substring match.
  • contains_all / contains_any: normalized matching. Text is case-folded and accents are stripped, so Sebastián matches sebastian.

Do not mix contains with contains_all or contains_any in the same check.

deterministic_checks:
  - check_id: report-md-created
    dimensions: [task]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md

  - check_id: report-md-has-marker
    dimensions: [process]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md
      contains: openclaw-tool-example

  - check_id: report-md-has-normalized-evidence
    dimensions: [task, process]
    declarative:
      kind: openclaw_workspace_file_present
      relative_path: report.md
      contains_all: [sebastian, feedback]
      contains_any: [ignorar, ignor]
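
The normalized matching used by contains_all and contains_any can be approximated with Python's unicodedata module. This is a sketch under stated assumptions: the exact normalization form the tool uses is not documented here (NFKD is assumed), and requiring both contains_all and contains_any to hold when both are given is also an assumption. The function names are hypothetical.

```python
import unicodedata
from typing import Sequence


def normalize(text: str) -> str:
    # Decompose accented characters, drop the combining marks, then case-fold.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def normalized_match(
    content: str,
    contains_all: Sequence[str] = (),
    contains_any: Sequence[str] = (),
) -> bool:
    haystack = normalize(content)
    # Every contains_all term must appear in the normalized content.
    all_ok = all(normalize(term) in haystack for term in contains_all)
    # An empty contains_any list imposes no constraint; otherwise one term suffices.
    any_ok = not contains_any or any(
        normalize(term) in haystack for term in contains_any
    )
    return all_ok and any_ok


report = "Feedback de Sebastián: ignorar la sugerencia."
print(normalize("Sebastián"))  # sebastian
print(normalized_match(report, ["sebastian", "feedback"], ["ignorar", "ignor"]))  # True
print("sebastian" in report)   # False: the legacy exact match sees the accent and case
```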

Summary table

kind                              Typical dimensions    Notes
final_response_present            task, process         Works for both llm_probe and openclaw
tool_call_count                   process, efficiency   Exact count check
file_exists                       task                  Host filesystem path
file_contains                     task                  Host filesystem path + substring
path_exists                       task                  File or directory
status_is                         process               Matches RunStatus values
output_artifact_present           task                  Checks RunArtifact.output_artifacts
openclaw_workspace_file_present   task, process         OpenClaw runs only; checks workspace diff

Python hook checks

When a check cannot be expressed declaratively, you can implement it as a Python callable. This is the escape hatch for custom logic.

deterministic_checks:
  - check_id: custom-check
    dimensions: [task]
    hook:
      path: checks/my_check.py       # relative to test.yaml
      callable_name: check_output

Or using an importable module:

deterministic_checks:
  - check_id: custom-check
    dimensions: [task]
    hook:
      import_path: mypackage.checks.output_check
      callable_name: check_output

import_path and path are mutually exclusive.
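
One way a hook entry like the two above could be resolved to a callable is sketched below with the standard importlib machinery. This is not the tool's loader: resolve_hook and base_dir (standing in for the directory that contains test.yaml) are hypothetical, and treating "neither key set" as an error is an assumption. The arguments a hook receives and the value it must return are not described here, so the sketch stops at locating the callable.

```python
import importlib
import importlib.util
from pathlib import Path
from typing import Callable, Optional


def resolve_hook(
    callable_name: str,
    path: Optional[str] = None,
    import_path: Optional[str] = None,
    base_dir: str = ".",
) -> Callable:
    """Hypothetical sketch: turn a hook entry into a Python callable."""
    # path and import_path are mutually exclusive; exactly one must be set.
    if (path is None) == (import_path is None):
        raise ValueError("set exactly one of 'path' or 'import_path'")

    if import_path is not None:
        # Importable module, e.g. mypackage.checks.output_check.
        module = importlib.import_module(import_path)
    else:
        # File path, resolved relative to the directory holding test.yaml.
        file_path = Path(base_dir) / path
        spec = importlib.util.spec_from_file_location(file_path.stem, file_path)
        if spec is None or spec.loader is None:
            raise ImportError(f"cannot load hook module from {file_path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

    return getattr(module, callable_name)
```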

Security policy

Python hook checks are disabled by default. To enable them, set security_policy.allow_local_python_hooks: true in the evaluation profile. This setting exists because hooks execute arbitrary code during evaluation.
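
The default-off behaviour amounts to a guard that runs before any hook code is loaded. A minimal sketch, assuming the evaluation profile has been parsed into a nested mapping; ensure_hooks_allowed and HooksDisabledError are hypothetical names, not part of the tool.

```python
from typing import Any, Mapping


class HooksDisabledError(RuntimeError):
    """Hypothetical error raised when hooks are used without being enabled."""


def ensure_hooks_allowed(profile: Mapping[str, Any]) -> None:
    policy = profile.get("security_policy") or {}
    # Hooks stay disabled unless the flag is explicitly set to true.
    if policy.get("allow_local_python_hooks", False) is not True:
        raise HooksDisabledError(
            "Python hook checks are disabled; set "
            "security_policy.allow_local_python_hooks: true to enable them"
        )
```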


How dimensions feed evaluation

The dimensions list on each check determines which dimension scores that check's outcome can inform in the judge-facing evaluation context. For example:

  • final_response_present mapped to task: a missing response is strong evidence of task failure
  • tool_call_count mapped to process and efficiency: an unexpected number of tool calls is evidence about both process and efficiency
  • openclaw_workspace_file_present mapped to task: a missing artifact is strong evidence that the task was not completed

The final dimension scores still come from the judge. Deterministic outcomes are surfaced alongside the run evidence so the judge and the human reviewer can see them clearly.
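
To make the mapping concrete, the sketch below groups check outcomes by the dimensions they declare, which is roughly the view a judge or reviewer would need per dimension. The outcome record shape (check_id, dimensions, passed) and the function name are assumptions for illustration; the source does not specify how outcomes are stored or presented.

```python
from collections import defaultdict
from typing import Dict, List, Mapping, Sequence, Tuple


def group_by_dimension(
    outcomes: Sequence[Mapping[str, object]],
) -> Dict[str, List[Tuple[str, bool]]]:
    """Hypothetical sketch: collect (check_id, passed) pairs per dimension."""
    grouped: Dict[str, List[Tuple[str, bool]]] = defaultdict(list)
    for outcome in outcomes:
        # A check that declares several dimensions appears under each of them.
        for dimension in outcome["dimensions"]:
            grouped[dimension].append((outcome["check_id"], outcome["passed"]))
    return dict(grouped)


outcomes = [
    {"check_id": "response-exists", "dimensions": ["task"], "passed": True},
    {"check_id": "used-three-tools", "dimensions": ["process", "efficiency"], "passed": False},
]
evidence = group_by_dimension(outcomes)
print(evidence["task"])        # [('response-exists', True)]
print(evidence["efficiency"])  # [('used-three-tools', False)]
```

The grouping only organizes evidence; it does not compute a score, since the dimension scores come from the judge.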

Hybrid evaluation