Run Artifacts¶
A RunArtifact is the canonical JSON record of one runner execution. It is the single source of truth for everything that happened: the input, the trace, the output, the usage, and any errors.
Everything downstream — deterministic checks, the judge, and final evaluation reporting — reads from this artifact. Nothing mutates it.
Top-level structure¶
{
"schema_version": 1,
"identity": {
"run_id": "run_001",
"case_id": "llm_probe_tool_example",
"suite_id": "llm_probe_examples",
"run_profile_id": "llm_probe_examples",
"runner_type": "llm_probe",
"run_fingerprint": "a3f8c2..."
},
"status": "success",
"request": {
"requested_model": "minimax/minimax-m2.7",
"gateway": "openrouter",
"execution_parameters": { "temperature": 0, "max_tokens": 768 }
},
"provider": {
"provider_name": "Minimax",
"provider_model_id": "minimax-m2.7",
"finish_reason": "stop"
},
"timing": {
"started_at": "2026-04-22T10:00:00Z",
"completed_at": "2026-04-22T10:00:12Z",
"duration_seconds": 12.3
},
"usage": {
"input_tokens": 580,
"output_tokens": 120,
"total_tokens": 700,
"cost_usd": 0.00018
},
"trace": [...],
"output_artifacts": [],
"runner_metadata": {}
}
Terminal statuses¶
| Status | Meaning |
|---|---|
success |
The run completed and produced a final response |
failed |
The runner encountered an error after the model responded |
timed_out |
The run exceeded timeout_seconds |
invalid |
The provider returned an invalid or unparseable response |
provider_error |
The provider returned an error (HTTP 4xx/5xx, quota exceeded, etc.) |
The trace¶
The trace field is an ordered list of events that records exactly what happened during the run. Event types:
| Kind | Description |
|---|---|
message |
A message sent to or received from the model (role + content) |
tool_call |
A tool call requested by the model (name + arguments) |
tool_result |
The result returned after executing the tool |
final_output |
The model's final response (the last assistant message before stopping) |
runner_trace |
Internal runner lifecycle events (retries, errors, etc.) |
Example trace (abbreviated):
[
{"kind": "message", "role": "user", "content": "Use real tools to..."},
{"kind": "tool_call", "tool_name": "exec_shell", "arguments": {"command": "printf 'marker\\n'"}},
{"kind": "tool_result", "tool_name": "exec_shell", "content": "marker\n", "status": "success"},
{"kind": "tool_call", "tool_name": "write_file", "arguments": {"path": "/tmp/out.txt", "content": "marker\n"}},
{"kind": "tool_result", "tool_name": "write_file", "content": "written", "status": "success"},
{"kind": "final_output", "content": "Created /tmp/out.txt with the required content."}
]
Usage fields¶
| Field | Description |
|---|---|
input_tokens |
Tokens in the prompt |
output_tokens |
Tokens in the completion |
total_tokens |
Sum of input + output |
reasoning_tokens |
Tokens used for chain-of-thought (if available) |
cached_input_tokens |
Cache-hit tokens (if available) |
cache_write_tokens |
Tokens written to cache (if available) |
cost_usd |
Estimated USD cost (from OpenRouter pricing) |
The raw provider usage payload is preserved separately under raw_provider_usage so OpenRouter-specific fields remain inspectable.
OpenClaw evidence (runner_metadata.openclaw)¶
For OpenClaw runs, the runner_metadata field contains an openclaw block with all the evidence captured from the container execution:
{
"runner_metadata": {
"openclaw": {
"agent_id": "support_agent",
"container_image": "ghcr.io/openclaw/openclaw:2026.4.15",
"openclaw_generated_config": {
"artifact_type": "openclaw_generated_config",
"uri": "file://run_1.artifacts/openclaw.json"
},
"openclaw_workspace_snapshot": {
"artifact_type": "openclaw_workspace_snapshot",
"uri": "file://run_1.artifacts/workspace.tar.gz"
},
"openclaw_workspace_diff": {
"artifact_type": "openclaw_workspace_diff",
"uri": "file://run_1.artifacts/workspace.diff"
},
"openclaw_key_output": [
{
"artifact_type": "openclaw_key_output",
"uri": "file://run_1.artifacts/report.md",
"name": "report.md"
}
],
"openclaw_logs": {
"artifact_type": "openclaw_logs",
"uri": "file://run_1.artifacts/openclaw.log"
}
}
}
}
| Evidence field | What it contains |
|---|---|
openclaw_generated_config |
The openclaw.json generated for this run |
openclaw_raw_session_trace |
The full JSON session trace from OpenClaw |
openclaw_logs |
Captured stdout/stderr logs from the container |
openclaw_workspace_snapshot |
Archive of the workspace at the end of the run |
openclaw_workspace_diff |
Diff between the template workspace and the final workspace |
openclaw_key_output |
Key output files identified by the runner (list) |
Artifact URIs use file:// references to files stored in the run_1.artifacts/ directory next to the artifact JSON.
Storage layout¶
outputs/runs/suit_{suite_id}/run_profile_{fp6}/
manifest.json ← suite-level manifest
{model_id}/
{case_id}/
manifest.json ← case-level manifest (maps repetitions → fingerprints)
run_1.json ← RunArtifact for repetition 1
run_1.artifacts/ ← external files referenced by run_1.json
run_1.fingerprint_input.json ← normalized payload used to derive the fingerprint
run_2.json ← RunArtifact for repetition 2 (if run_repetitions > 1)
run_2.artifacts/
run_2.fingerprint_input.json
The case-level manifest.json maps repetition indices to their full run fingerprints. It also records runner_type and, for OpenClaw runs, openclaw_agent_id, so stored campaigns are inspectable without opening each run_N.json.
Evaluation output layout¶
Evaluation results live under a separate tree, scoped by both the run-profile fingerprint and the evaluation fingerprint:
outputs/evaluations/suit_{suite_id}/
evaluation_profile_{run_fp6}/
eval_profile_{eval_id}_{eval_fp6}/
manifest.json ← campaign-level manifest
fingerprint_input.json ← evaluation fingerprint input
{model_id}/
{case_id}/
manifest.json
evaluation_result_summary_1.md ← human-readable summary (start here)
judge_1.prompt.debug.md ← exact prompt shown to the judge
raw_outputs/
judge_1.json ← aggregated judge result
judge_1.prompt.user.json ← structured subject-view payload
final_result_1.json ← FinalEvaluationResult (hybrid score)
Recommended reading order:
evaluation_result_summary_1.md— the verdictjudge_1.prompt.debug.md— what the judge sawraw_outputs/final_result_1.json— technical details and dimension resolution