Judge Prompt Redesign¶
Problem¶
The current judge prompt embeds a large JSON payload with:
judge_namejudge_modelcase_contextdeterministic_summary- a mostly raw
run_artifact
This creates three issues:
- The judge sees implementation and infrastructure details that are not the subject of the evaluation.
- The prompt spends too many tokens on low-signal metadata.
- The judge-facing view is not stored as a first-class artifact, even though it is the real evaluation input.
Design Goals¶
- Make the subject of evaluation explicit.
- Remove irrelevant infrastructure noise.
- Preserve enough evidence for process scoring.
- Reduce prompt size and ambiguity.
- Persist exactly what the judge saw in
outputs. - Keep campaign-level reproducibility and fingerprint stability.
Proposed Prompt Shape¶
The judge should still receive two messages:
system: scoring contract and output schemauser: a compact, evaluation-oriented JSON document
The user payload should be reshaped to this:
{
"schema_version": 2,
"summary": "A concise overall assessment of the run.",
"evaluation_target": {
"task_messages": [
{"content": "..."}
],
"expectations": {
"hard": ["..."],
"soft": ["..."]
},
"dimensions": ["task", "process", "autonomy", "closeness", "efficiency", "spark"]
},
"subject_response": {
"final_output": "...",
"assistant_messages": ["..."],
"tool_activity_summary": {
"tool_call_count": 0,
"tools_used": []
}
},
"execution_evidence": {
"deterministic_summary": {
"passed_checks": 0,
"failed_checks": 0,
"error_checks": 0,
"total_checks": 0
},
"key_trace_events": [
{"kind": "message", "role": "assistant", "content": "..."},
{"kind": "tool_call", "tool_name": "..."},
{"kind": "tool_result", "tool_name": "...", "status": "success"}
],
"artifacts": [
{"artifact_type": "key_output", "basename": "report.md", "excerpt": "..."}
],
"material_failures": [
{"stage": "execution", "message": "..."}
]
},
"dimensions": {
"task": {"evidence": ["..."], "score": 7.5},
"process": {"evidence": ["..."], "score": 7.0},
"autonomy": {"evidence": ["..."], "score": 6.5},
"closeness": {"evidence": ["..."], "score": 7.0},
"efficiency": {"evidence": ["..."], "score": 6.0},
"spark": {"evidence": ["..."], "score": 5.5}
}
}
Explicit Removals¶
The judge prompt should not include these fields:
judge_namejudge_modelrun_artifact.identity.run_idrun_artifact.identity.run_profile_idrun_artifact.identity.suite_idrun_artifact.request.gatewayrun_artifact.request.execution_parametersrun_artifact.request.metadata.requested_runner_configrun_artifact.request.metadata.openclaw.container_imagerun_artifact.providerrun_artifact.timingrun_artifact.usage- raw
runner_metadata - low-level Docker/config/bootstrap logs by default
- local filesystem paths such as
source_path, temp dirs, andfile://refs
These fields are useful for debugging and storage, but not as primary evaluation input.
What The Judge Should See Instead¶
1. Evaluation target¶
This section should contain only the task definition:
- user/system messages from the case input
- expectation text
- declared dimensions
2. Subject response¶
This section should contain only the outcome produced by the evaluated system:
- terminal run status
- final output if present
- assistant messages if relevant
- small tool-use summary
This makes the evaluated object obvious and separate from runner metadata.
3. Execution evidence¶
This section should contain only evidence needed for process scoring:
- deterministic summary
- filtered trace events
- key output excerpts
- material failures that prevented task completion
This preserves process visibility without drowning the judge in runner internals.
Filtering Rules¶
Keep¶
- user-visible assistant content
- final output content
- tool calls and tool results when they are semantically relevant
- key generated files and short excerpts
- task-blocking failures
- a compact summary of deterministic checks
Drop by default¶
- config generation details
- container image names
- raw environment forwarding details
- workspace materialization events
- registry pull errors unless they directly explain why no answer exists
- repeated stderr noise
- null-heavy metadata blocks
- duplicate evidence already represented elsewhere
OpenClaw-specific rule¶
Replace openclaw_judge_context with a normalized execution_evidence summary:
material_failureskey_trace_eventsartifacts
Do not pass raw excerpts like generated_openclaw_config_excerpt or large log chunks unless the
run failed before any model output and the failure itself is the only evaluable evidence.
Prompt Storage¶
Persist what the judge saw as first-class artifacts next to judge_N.json.
Recommended files:
judge_1.prompt.debug.mdraw_outputs/judge_1.prompt.user.json
raw_outputs/judge_1.json should remain the raw provider result record.
The new prompt files should be derived from the exact sent payload, not rebuilt later.
Reproducibility¶
The judge-facing user payload should be fingerprinted.
Recommended additions:
- store a hash of
judge_prompt_user.jsonin the evaluation record - optionally store a hash of
judge_prompt_system.txt - keep
judge_system_promptin the campaign manifest as today
This preserves the ability to answer two different questions:
- "What instructions governed this evaluation campaign?"
- "What exact evidence packet did this judge invocation see?"
Migration Path¶
Phase 1¶
- Keep current raw
raw_outputs/judge_1.json - Add persisted prompt files and a human-readable summary
- Introduce a builder that produces a normalized judge-facing subject view
- Keep old internal run artifact storage unchanged
Phase 2¶
- Switch the judge prompt from raw
run_artifactto the new normalized view - Update tests to assert field inclusion/exclusion
- Add fixture coverage for successful runs and infrastructure failures
Phase 3¶
- Optionally remove
judge_nameandjudge_modelfrom the prompt entirely - Optionally make trace filtering runner-specific via adapters
Recommended First Implementation¶
The first implementation should do four things only:
- Stop embedding
judge_nameandjudge_modelin theuserprompt payload. - Replace
run_artifactin the prompt with a newjudge_subject_view. - Filter OpenClaw evidence down to response, key outputs, relevant trace, and material failures.
- Persist
judge_N.prompt.debug.md,raw_outputs/judge_N.prompt.user.json, and anevaluation_result_summary_N.mdunderoutputs/evaluations/....
This gives a meaningful reduction in prompt noise without forcing a full storage redesign.