
Fingerprints & Reuse

personal_agent_eval uses SHA-256 fingerprints to identify runs and evaluations. This is the mechanism that makes campaigns incremental and reproducible: no combination is re-executed unless its inputs have changed.


What is a fingerprint?

A fingerprint is a deterministic SHA-256 hash computed from a normalized JSON payload of all the inputs that matter for one execution. If the inputs are the same, the hash is the same. If anything relevant changes, the hash changes.
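The idea can be sketched in a few lines of Python. This is a minimal illustration, not the framework's implementation: `fingerprint_of` is a hypothetical helper, and the canonical form (sorted keys, compact separators, UTF-8) is an assumption about what "normalized JSON" means here.

```python
import hashlib
import json


def fingerprint_of(payload: dict) -> str:
    """Hash a payload via an assumed canonical JSON form (sorted keys, compact separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = fingerprint_of({"temperature": 0, "max_tokens": 768})
b = fingerprint_of({"max_tokens": 768, "temperature": 0})    # same inputs, different key order
c = fingerprint_of({"temperature": 0.1, "max_tokens": 768})  # one relevant input changed

print(a == b)  # key order does not affect the hash
print(a == c)  # a changed value produces a different hash
```

Sorting keys before hashing is what makes the result independent of the order in which the configuration happened to be written.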

There are three kinds of fingerprints:

| Kind | What it identifies |
| --- | --- |
| Run fingerprint | One (model, case, run_profile, repetition) combination |
| Evaluation fingerprint | One (evaluation_profile) configuration |
| OpenClaw agent fingerprint | One (agent_config, workspace content) bundle |

What goes into a run fingerprint?

The run fingerprint payload covers everything that determines what a model will receive and how it will execute:

| Field | Source |
| --- | --- |
| runner_type | test.yaml → runner.type |
| requested_model | suite model entry |
| runner_config | resolved execution parameters (temperature, max_tokens, timeout, retries, etc.) |
| input_messages | test.yaml → input.messages (content + role, normalized) |
| input_context | test.yaml → input.context (tool list, openclaw context, etc.) |
| attachments | file SHA-256 + byte size (content-addressed, not path-addressed) |
| case_metadata | test.yaml → metadata |
| repetition_index | the repetition number (0-based) |
| openclaw_agent_fingerprint | the agent+workspace fingerprint (OpenClaw only) |

The fingerprint is the SHA-256 hash of the canonical JSON serialization of these fields. Floating-point values in the fingerprint input are never rounded (rounding is applied only in reporting output), so the hash reflects the exact configured values.
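To illustrate what "content-addressed, not path-addressed" means for attachments, here is a minimal sketch. The helper `attachment_entry` and the key names `sha256` and `size` are assumptions for illustration; the source only states that the file hash and byte size are used.

```python
import hashlib
import tempfile
from pathlib import Path


def attachment_entry(path: Path) -> dict:
    """Describe a file by content only (SHA-256 and byte size), never by its path."""
    data = path.read_bytes()
    return {"sha256": hashlib.sha256(data).hexdigest(), "size": len(data)}


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "notes.txt"
    moved = Path(tmp) / "moved" / "notes_copy.txt"
    moved.parent.mkdir()
    original.write_bytes(b"hello")
    moved.write_bytes(b"hello")

    # Same bytes under a different name and directory: identical entry.
    same_content = attachment_entry(original) == attachment_entry(moved)

    # Editing the bytes changes the entry, and therefore the fingerprint.
    moved.write_bytes(b"hello!")
    edited_differs = attachment_entry(original) != attachment_entry(moved)

print(same_content, edited_differs)
```

Under this scheme, renaming or moving an attachment would leave the fingerprint unchanged, while editing its contents would not.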


What goes into an evaluation fingerprint?

The evaluation fingerprint covers everything that determines how runs are judged:

| Field | Source |
| --- | --- |
| judges | judge model, type, and all settings |
| judge_runs | repetitions per judge |
| judge_aggregation | aggregation method (e.g. median) |
| anchors | scoring anchors if enabled |
| security_policy | redaction settings, allowed hooks |
| judge_system_prompt | the fingerprint of the system prompt file |

What changes the fingerprint?

Run fingerprint changes when you change:

  • temperature, max_tokens, timeout_seconds, max_turns, retries
  • model ID or gateway
  • case input messages or context
  • attachment file contents
  • case metadata
  • the workspace template for OpenClaw agents

Run fingerprint does NOT change when you:

  • add a new case to the suite (the new case gets its own fingerprint; existing cases are unaffected)
  • add a new model to the suite
  • change the suite title or metadata
  • increase run_repetitions (each repetition has its own fingerprint, so existing repetitions are reused and only the new ones are executed)
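The reason these changes are harmless is that each combination is hashed independently. The sketch below shows this with a deliberately simplified payload; `run_fingerprint`, `campaign_fingerprints`, and the `case` key are hypothetical stand-ins, and the real payload contains the fields listed earlier.

```python
import hashlib
import json


def run_fingerprint(model: str, case: str, repetition_index: int) -> str:
    """Simplified stand-in for a run fingerprint (the real payload has more fields)."""
    payload = {"requested_model": model, "case": case, "repetition_index": repetition_index}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def campaign_fingerprints(models, cases, run_repetitions):
    """One fingerprint per (model, case, repetition) combination."""
    return {
        run_fingerprint(m, c, r)
        for m in models
        for c in cases
        for r in range(run_repetitions)
    }


before = campaign_fingerprints(["model-a"], ["case-1"], run_repetitions=2)
after = campaign_fingerprints(["model-a", "model-b"], ["case-1", "case-2"], run_repetitions=3)

print(before <= after)      # every existing fingerprint is still present
print(len(after - before))  # only these new combinations would need to run
```

Because nothing about the suite as a whole enters the per-combination payload, growing the suite only adds fingerprints; it never alters existing ones.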

Evaluation fingerprint changes when you change:

  • the judge model
  • the number of judge repetitions
  • aggregation settings
  • the judge system prompt file

Storage paths and fp6

Artifacts are stored under paths that include the first 6 characters of the fingerprint (fp6):

outputs/runs/suit_{suite_id}/run_profile_{fp6}/
outputs/evaluations/suit_{suite_id}/evaluation_profile_{fp6}/eval_profile_{eval_id}_{fp6}/

When you change the run profile in a way that changes the fingerprint, a new directory is created. The old directory, with all its results, is preserved, so you can always go back and compare.
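As a rough sketch of how such a path could be derived, assuming fp6 is a plain prefix of the hex digest of the run profile's fingerprint (the helper names `fp6` and `run_profile_dir` are hypothetical, as are the example fingerprints):

```python
from pathlib import Path


def fp6(fingerprint: str) -> str:
    """First 6 characters of a hex fingerprint."""
    return fingerprint[:6]


def run_profile_dir(root: Path, suite_id: str, fingerprint: str) -> Path:
    """Build a run directory following the layout shown above."""
    return root / "runs" / f"suit_{suite_id}" / f"run_profile_{fp6(fingerprint)}"


# Two made-up fingerprints standing in for the profile before and after a change.
old = run_profile_dir(Path("outputs"), "demo", "a3f8c1" + "0" * 58)
new = run_profile_dir(Path("outputs"), "demo", "7b2e9d" + "1" * 58)

print(old.as_posix())
print(new.as_posix())
```

Since the two fingerprints differ in their first six characters, the results land in separate directories and neither overwrites the other.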


The reuse decision

Before executing any (model, case, repetition) combination, the workflow computes the expected fingerprint and checks whether a matching artifact exists in storage. The outcome is one of three actions:

| Action | Meaning |
| --- | --- |
| reuse_all | Run and evaluation artifacts exist and match; nothing is executed |
| reuse_run_only | Run artifact matches; evaluation is missing or changed → only evaluate |
| execute_new_run | Run artifact is missing or changed → execute run and then evaluate |

The RUN and EVAL columns in the CLI output (reuse / exec) reflect this decision for each row.
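The three outcomes can be expressed as a small decision function. This is a sketch of the logic as described here, not the framework's code; `ReuseAction` and `decide` are hypothetical names, and the two booleans stand for "a stored artifact with the expected fingerprint exists".

```python
from enum import Enum


class ReuseAction(Enum):
    REUSE_ALL = "reuse_all"
    REUSE_RUN_ONLY = "reuse_run_only"
    EXECUTE_NEW_RUN = "execute_new_run"


def decide(run_matches: bool, evaluation_matches: bool) -> ReuseAction:
    """Map artifact lookups to one of the three actions."""
    if not run_matches:
        # A fresh run always needs a fresh evaluation, whatever is stored.
        return ReuseAction.EXECUTE_NEW_RUN
    if not evaluation_matches:
        return ReuseAction.REUSE_RUN_ONLY
    return ReuseAction.REUSE_ALL


for run_ok in (True, False):
    for eval_ok in (True, False):
        print(run_ok, eval_ok, decide(run_ok, eval_ok).value)
```

Note that a stored evaluation is never reused on its own: once the run has to be re-executed, the evaluation follows.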


The fingerprint input file

Every stored run artifact has a companion run_1.fingerprint_input.json file. This file records the exact normalized payload that was hashed to produce the fingerprint. It is useful for:

  • understanding exactly what the framework considered when deciding to reuse a result
  • debugging unexpected re-executions (compare the stored payload with the current config)
  • audit trails: the fingerprint input is what you'd need to reproduce the exact same run

Example:

{
  "fingerprint_version": 1,
  "hash_algorithm": "sha256",
  "kind": "run",
  "fingerprint": "a3f8...",
  "payload": {
    "runner_type": "llm_probe",
    "requested_model": "minimax/minimax-m2.7",
    "runner_config": {
      "temperature": 0,
      "max_tokens": 768,
      "timeout_seconds": 90
    },
    "input_messages": [
      {"role": "user", "content": "Use real tools to..."}
    ],
    "input_context": {
      "llm_probe": {"tools": ["exec_shell", "write_file", "read_file"]}
    },
    "attachments": [],
    "case_metadata": {}
  }
}
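When debugging an unexpected re-execution, it can help to compare the stored payload with the one the current configuration would produce. The sketch below does a top-level field comparison; `changed_fields` is a hypothetical helper, and the `current` payload is hand-written here because how the framework exposes the current payload is not covered on this page.

```python
import json


def changed_fields(stored_payload: dict, current_payload: dict) -> dict:
    """Return {field: (stored, current)} for every top-level field that differs."""
    missing = object()  # sentinel so an absent field differs from any real value
    diff = {}
    for key in sorted(set(stored_payload) | set(current_payload)):
        old = stored_payload.get(key, missing)
        new = current_payload.get(key, missing)
        if old != new:
            diff[key] = (None if old is missing else old, None if new is missing else new)
    return diff


# Abbreviated version of a stored fingerprint input file.
stored = json.loads("""
{
  "payload": {
    "runner_type": "llm_probe",
    "runner_config": {"temperature": 0, "max_tokens": 768, "timeout_seconds": 90},
    "attachments": []
  }
}
""")["payload"]

# What the current configuration would hash (max_tokens was edited).
current = {
    "runner_type": "llm_probe",
    "runner_config": {"temperature": 0, "max_tokens": 1024, "timeout_seconds": 90},
    "attachments": [],
}

for field, (old, new) in changed_fields(stored, current).items():
    print(f"{field}: {old} -> {new}")
```

In practice the stored side would be read from the fingerprint input file on disk, for example with `json.loads(Path(...).read_text())`.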

How to force a re-run

The framework re-executes a combination only when no matching artifact exists in storage. To force a re-run, use one of the following approaches:

Delete the artifact:

# force one specific case+model combination
rm outputs/runs/suit_<suite_id>/run_profile_<fp6>/<model_id>/<case_id>/run_1.json
rm outputs/runs/suit_<suite_id>/run_profile_<fp6>/<model_id>/<case_id>/run_1.fingerprint_input.json

Change the run profile:

Modify any execution parameter (e.g. bump temperature from 0 to 0.1). The fingerprint changes, a new directory is created, and all cases re-run. The old directory is untouched.

Use a narrow suite:

Create a suite with only the cases you want to re-run. They land in a different suit_<id> directory from the original campaign.

See also: CLI reference, re-running specific cases.


OpenClaw agent fingerprint

For OpenClaw runs, the agent fingerprint covers:

  • agent_id
  • the full agent.yaml content (identity, model defaults, sandbox settings)
  • the SHA-256 and size of every file in workspace/

This means: if you change SOUL.md or AGENTS.md in the workspace template, the agent fingerprint changes, and all OpenClaw runs that use that agent get a new run fingerprint and will re-execute.

Because the agent fingerprint is embedded in the run fingerprint, a workspace change alone is enough to prevent reuse of every stored result for that agent. Runs that use other agents or other runner types are unaffected.
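A minimal sketch of how such an agent fingerprint could be computed follows. The function names, the payload keys, and the inclusion of each file's relative path are assumptions for illustration; the source only states that agent_id, the agent.yaml content, and the SHA-256 and size of every workspace file are covered.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def workspace_manifest(workspace: Path) -> list:
    """List every file under workspace with its relative path, SHA-256, and size."""
    files = sorted(
        (p for p in workspace.rglob("*") if p.is_file()),
        key=lambda p: p.relative_to(workspace).as_posix(),
    )
    entries = []
    for path in files:
        data = path.read_bytes()
        entries.append({
            "path": path.relative_to(workspace).as_posix(),  # assumed field
            "sha256": hashlib.sha256(data).hexdigest(),
            "size": len(data),
        })
    return entries


def agent_fingerprint(agent_id: str, agent_yaml: str, workspace: Path) -> str:
    """Hash agent identity, config text, and workspace manifest together."""
    payload = {
        "agent_id": agent_id,
        "agent_config": agent_yaml,
        "workspace": workspace_manifest(workspace),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    (ws / "SOUL.md").write_text("Be helpful.", encoding="utf-8")
    (ws / "AGENTS.md").write_text("One agent.", encoding="utf-8")
    before = agent_fingerprint("demo_agent", "model: example", ws)
    unchanged = agent_fingerprint("demo_agent", "model: example", ws)

    # Editing a single workspace file changes the agent fingerprint.
    (ws / "SOUL.md").write_text("Be helpful and terse.", encoding="utf-8")
    after = agent_fingerprint("demo_agent", "model: example", ws)

print(before == unchanged, before != after)
```

Sorting the manifest by relative path keeps the result independent of the order in which the filesystem returns directory entries.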


Fingerprint stability guarantees

  • Fingerprints are stable across Python versions and platforms because they hash a canonical JSON serialization, not a Python object.
  • Floating-point values are not rounded in fingerprint payloads (only in reporting output).
  • The fingerprint_version: 1 field in the stored payload allows future migrations if the hashing scheme ever changes.