CLI Reference¶

personal_agent_eval exposes a single entrypoint called pae. Inside a repository checkout, run it via uv:

uv run pae --help

If you have installed the package into an active virtualenv, you can also run pae directly.

Global flags¶

Flag	Default	Description
`--version`	—	Print the package version and exit
`--log-level`	`INFO`	Logging verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`

Commands¶

`pae run`¶

Executes the runs defined by a suite and run profile. Skips any (model, case, repetition) combination that already has a matching stored fingerprint.

uv run pae run \
  --suite <id-or-path> \
  --run-profile <id-or-path>

When to use it: When you want to collect raw run artifacts without spending tokens on evaluation. Useful when you plan to evaluate later with multiple judge profiles, or when you want to inspect run_1.json before committing to a full evaluation.

What it writes: RunArtifact JSON files under outputs/runs/.

Flags:

Flag	Required	Description
`--suite`	yes	Suite ID or explicit YAML path
`--run-profile`	yes	Run profile ID or explicit YAML path
`--output`	no	`text` (default) or `json`
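
The skip behaviour described above can be pictured with a small sketch. It assumes the set of combinations that already have a matching stored fingerprint is known up front; `plan_runs` and the `stored` set are hypothetical names for illustration, not part of the pae API.

```python
from itertools import product


def plan_runs(models, cases, repetitions, stored):
    """Label each (model, case, repetition) combination as 'exec' or 'reuse'.

    `stored` is an assumed set of combinations that already have a stored
    artifact with a matching fingerprint.
    """
    plan = {}
    for combo in product(models, cases, range(1, repetitions + 1)):
        plan[combo] = "reuse" if combo in stored else "exec"
    return plan


plan = plan_runs(
    models=["minimax_m27"],
    cases=["llm_probe_tool_example", "llm_probe_browser_example"],
    repetitions=1,
    stored={("minimax_m27", "llm_probe_tool_example", 1)},
)
for combo, status in plan.items():
    print(combo, status)
```

Only the combinations labelled `exec` would cost tokens in this model; the rest are loaded from storage.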

`pae eval`¶

Executes any missing runs, then evaluates the results. An existing run is reused, and so is an existing evaluation for the same run and evaluation profile.

uv run pae eval \
  --suite <id-or-path> \
  --run-profile <id-or-path> \
  --evaluation-profile <id-or-path>

When to use it: The standard command for an evaluation campaign. Handles both missing runs and missing evaluations in one shot.

What it writes: RunArtifact JSON files under outputs/runs/, evaluation results under outputs/evaluations/, and optionally a chart PNG under outputs/charts/.

Flags:

Flag	Required	Description
`--suite`	yes	Suite ID or explicit YAML path
`--run-profile`	yes	Run profile ID or explicit YAML path
`--evaluation-profile`	yes	Evaluation profile ID or explicit YAML path
`--output`	no	`text` (default) or `json`
`--no-chart`	no	Skip writing the score/cost PNG chart
`--chart PATH`	no	Write the chart to a custom path
`--chart-footnote TEXT`	no	Optional caption at the bottom of the chart

`pae run-eval`¶

An alias for pae eval: both commands execute any missing runs first and then evaluate. The longer name is kept to make the intent explicit in scripts.

uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

`pae report`¶

Reads stored artifacts and renders the report without executing anything, so no tokens are spent. Useful for regenerating a report or chart after tweaking the evaluation profile configuration, provided stored evaluations already match the new fingerprint.

uv run pae report \
  --suite <id-or-path> \
  --run-profile <id-or-path> \
  --evaluation-profile <id-or-path>

When to use it: When results are already stored and you just want to see the terminal output or regenerate the chart.

Same flags as pae eval.

Config ID resolution¶

All --suite, --run-profile, and --evaluation-profile flags accept either a config ID or an explicit path:

# using IDs (auto-resolved from conventional directories)
pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

# using explicit paths
pae run-eval \
  --suite configs/suites/llm_probe_examples.yaml \
  --run-profile configs/run_profiles/llm_probe_examples.yaml \
  --evaluation-profile configs/evaluation_profiles/judge_gpt54_mini.yaml

An ID is resolved by looking for <id>.yaml or <id>.yml under the conventional directory:

Flag	Conventional directory
`--suite`	`configs/suites/`
`--run-profile`	`configs/run_profiles/`
`--evaluation-profile`	`configs/evaluation_profiles/`

The workspace root is derived from the location of the --suite file: the CLI walks up from configs/suites/ to find the parent directory that contains configs/.
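
The resolution rules above could be sketched roughly as follows. This is an illustration, not the CLI's actual code: `resolve_config` and `find_workspace_root` are hypothetical helpers, and treating any value ending in `.yaml` or `.yml` as an explicit path is an assumption.

```python
from pathlib import Path

# Conventional directories per flag, as listed in the table above.
CONVENTIONAL_DIRS = {
    "suite": "configs/suites",
    "run-profile": "configs/run_profiles",
    "evaluation-profile": "configs/evaluation_profiles",
}


def resolve_config(value: str, flag: str, root: Path) -> Path:
    """Return the YAML path for an explicit path or a config ID."""
    candidate = Path(value)
    if candidate.suffix in {".yaml", ".yml"}:
        return candidate  # assumption: a YAML suffix marks an explicit path
    for ext in (".yaml", ".yml"):
        path = root / CONVENTIONAL_DIRS[flag] / f"{value}{ext}"
        if path.is_file():
            return path
    raise FileNotFoundError(f"no {flag} config found for ID {value!r}")


def find_workspace_root(suite_path: Path) -> Path:
    """Walk up from the suite file to the first directory containing configs/."""
    for parent in suite_path.resolve().parents:
        if (parent / "configs").is_dir():
            return parent
    raise FileNotFoundError(f"no configs/ directory above {suite_path}")
```

For a suite at `configs/suites/llm_probe_examples.yaml`, `find_workspace_root` returns the directory that holds `configs/`.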

Output formats¶

Terminal (default)¶

The default --output text mode renders human-readable tables:

Suite:             llm_probe_examples
Run profile:       llm_probe_examples
Evaluation:        judge_gpt54_mini
Run cost:          $0.0003
Evaluation cost:   $0.0018
Total cost:        $0.0021

MODEL          CASE                        RUN   EVAL  SCORE  LATENCY_S  RUN_COST  EVAL_COST  TOTAL_COST
─────────────  ──────────────────────────  ────  ────  ─────  ─────────  ────────  ─────────  ──────────
minimax_m27    llm_probe_tool_example      exec  exec   8.50      12.3    $0.0001   $0.0009    $0.0010
minimax_m27    llm_probe_browser_example   exec  exec   7.00       9.1    $0.0002   $0.0009    $0.0011

SUMMARY
Model          Cases  Avg Score  Avg Latency  Run Cost  Eval Cost  Total Cost
─────────────  ─────  ─────────  ───────────  ────────  ─────────  ──────────
minimax_m27    2       7.75      10.7s        $0.0003   $0.0018    $0.0021

The RUN and EVAL columns show either exec (newly computed) or reuse (loaded from storage).
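
The SUMMARY row in the sample is consistent with averaging score and latency and summing costs across the per-case rows. That aggregation rule is inferred from the sample figures rather than stated by the CLI, so treat this check as a sketch:

```python
# Per-case rows copied from the sample table:
# (score, latency_s, run_cost, eval_cost)
rows = [
    (8.50, 12.3, 0.0001, 0.0009),
    (7.00, 9.1, 0.0002, 0.0009),
]

avg_score = sum(r[0] for r in rows) / len(rows)
avg_latency = sum(r[1] for r in rows) / len(rows)
run_cost = sum(r[2] for r in rows)
eval_cost = sum(r[3] for r in rows)
total_cost = run_cost + eval_cost

print(f"{avg_score:.2f} {avg_latency:.1f}s ${run_cost:.4f} ${eval_cost:.4f} ${total_cost:.4f}")
# prints: 7.75 10.7s $0.0003 $0.0018 $0.0021
```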

JSON¶

Use --output json to get a machine-readable structured report on stdout:

uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini \
  --output json

Chart status messages go to stderr, so the JSON on stdout stays valid and pipeable.
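
Because stdout carries only JSON, a script can capture the two streams separately. The sketch below uses a hypothetical `run_json_report` helper and makes no assumption about the report's schema; it simply parses whatever JSON the command prints.

```python
import json
import subprocess
import sys


def run_json_report(cmd: list[str]) -> object:
    """Run a command and parse its stdout as JSON; stderr is captured separately."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    if result.stderr:
        # Status messages (for example about the chart) arrive here, not on stdout.
        print(result.stderr, file=sys.stderr, end="")
    return json.loads(result.stdout)


# Intended usage (not executed here, since it spends tokens):
# report = run_json_report([
#     "uv", "run", "pae", "run-eval",
#     "--suite", "llm_probe_examples",
#     "--run-profile", "llm_probe_examples",
#     "--evaluation-profile", "judge_gpt54_mini",
#     "--output", "json",
# ])
```

`check=True` makes a non-zero exit status raise `subprocess.CalledProcessError` instead of handing an empty or partial stdout to the JSON parser.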

The score/cost chart¶

For pae eval, pae run-eval, and pae report, the CLI writes a PNG chart by default:

Default path: outputs/charts/<evaluation_profile_id>/score_cost.png
Each bubble is one model; X axis = total cost, Y axis = mean score, bubble size = mean latency
Labels are drawn next to the bubbles rather than inside them; the renderer includes small overrides for crowded OpenClaw benchmark labels.
Requires the optional [charts] extra: uv sync --extra charts

# custom chart path
uv run pae run-eval ... --chart /tmp/results.png

# add a caption
uv run pae run-eval ... --chart-footnote "Run date: 2026-04-22"

# skip the chart entirely
uv run pae run-eval ... --no-chart

Re-running specific cases¶

The framework reuses stored results by fingerprint. To force a re-run of specific cases, you have three options:

Option 1: Delete the stored run artifact¶

# force re-run of one specific case for one model
rm outputs/runs/suit_<suite_id>/run_profile_<fp6>/<model_id>/<case_id>/run_1.json
rm outputs/runs/suit_<suite_id>/run_profile_<fp6>/<model_id>/<case_id>/run_1.fingerprint_input.json

# then re-run the campaign
uv run pae run-eval ...

The workflow detects the missing artifact and re-executes only that case.
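
If you script this step, a small helper can remove both files together. This is a sketch: `delete_run_artifacts` is a hypothetical name, and the `run_<n>` naming for repetitions other than 1 is an assumption extrapolated from `run_1.json`.

```python
from pathlib import Path


def delete_run_artifacts(case_dir: Path, repetition: int = 1) -> list[Path]:
    """Delete the run artifact and its fingerprint input for one repetition.

    Returns the paths that were actually removed.
    """
    removed = []
    # Assumption: repetition n is stored as run_<n>.json next to its fingerprint input.
    for name in (f"run_{repetition}.json", f"run_{repetition}.fingerprint_input.json"):
        path = case_dir / name
        if path.exists():
            path.unlink()
            removed.append(path)
    return removed
```

Pass the case directory (the `<case_id>` folder from the paths above) and then re-run the campaign as usual.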

Option 2: Create a narrow suite¶

Create a temporary suite with case_selection.include_case_ids limited to the cases you want to re-run:

# configs/suites/rerun_tool_case.yaml
schema_version: 1
suite_id: rerun_tool_case
title: "Targeted re-run"
models:
  - model_id: minimax_m27
    requested_model: minimax/minimax-m2.7
case_selection:
  include_case_ids:
    - llm_probe_tool_example

uv run pae run-eval \
  --suite rerun_tool_case \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

This creates a separate campaign directory under outputs/runs/suit_rerun_tool_case/ — the original campaign is untouched.
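
A temporary suite like the one above can also be generated from a script. The sketch below writes the same fields with a plain string template, so it needs no YAML library; `write_narrow_suite` is a hypothetical helper, and it assumes the IDs contain no characters that need YAML quoting.

```python
from pathlib import Path

SUITE_TEMPLATE = """\
schema_version: 1
suite_id: {suite_id}
title: "Targeted re-run"
models:
  - model_id: {model_id}
    requested_model: {requested_model}
case_selection:
  include_case_ids:
{case_lines}
"""


def write_narrow_suite(root, suite_id, model_id, requested_model, case_ids):
    """Write configs/suites/<suite_id>.yaml under root and return its path."""
    case_lines = "\n".join(f"    - {case_id}" for case_id in case_ids)
    path = Path(root) / "configs" / "suites" / f"{suite_id}.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        SUITE_TEMPLATE.format(
            suite_id=suite_id,
            model_id=model_id,
            requested_model=requested_model,
            case_lines=case_lines,
        )
    )
    return path
```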

Option 3: Change the run profile¶

Any change to a run profile field that affects execution (e.g. temperature, max_tokens) creates a new fingerprint and a new campaign directory. All cases are re-run, and the previous results are preserved under the old directory.
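
To see why a changed field leads to a new directory, consider a toy fingerprint. The scheme below (SHA-256 over canonical JSON, with a six-character prefix standing in for `<fp6>`) is an assumption chosen for illustration; the framework's real fingerprint inputs and hash may differ.

```python
import hashlib
import json


def fingerprint(fields: dict) -> str:
    """Hash execution-relevant fields into a stable hex digest (illustrative scheme)."""
    # Sorted keys make the digest independent of dictionary ordering.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


base = {"temperature": 0.0, "max_tokens": 1024}
changed = {**base, "temperature": 0.7}

# A short prefix stands in for the <fp6> path segment.
print(f"run_profile_{fingerprint(base)[:6]}")
print(f"run_profile_{fingerprint(changed)[:6]}")
```

The two printed directory names differ, so results for the changed profile would land beside the old ones rather than overwrite them.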

→ Fingerprints & reuse — full explanation

Environment variables¶

Variable	Description
`OPENROUTER_API_KEY`	Required for all LLM calls (runs and judge)
`PERSONAL_AGENT_EVAL_RUN_OPENROUTER_E2E=1`	Opt in to the real OpenRouter smoke test
`PERSONAL_AGENT_EVAL_OPENROUTER_E2E_RUN_MODEL`	Override run model in the e2e test
`PERSONAL_AGENT_EVAL_OPENROUTER_E2E_JUDGE_MODEL`	Override judge model in the e2e test
`PERSONAL_AGENT_EVAL_OPENCLAW_DOCKER_FULL_ENV=1`	Forward the full host env to OpenClaw containers

The .env file at the repository root is loaded automatically by the CLI.
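
Before starting a long campaign, a script can check that the key is available. The sketch below is a pre-flight check only: `parse_env_file` and `has_openrouter_key` are hypothetical helpers, the parser handles just simple `KEY=VALUE` lines, and it does not claim to mirror how the CLI itself loads `.env`.

```python
import os
from pathlib import Path


def parse_env_file(path: Path) -> dict:
    """Parse simple KEY=VALUE lines; blank lines and # comments are ignored."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def has_openrouter_key(repo_root: Path, environ=os.environ) -> bool:
    """True if OPENROUTER_API_KEY is non-empty in the environment or in .env."""
    if environ.get("OPENROUTER_API_KEY"):
        return True
    return bool(parse_env_file(repo_root / ".env").get("OPENROUTER_API_KEY"))
```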

Checking version¶

uv run pae --version