Skip to content

CLI Reference

personal_agent_eval exposes a single entrypoint called pae. Inside a repository checkout, run it via uv:

uv run pae --help

If you have installed the package into an active virtualenv, you can also run pae directly.


Global flags

Flag Default Description
--version Print the package version and exit
--log-level INFO Logging verbosity: DEBUG, INFO, WARNING, ERROR, CRITICAL

Commands

pae run

Executes the runs defined by a suite and run profile. Skips any (model, case, repetition) combination that already has a matching stored fingerprint.

uv run pae run \
  --suite <id-or-path> \
  --run-profile <id-or-path>

When to use it: When you want to collect raw run artifacts without spending tokens on evaluation. Useful when you plan to evaluate later with multiple judge profiles, or when you want to inspect run_1.json before committing to a full evaluation.

What it writes: RunArtifact JSON files under outputs/runs/.

Flags:

Flag Required Description
--suite yes Suite ID or explicit YAML path
--run-profile yes Run profile ID or explicit YAML path
--output no text (default) or json

pae eval

Runs any missing runs and evaluates all results. If a run already exists, it is reused; if an evaluation already exists for that run + evaluation profile, it is also reused.

uv run pae eval \
  --suite <id-or-path> \
  --run-profile <id-or-path> \
  --evaluation-profile <id-or-path>

When to use it: The standard command for an evaluation campaign. Handles both missing runs and missing evaluations in one shot.

What it writes: RunArtifact JSON files under outputs/runs/, evaluation results under outputs/evaluations/, and optionally a chart PNG under outputs/charts/.

Flags:

Flag Required Description
--suite yes Suite ID or explicit YAML path
--run-profile yes Run profile ID or explicit YAML path
--evaluation-profile yes Evaluation profile ID or explicit YAML path
--output no text (default) or json
--no-chart no Skip writing the score/cost PNG chart
--chart PATH no Write the chart to a custom path
--chart-footnote TEXT no Optional caption at the bottom of the chart

pae run-eval

Identical to pae eval. Both commands run missing runs first and then evaluate. The name is kept as an alias for clarity in scripts.

uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

pae report

Reads already stored artifacts and renders the report without executing anything. No tokens are spent. Useful for regenerating a report or chart after tweaking the evaluation profile configuration (if you already have stored evaluations matching the new fingerprint).

uv run pae report \
  --suite <id-or-path> \
  --run-profile <id-or-path> \
  --evaluation-profile <id-or-path>

When to use it: When results are already stored and you just want to see the terminal output or regenerate the chart.

Same flags as pae eval.


Config ID resolution

All --suite, --run-profile, and --evaluation-profile flags accept either a config ID or an explicit path:

# using IDs (auto-resolved from conventional directories)
pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

# using explicit paths
pae run-eval \
  --suite configs/suites/llm_probe_examples.yaml \
  --run-profile configs/run_profiles/llm_probe_examples.yaml \
  --evaluation-profile configs/evaluation_profiles/judge_gpt54_mini.yaml

An ID is resolved by looking for <id>.yaml or <id>.yml under the conventional directory:

Flag Conventional directory
--suite configs/suites/
--run-profile configs/run_profiles/
--evaluation-profile configs/evaluation_profiles/

The workspace root is derived from the location of the --suite file: the CLI walks up from configs/suites/ to find the parent directory that contains configs/.


Output formats

Terminal (default)

The default --output text mode renders human-readable tables:

Suite:             llm_probe_examples
Run profile:       llm_probe_examples
Evaluation:        judge_gpt54_mini
Run cost:          $0.0003
Evaluation cost:   $0.0018
Total cost:        $0.0021

MODEL          CASE                        RUN   EVAL  SCORE  LATENCY_S  RUN_COST  EVAL_COST  TOTAL_COST
─────────────  ──────────────────────────  ────  ────  ─────  ─────────  ────────  ─────────  ──────────
minimax_m27    llm_probe_tool_example      exec  exec   8.50      12.3    $0.0001   $0.0009    $0.0010
minimax_m27    llm_probe_browser_example   exec  exec   7.00       9.1    $0.0002   $0.0009    $0.0011

SUMMARY
Model          Cases  Avg Score  Avg Latency  Run Cost  Eval Cost  Total Cost
─────────────  ─────  ─────────  ───────────  ────────  ─────────  ──────────
minimax_m27    2       7.75      10.7s        $0.0003   $0.0018    $0.0021

The RUN and EVAL columns show either exec (newly computed) or reuse (loaded from storage).

JSON

Use --output json to get a machine-readable structured report on stdout:

uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini \
  --output json

Chart status messages go to stderr, so the JSON on stdout stays valid and pipeable.


The score/cost chart

For pae eval, pae run-eval, and pae report, the CLI writes a PNG chart by default:

  • Default path: outputs/charts/<evaluation_profile_id>/score_cost.png
  • Each bubble is one model; X axis = total cost, Y axis = mean score, bubble size = mean latency
  • Labels are drawn next to the bubbles rather than inside them; the renderer includes small overrides for crowded OpenClaw benchmark labels.
  • Requires the optional [charts] extra: uv sync --extra charts
# custom chart path
uv run pae run-eval ... --chart /tmp/results.png

# add a caption
uv run pae run-eval ... --chart-footnote "Run date: 2026-04-22"

# skip the chart entirely
uv run pae run-eval ... --no-chart

Re-running specific cases

The framework reuses stored results by fingerprint. To force a re-run of specific cases, you have two options:

Option 1: Delete the stored run artifact

# force re-run of one specific case for one model
rm outputs/runs/suit_<suite_id>/run_profile_<fp6>/<model_id>/<case_id>/run_1.json
rm outputs/runs/suit_<suite_id>/run_profile_<fp6>/<model_id>/<case_id>/run_1.fingerprint_input.json

# then re-run the campaign
uv run pae run-eval ...

The workflow detects the missing artifact and re-executes only that case.

Option 2: Create a narrow suite

Create a temporary suite with case_selection.include_case_ids limited to the cases you want to re-run:

# configs/suites/rerun_tool_case.yaml
schema_version: 1
suite_id: rerun_tool_case
title: "Targeted re-run"
models:
  - model_id: minimax_m27
    requested_model: minimax/minimax-m2.7
case_selection:
  include_case_ids:
    - llm_probe_tool_example
uv run pae run-eval \
  --suite rerun_tool_case \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

This creates a separate campaign directory under outputs/runs/suit_rerun_tool_case/ — the original campaign is untouched.

Option 3: Change the run profile

Any change to a run profile field that affects execution (e.g. temperature, max_tokens) creates a new fingerprint and a new campaign directory. All cases are re-run, and the previous results are preserved under the old directory.

Fingerprints & reuse — full explanation


Environment variables

Variable Description
OPENROUTER_API_KEY Required for all LLM calls (runs and judge)
PERSONAL_AGENT_EVAL_RUN_OPENROUTER_E2E=1 Opt-in to the real OpenRouter smoke test
PERSONAL_AGENT_EVAL_OPENROUTER_E2E_RUN_MODEL Override run model in the e2e test
PERSONAL_AGENT_EVAL_OPENROUTER_E2E_JUDGE_MODEL Override judge model in the e2e test
PERSONAL_AGENT_EVAL_OPENCLAW_DOCKER_FULL_ENV=1 Forward the full host env to OpenClaw containers

The .env file at the repository root is loaded automatically by the CLI.


Checking version

uv run pae --version