personal_agent_eval¶

personal_agent_eval is an open evaluation framework for LLM and agent-based systems. It is designed to be clear, reproducible, and easy to extend — whether you are a developer benchmarking a model for the first time or an agent consuming structured evaluation data.

What does it do?¶

It takes a set of test cases (what to ask), a model or agent (who to ask), and an evaluation policy (how to judge the answer), and produces a structured, reproducible score.

Every run is:

Reproducible — a SHA-256 fingerprint tracks exactly what was executed. Re-running the same configuration reuses the stored result instead of spending tokens again.
Transparent — every artifact is a plain JSON or Markdown file you can inspect directly: the run trace, the judge's exact prompt, the score breakdown, the cost split.
Incremental — adding a new case or a new model to a campaign only runs the missing combinations. Nothing already computed is touched.

Two runner modes¶

Mode	What gets evaluated	How it runs
`llm_probe`	A raw LLM completion endpoint with optional tool use	Direct HTTP call to OpenRouter
`openclaw`	A full autonomous agent running inside a Docker container	`docker run` with the pinned OpenClaw image

Both modes share the same config schema, the same evaluation pipeline, and the same output format.

The evaluation pipeline¶

  test cases + suite
        │
        ▼
    pae run      ──→  RunArtifact (raw trace, token usage, tool calls)
        │
        ▼
  deterministic  ──→  per-check pass/fail  (no LLM, always stable)
   evaluation
        │
        ▼
    judge        ──→  6-dimension scores + evidence  (LLM judge via OpenRouter)
        │
        ▼
  aggregation    ──→  FinalEvaluationResult  (hybrid score 0–10)
        │
        ▼
   reporting     ──→  terminal tables · JSON · optional PNG chart

Quick start¶

# install
uv sync --group dev

# set your OpenRouter API key
export OPENROUTER_API_KEY=sk-or-...

# run the shipped llm_probe example campaign
uv run pae run-eval \
  --suite llm_probe_examples \
  --run-profile llm_probe_examples \
  --evaluation-profile judge_gpt54_mini

→ Full getting started guide

Documentation map¶

I want to…	Go to
Run my first benchmark	Getting started
Understand the 4 config files	Config model
See the shipped example configs	Runnable examples
Write my own test cases	Configuration reference
Understand how scores work	Concepts
Understand fingerprints and reuse	Fingerprints & reuse
Use the CLI effectively	CLI reference
Debug a specific evaluation	Judge results
Understand the output files	Run artifacts

Repository layout¶

configs/          ← your YAML configs: cases, suites, run profiles, eval profiles, agents
src/              ← Python source (personal_agent_eval package)
tests/            ← pytest test suite (203 tests, all mocked)
docs/             ← this documentation
outputs/          ← generated at runtime; not committed to git

Open source

This is a public library. If you find a bug, want to add a deterministic check, or have a new example campaign to contribute, open an issue or PR. Every module is a self-contained layer with explicit inputs and outputs — the code is meant to be read.