$ agent-duel --task "ship the feature"

Agent Duel

One task. Many AI coding agents. May the best PR win. Stop choosing coding agents from generic benchmarks — give them your actual task, in your actual repo, and compare the results with real evidence.

Runs locally · Bring your own agents & keys · Never proxies inference

// example run

$ agent-duel --repo . --task "Add subscription cancellation + Stripe webhook + tests"

AGENT        STATUS    TESTS   FILES  COST    TIME
claude-code  complete  48/48   6      $7.20   19m
codex        partial   46/48   5      $4.80   13m
gemini-cli   partial   32/48   9      $2.10   11m

Winner: claude-code  (all tests pass, fewest stray files)

Instead of asking "which coding agent is best?" you find out "which agent is best for this exact task in our codebase?"

Why it exists

Public coding-agent leaderboards — SWE-bench, Terminal-Bench and friends — tell you how agents do on someone else's tasks. They can't tell you whether an agent can handle your architecture, your conventions, your obscure legacy code, or your one weird framework.

And the agent vendors will never neutrally show you that a competitor did the job better. Agent Duel's neutrality is the whole point: it runs the agents you choose, on your task, and reports concrete evidence — passing tests, diffs, cost and time — not a vague AI opinion.

How it works

  1. Same starting line. Each agent gets its own isolated git worktree branched from the same base commit, so they never step on each other.
  2. Same task. Your task prompt is handed to every agent unchanged.
  3. Real verification. After each agent finishes, your test command runs against its result.
  4. Scored on evidence. Results are ranked: tests passing → fewer files touched → lower cost → faster.
  5. Inspect everything. You get a side-by-side table plus duel.md and duel.json reports, and can keep the branches to read the actual diffs.

Bring your own agents

An "agent" is just a shell command you configure. Agent Duel never proxies inference and never sees your API keys: the agents run locally under your own credentials. Nothing executes on anyone else's servers, so no third party bears token costs or remote-code-execution risk. It works with any tool that edits files in a repo: Claude Code, Codex, Aider, and others.

Quickstart

You'll need the repo, a Go toolchain to build the CLI, and at least one coding-agent CLI with its API key set.

1. Build the CLI

git clone https://github.com/glglak/logs-of-a-thinking-machine.git
cd logs-of-a-thinking-machine/agent-duel
go build -o agent-duel .

2. Describe the duel in an agent-duel.json. Each agent is a command; {{TASK}} is replaced with your task text:

{
  "testCommand": "npm test",
  "timeout": "15m",
  "agents": [
    {
      "name": "claude-sonnet",
      "command": "aider --model openrouter/anthropic/claude-3.7-sonnet --message \"{{TASK}}\" --yes --no-stream"
    },
    {
      "name": "deepseek",
      "command": "aider --model openrouter/deepseek/deepseek-chat --message \"{{TASK}}\" --yes --no-stream"
    }
  ]
}

The example above duels two models through Aider using a single OpenRouter key, which is a cheap way to try it. Swap in claude, codex, or any other agent CLI just as easily.

3. Run the duel from inside your project:

agent-duel \
  --repo . \
  --config agent-duel.json \
  --task "Implement password reset with email tokens and tests" \
  --keep

--keep preserves the per-agent branches so you can review each diff. Reports land in agent-duel-results/.

Honest caveats

  • Agents are non-deterministic. A single run is a signal, not a verdict — re-run for confidence.
  • Passing tests ≠ good code. The diff is there precisely so you can judge quality yourself.
  • Garbage in, garbage out. Good tasks and a real test suite make the comparison meaningful.

Agent Duel is open source and built in the same spirit as everything else here — logs of a thinking machine, made legible. Star it, fork it, or read the source.