Ponytail Benchmark — does writing less code cost quality?

The verdict

Baseline — the agent with no skill. Builds the task as it normally would.

Ponytail — the same agent, same task, with the ponytail "lazy senior dev" skill injected.

Efficiency

Ponytail keeps its core promise: roughly half the code, for roughly half the generation cost.

Lower bars are better here. Ponytail (terracotta) is shorter almost everywhere — most dramatically on larger UI and algorithm tasks. The one exception is the trivial health-check API, where ponytail's rule of "always leave a runnable check behind" added a second test file.

Lines of code written

non-blank, non-comment lines per task
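The line-count metric above could be approximated with a few lines of Python. This is a minimal sketch, not the benchmark's actual counter: the function name `count_code_lines` and the comment prefixes `#` and `//` are assumptions, and only whole-line comments are excluded.

```python
from typing import Iterable

# Assumed line-comment prefixes; the benchmark's real rules are not stated.
COMMENT_PREFIXES = ("#", "//")


def count_code_lines(lines: Iterable[str]) -> int:
    """Count lines that are neither blank nor whole-line comments."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(COMMENT_PREFIXES):
            continue  # whole-line comment
        count += 1
    return count


sample = """
# health check
def health():
    return {"status": "ok"}  # trailing comments still count as code

// a JS-style comment
""".splitlines()

print(count_code_lines(sample))  # 2
```

A counter like this treats a line with a trailing comment as code and does not understand block comments or docstrings, so the reported numbers may have been produced under different rules.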

Generation output tokens

model output tokens spent producing each build, sampled live from the workflow budget

Quality

The trade is real: leaner code scored about one point lower, and the gap is widest where it matters most: robustness and production-readiness.

Two independent judges scored every pair blind (folders relabelled A/B) on a 1–10 scale across seven dimensions. Scores below are averaged across both judges and all 12 tasks. Readability is nearly a tie — ponytail's code is just as clear, there's simply less of it — but robustness and production-readiness fall off sharply.
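The averaging described above can be sketched as follows. The record layout, the names `average_by_dimension` and `gaps`, and every score in `scores` are hypothetical, made up purely for illustration; they are not benchmark results.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (task, judge, arm, dimension, score on the 1-10 scale).
# These values are invented for illustration; they are not benchmark data.
scores = [
    ("task-1", "judge-1", "baseline", "robustness", 8),
    ("task-1", "judge-2", "baseline", "robustness", 7),
    ("task-1", "judge-1", "ponytail", "robustness", 6),
    ("task-1", "judge-2", "ponytail", "robustness", 5),
    ("task-1", "judge-1", "baseline", "readability", 8),
    ("task-1", "judge-2", "baseline", "readability", 8),
    ("task-1", "judge-1", "ponytail", "readability", 8),
    ("task-1", "judge-2", "ponytail", "readability", 7),
]


def average_by_dimension(records):
    """Average each (dimension, arm) pair across all judges and tasks."""
    buckets = defaultdict(list)
    for _task, _judge, arm, dimension, score in records:
        buckets[(dimension, arm)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


def gaps(averages):
    """Baseline average minus ponytail average, per dimension."""
    dimensions = {dimension for dimension, _arm in averages}
    return {
        d: averages[(d, "baseline")] - averages[(d, "ponytail")]
        for d in dimensions
    }


averages = average_by_dimension(scores)
for dimension, gap in sorted(gaps(averages).items()):
    print(dimension, gap)
```

Pooling every judge-task score into one bucket per dimension weights each judgment equally, which matches "averaged across both judges and all 12 tasks" only if every judge scored every task.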

Seven-dimension profile

average score, all tasks · further out is better

Score by dimension

baseline vs ponytail, with the gap

Head-to-head verdicts

across 24 blind judgments (2 judges × 12 tasks) — who did each judge pick?

By domain

Where ponytail wins, and where it shouldn't be trusted unsupervised.

The skill's value is not uniform. On frontend work it strips the most code with the smallest quality hit. On data/CLI tasks it cuts the most corners, missing edge cases that cost real points. On the pure refactor task, both agents converged on the same one-line answer, so the skill made almost no difference.

Domain	Tasks	Code saved	Tokens saved	Quality (baseline → ponytail)

Visual output

Same brief, two philosophies — rendered.

The three frontend tasks produced real, openable pages. These are live screenshots of each build. Notice how close they often are: ponytail's versions are leaner under the hood yet still look professional — the differences are in polish and edge cases, not basic competence.

Every task, in detail

All 12 builds — open any one for scores, bugs, judge notes, code and screenshots.

Ordered from simple to complex. Each row un-blinds the A/B comparison and shows the objective metrics, the bugs the judges actually found in each arm, one judge's head-to-head summary, and the full source of both versions.

So — should you use it?

Does writing less code cost you quality?

A sharp tool, pointed at the right job.

How this was measured (and what it can't tell you)