Controlled benchmark · Claude Opus 4.8

Does writing less code cost you quality?

A head-to-head test of the ponytail skill — which pushes an AI agent to write the most minimal code that works — against the same agent with no skill at all, across 12 real build tasks from backend to frontend.

The verdict

Baseline: the agent with no skill. Builds the task as it normally would.
Ponytail: the same agent, same task, with the ponytail "lazy senior dev" skill injected.
Efficiency

Ponytail keeps its core promise: roughly half the code, for roughly half the generation cost.

Lower bars are better here. Ponytail (terracotta) is shorter almost everywhere — most dramatically on larger UI and algorithm tasks. The one exception is the trivial health-check API, where ponytail's rule of "always leave a runnable check behind" added a second test file.

[Chart: Lines of code written. Non-blank, non-comment lines per task.]
[Chart: Generation output tokens. Model output tokens spent producing each build, sampled live from the workflow budget.]
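The "non-blank, non-comment lines" metric can be approximated with a few lines of Python. This is only a sketch: the benchmark does not say how its counter works, so the function name `count_code_lines`, the comment prefixes (`#` and `//`), and the choice to count lines with trailing comments as code are all assumptions made for illustration.

```python
def count_code_lines(source: str, comment_prefixes=("#", "//")) -> int:
    """Count lines that are neither blank nor whole-line comments.

    Assumption: a comment line is one whose first non-space characters
    match one of comment_prefixes. Block comments are not handled.
    """
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if stripped.startswith(comment_prefixes):
            continue  # whole-line comment
        count += 1
    return count


# Hypothetical sample file, not taken from the benchmark.
sample = """\
# health check
import json

def health():
    return json.dumps({"ok": True})  # trailing comments still count as code
"""

print(count_code_lines(sample))
```

A real counter would need per-language comment rules (block comments, docstrings, HTML and CSS syntax), which is why line counts from different tools rarely agree exactly.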
Quality

The trade is real: leaner code scored about one point lower, and the gap is widest exactly where it matters.

Two independent judges scored every pair blind (folders relabelled A/B) on a 1–10 scale across seven dimensions. Scores below are averaged across both judges and all 12 tasks. Readability is nearly a tie — ponytail's code is just as clear, there's simply less of it — but robustness and production-readiness fall off sharply.
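The averaging described above (per dimension, across both judges and all tasks) might look like the sketch below. The record layout, the function names, and every score are hypothetical placeholders, not benchmark results; only the dimension names "readability" and "robustness" come from the text.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (task, judge, arm, dimension, score).
# The numbers are invented for illustration and are NOT benchmark results.
scores = [
    ("task-1", "judge-1", "baseline", "readability", 8),
    ("task-1", "judge-2", "baseline", "readability", 7),
    ("task-1", "judge-1", "ponytail", "readability", 8),
    ("task-1", "judge-2", "ponytail", "readability", 7),
    ("task-1", "judge-1", "baseline", "robustness", 8),
    ("task-1", "judge-2", "baseline", "robustness", 8),
    ("task-1", "judge-1", "ponytail", "robustness", 6),
    ("task-1", "judge-2", "ponytail", "robustness", 6),
]


def average_by_dimension(records):
    """Average scores per (dimension, arm) over all judges and tasks."""
    buckets = defaultdict(list)
    for _task, _judge, arm, dimension, score in records:
        buckets[(dimension, arm)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


def gaps(averages):
    """Baseline minus ponytail for each dimension (positive = baseline ahead)."""
    dimensions = sorted({dimension for dimension, _arm in averages})
    return {
        d: averages[(d, "baseline")] - averages[(d, "ponytail")]
        for d in dimensions
    }


averages = average_by_dimension(scores)
print(gaps(averages))
```

With these placeholder scores the readability gap is zero and the robustness gap is two points, the same shape of result the paragraph describes, though the magnitudes here mean nothing.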

[Chart: Seven-dimension profile. Average score, all tasks; further out is better.]
[Chart: Score by dimension. Baseline vs ponytail, with the gap.]
[Chart: Head-to-head verdicts. Across 24 blind judgments (2 judges × 12 tasks): who did each judge pick?]
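Tallying blind verdicts means mapping each judge's pick (a folder label) back to the arm behind that label. A minimal sketch follows, assuming a per-task un-blinding key and assuming judges could also call a tie; the data structures, names, and picks are all hypothetical.

```python
from collections import Counter

# Hypothetical blind judgments: each judge names the folder label it preferred.
# Values are invented for illustration, not taken from the benchmark.
judgments = [
    {"task": "task-1", "judge": "judge-1", "pick": "A"},
    {"task": "task-1", "judge": "judge-2", "pick": "B"},
    {"task": "task-2", "judge": "judge-1", "pick": "B"},
    {"task": "task-2", "judge": "judge-2", "pick": "tie"},
]

# Hypothetical un-blinding key: which arm hid behind each label, per task.
label_key = {
    "task-1": {"A": "baseline", "B": "ponytail"},
    "task-2": {"A": "ponytail", "B": "baseline"},
}


def tally(judgments, label_key):
    """Count verdicts per arm after mapping blind labels back to arms."""
    counts = Counter()
    for judgment in judgments:
        pick = judgment["pick"]
        if pick == "tie":
            arm = "tie"
        else:
            arm = label_key[judgment["task"]][pick]
        counts[arm] += 1
    return counts


verdicts = tally(judgments, label_key)
print(verdicts)
```

Varying which arm gets label A from task to task, as the key above does, keeps a judge from learning that one label always belongs to the same arm.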
By domain

Where ponytail wins, and where it shouldn't be trusted unsupervised.

The skill's value is not uniform. On frontend work it strips the most code with the smallest quality hit. On data/CLI tasks it cuts the most corners, missing edge cases that cost real points. On the pure refactor task, both agents converged on the same one-line answer, so the skill made almost no difference.

Domain | Tasks | Code saved | Tokens saved | Quality (baseline → ponytail)
Every task, in detail

All 12 builds — open any one for scores, bugs, judge notes, code and screenshots.

Ordered from simple to complex. Each row un-blinds the A/B comparison and shows the objective metrics, the bugs the judges actually found in each arm, one judge's head-to-head summary, and the full source of both versions.

So — should you use it?

A sharp tool, pointed at the right job.

How this was measured (and what it can't tell you)