A head-to-head test of the ponytail skill — which pushes an AI agent to write the most minimal code that works — against the same agent with no skill at all, across 12 real build tasks from backend to frontend.
Lower bars are better here. Ponytail's bars (terracotta) are shorter almost everywhere, most dramatically on the larger UI and algorithm tasks. The one exception is the trivial health-check API, where ponytail's rule of "always leave a runnable check behind" added a second test file.
Two independent judges scored every pair blind (folders relabelled A/B) on a 1–10 scale across seven dimensions. Scores below are averaged across both judges and all 12 tasks. Readability is nearly a tie: ponytail's code is just as clear, there is simply less of it. Ponytail's robustness and production-readiness scores, however, fall off sharply.
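The averaging step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the `average_scores` helper, the `csv-cli` task name, and every score value are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (task, judge, arm, dimension, score on a 1-10 scale).
# All values are made up for illustration; they are not the study's results.
ratings = [
    ("health-check", "judge_1", "baseline", "readability", 8),
    ("health-check", "judge_2", "baseline", "readability", 9),
    ("health-check", "judge_1", "ponytail", "readability", 8),
    ("health-check", "judge_2", "ponytail", "readability", 8),
    ("csv-cli", "judge_1", "baseline", "robustness", 9),
    ("csv-cli", "judge_2", "baseline", "robustness", 8),
    ("csv-cli", "judge_1", "ponytail", "robustness", 6),
    ("csv-cli", "judge_2", "ponytail", "robustness", 5),
]


def average_scores(records):
    """Return {(arm, dimension): mean score over all judges and tasks}."""
    buckets = defaultdict(list)
    for _task, _judge, arm, dimension, score in records:
        buckets[(arm, dimension)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


averages = average_scores(ratings)
for (arm, dimension), value in sorted(averages.items()):
    print(f"{arm:9s} {dimension:12s} {value:.1f}")
```

Pooling every judge and task into one bucket per arm and dimension gives each individual rating equal weight, which matches a plain average "across both judges and all 12 tasks" when every judge scores every task.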
The skill's value is not uniform. On frontend work it strips the most code with the smallest quality hit. On data/CLI tasks it cuts the most corners, missing edge cases that cost real points. On the pure refactor task, both agents converged on the same one-line answer, so the skill made almost no difference.
| Domain | Tasks | Code saved | Tokens saved | Quality (baseline → ponytail) |
|---|---|---|---|---|
The three frontend tasks produced real, openable pages. These are live screenshots of each build. Notice how close they often are: ponytail's versions are leaner under the hood yet still look professional — the differences are in polish and edge cases, not basic competence.
Ordered simple → complex. Each row un-blinds the A/B comparison and shows the objective metrics, the bugs the judges actually found in each arm, one judge's head-to-head summary, and the full source of both versions.