Super Mario Replica
The Motivation
I wanted to build a Super Mario–style platformer and stress-test a full spec-first, agent-driven workflow — spec-kit for requirements, Cursor vibe coding, and custom commands for self-review, testing, and memory — to see how far I could push it before having to step in.
The Problem
Agent-built projects often ship with hidden bugs and no clear next step. I wanted to see whether a structured requirement spec, plus agent commands for self-review, unit tests, end-to-end black-box testing, and a SQLite-backed step memory, could reduce that friction and eventually remove the need for a human operator.
Key Learnings
The pipeline eliminated many bugs that would otherwise have surfaced later and showed that structured specs plus the right agent commands can carry most of the load. I still had to steer at times. What remains missing looks like more commands and clearer decision points; with enough of them, a human operator may eventually become unnecessary.
I built a Super Mario–style platformer in Godot over about a week. The game itself is a replica: run-and-jump, coins, blocks, the usual beats. The real experiment was the workflow: a full requirement spec up front, a Cursor vibe-coding environment, and a custom command that let the agent self-review the code, decide the next step, run unit tests, run end-to-end black-box checks, and track its steps in a SQLite database. The repo has a detailed README for anyone who wants to run or extend it.
How I ran the experiment
I gave the project a full requirement spec using spec-kit so the agent had a clear contract before writing a single line. Then I set up the Cursor vibe-coding environment and added a command that orchestrated the loop: the agent would self-review the code, choose the next action, run unit tests, run end-to-end tests against the game as a black box, and persist progress in a SQLite database. The idea was to compress the distance between "agent wrote something" and "we know it works", and to give the agent enough context (via memory) to make reasonable next-step decisions without me in the loop every time.
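To make the memory half of that loop concrete, here is a minimal sketch of what a SQLite-backed step memory could look like. The table schema, class name, and action labels are illustrative assumptions on my part, not the project's actual implementation.

```python
import sqlite3

class StepMemory:
    """Hypothetical sketch: persists each loop step so the agent can
    consult recent history before choosing its next action."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS steps (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   action TEXT NOT NULL,   -- e.g. 'self-review', 'unit-tests'
                   outcome TEXT NOT NULL,  -- e.g. 'pass', 'fail', 'done'
                   notes TEXT,
                   created_at TEXT DEFAULT CURRENT_TIMESTAMP
               )"""
        )

    def record(self, action, outcome, notes=None):
        # Parameterized insert; one row per completed loop step.
        self.conn.execute(
            "INSERT INTO steps (action, outcome, notes) VALUES (?, ?, ?)",
            (action, outcome, notes),
        )
        self.conn.commit()

    def history(self, limit=10):
        # Most recent steps first, ready to feed into the next-step prompt.
        cur = self.conn.execute(
            "SELECT action, outcome, notes FROM steps ORDER BY id DESC LIMIT ?",
            (limit,),
        )
        return cur.fetchall()

# One imagined iteration: review, test, e2e check, each persisted.
memory = StepMemory()
memory.record("self-review", "done", "jump arc looks off near block edges")
memory.record("unit-tests", "pass")
memory.record("e2e-blackbox", "fail", "coin counter not incrementing")
print(memory.history(limit=2))
```

The point of persisting to SQLite rather than keeping state in the prompt is durability: the agent can be restarted mid-project and still reconstruct what has already been tried.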
What worked
The setup caught a lot of bugs that would have shown up later in manual play or in user reports. Having the agent run tests and e2e checks as part of its own loop meant regressions and integration issues surfaced early. The spec acted as a shared source of truth; the SQLite-backed memory meant the agent could refer to what had already been done. In practice, the combination of spec-kit, vibe coding, and these commands did eliminate much of the need for constant human intervention. It was a strong signal that the approach is viable.
Where I had to step in
It worked to a certain extent, but not end-to-end without me. There were points where the agent would have gone in circles or missed a decision that required product or design judgment. I had to steer — clarify priorities, correct course, or unblock when the command set didn't cover a scenario. The gap wasn't the agent's capability so much as the coverage of commands and decision rules. Some situations still need a human to say "do this next" or "we're done with that."
Takeaway
It was a great experiment. The pipeline proved that structured specs plus agent commands for review, testing, and memory can take you most of the way. I didn't need to babysit every change. What's left is filling in the gaps: more commands, clearer decision points, and better handling of edge cases. I'm convinced that with enough of those, the need for a human operator in the loop could eventually disappear.