How I Actually Vibe Code (and Why Prompting Is the Smallest Part of It)

By Ian Strang

Capo started because our weekly football game was descending into WhatsApp chaos and someone had to sort it out.

For years it was just spreadsheets, half-baked macros and me insisting we could "just track it properly". Over the last year it somehow turned into something that looks suspiciously like a real product. It handles payments. It isolates clubs from each other. It runs background jobs. It has native apps. It has the sort of plumbing you only notice when it breaks.

I built it in my spare time. I didn't hand-write any of the code myself.

When I tell people that, they don't usually ask what the app does. They ask what prompts I used.

There's this assumption that the trick is wording — that if you phrase things correctly, the machine turns into a senior engineer who's seen everything before and quietly does the right thing.

I understand the appeal. Prompting is visible. It's the bit you can screenshot. It feels like the craft.

It just hasn't been the main event for me.

If I had to compress the last year into one slightly uncomfortable realisation, it would be this:

Prompting is the accelerator pedal. Judgment is the job.

I didn't start there. I ended up there.

The 30% illusion

The first phase of building with AI is oddly intoxicating.

You describe a screen and it appears. You ask for a feature and it roughly works. You tell it to tidy the UI and suddenly it looks like something that could plausibly exist in the world. For a while you feel like you've hacked time.

It's very easy in that phase to believe you're nearly done.

What took me longer than I'd like to admit is realising that this early momentum is misleading, because the first chunk of work is also the most visible chunk. It's the bit people associate with "building an app": UI, pages, routes, something you can click around.

For me, that was maybe a third of the total effort. The rest was the duller, slower work of making it behave properly when users do things out of order or change their minds, when systems retry events, when networks drop, when two actions collide, and when the app has to be correct even while I'm not watching it.

In other words: the part where it stops being "works on my machine" and becomes "works for anyone".

That's where the time went. Not because it's impossibly hard, but because it's consequence-heavy. Every shortcut you take there has a way of turning into a future week you don't want.

A small side-note: the AI didn't just help with code — it also nudged me toward a stack that turned out to be very forgiving for this way of building. I ended up on Vercel, Supabase and Render and, so far, they've been solid.

The AI is a world-class yes-man

The models are astonishing at producing output. They're also extremely accommodating.

They don't say, "Are you sure?"

They don't say, "That assumption seems risky."

They don't say, "This will be painful later."

They'll implement what you ask for, and if what you ask for has gaps — as most human instructions do — they'll bridge those gaps with something that sounds plausible.

This is where my relationship with "prompting" changed.

At the beginning, I thought the trick was to get better at telling the AI what to do.

Later, I realised the real failure mode wasn't bad code. It was unmade decisions.

I'd vaguely intend a behaviour, assume it was obvious, and the AI would pick one interpretation and harden it into reality. Then, days later, I'd run into the edge case and realise I hadn't actually decided what should happen at all. The model hadn't forgotten anything. It had simply done the thing I'd allowed it to do: fill in the blanks.

Execution is now the easy part. The responsibility hasn't gone anywhere — it's just moved up a layer.

Where my time actually goes

If you watched a highlight reel of someone "vibe coding", you'd assume the work is mostly prompts and code appearing.

My reality looked more like this:

I'd spend the majority of a feature cycle debating the behaviour with a model in normal English until it stopped feeling fuzzy. Only then would I ask for a structured Markdown spec. After that, the AI would generate the implementation quite quickly, and then I'd spend a long stretch testing, cleaning up and discovering the places where reality hadn't matched my neat description.

The rough split I've found myself repeating is:

  • ~60% spec and decision work
  • ~5% actual AI code generation
  • ~35% verification, edge cases, tidying, and fixing the places where the implementation was technically correct but practically fragile

That 60% doesn't mean me typing essays alone in a room. It's mostly back-and-forth discussion with a model until the behaviour feels genuinely resolved, and then capturing it properly so it can't drift.

That 60% is, in practice, the job. It's not glamorous. It's also the bit that makes the "speed" real later, because once the behaviour is nailed down, the AI can execute without improvising.

The Project Brain

The way I cope with this is what I've come to call the "Project Brain". It's just a set of Markdown documents, but without it things start drifting surprisingly quickly.

It has three layers.

Standards are the short rules that apply everywhere — the non-negotiables. They stop the AI from quietly creating six different patterns for the same problem. They don't try to explain everything; they point. If you touch payments, load the payments spec. If you touch tenancy, load the tenancy rules.

Specs are where the app becomes real before it becomes code. They aren't one-off design documents. They're living records that become the as-built truth, including why we chose approach A instead of approach B and the gotchas discovered along the way. That "why" matters more than I expected. Without it, you find yourself re-litigating old decisions because the context has vanished.

The index is the map. Once you have more than a handful of specs, you can't remember where anything lives, and you can't afford to shove the entire document universe into every prompt. The index tells me what exists, how big it is, and when it should be included. The goal is simple: load what you need, not everything you own.
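
For a sense of what that index can look like, here's a hypothetical excerpt. The file names, sizes and loading rules are invented for illustration, not lifted from Capo's actual docs.

```markdown
## Index (invented excerpt)

| Doc                      | Size         | Load when                            |
| ------------------------ | ------------ | ------------------------------------ |
| standards/core-rules.md  | ~150 lines   | Always                               |
| specs/payments.md        | ~1,200 lines | Touching Stripe, webhooks or refunds |
| specs/tenancy.md         | ~400 lines   | Touching anything scoped to a club   |
| specs/background-jobs.md | ~600 lines   | Touching job processors or retries   |
```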

It sounds bureaucratic until you've watched a model confidently do the wrong thing because the right information was technically present — just not visible.

The LLM Council

At some point I realised I was instinctively treating the models less like a single assistant and more like a slightly argumentative committee.

I later discovered there's an actual concept and tooling around this idea — which was mildly reassuring. It turns out asking multiple models to critique and stress-test each other isn't just me being paranoid. I'm not using that tool directly; my version is much more manual. But the instinct is the same: don't let one confident answer settle the question too quickly.

I've learned not to trust a single model with big specs. Not because models are stupid, but because they're persuasive. A single model can feel "done" when it's simply settled into one interpretation.

In practice, this means I start in plain English, chatting with one model about the feature the way I'd talk it through with a human: what we're trying to achieve, what would feel fair, what could go wrong, where I'm uncertain. I'll go back and forth for a while until it feels like we've genuinely argued the behaviour into shape rather than just described a happy path.

Only then do I ask for something formal: "Right — turn that into a structured Markdown spec." I paste that into Cursor, and that's where the more adversarial part begins.
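
For a sense of what "structured" means here, the skeleton of such a spec might look something like this. It's an invented outline, not one of Capo's actual specs.

```markdown
# Spec: Match Payments (invented example)

## Behaviour
Who pays, when, and what happens if they don't.

## Edge cases
Webhook arrives twice; player cancels after paying; refund fails.

## Decisions and why
Chose approach A over B because… (kept so we don't re-litigate it later).

## Gotchas
Things discovered during the build that the next change needs to know.
```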

From there, I use the model in my coding environment to critique that spec against the actual codebase and the existing Project Brain docs. Then I'll bring the revised version back to the first model and ask it to attack it again: what's missing, what breaks, what edge cases haven't been nailed down.

When merging feedback, I give the coding model very specific instructions: decide whether feedback is real risk or spec-creep, don't bloat the spec with revision history, and if there's a real-world decision involved, ask me instead of guessing.

And then I repeat. Again and again.

For big features — payments is the obvious example — that can take days. Sometimes the better part of a week.

A slightly odd detail: if a model stops finding issues, I'll open a completely fresh chat window and ask again. For reasons I don't fully understand, a new session often surfaces a different class of critique.

None of this is thrilling. It is, however, the difference between a feature that works in a demo and one that keeps working once the world starts poking it.

From "works for me" to "works for anyone"

A lot of "AI can build apps now" content focuses on visible progress. The hard bit, for me at least, began when I realised I was making promises.

Multi-tenancy is a promise: one club should never see another club's data. Not once, not briefly, not because someone forgot to scope a query. Retrofitting that guarantee touches far more than you expect — tables, routes, background jobs, and all the little assumptions that were harmless when there was only one tenant.
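
To make "scope a query" concrete, here's a minimal TypeScript sketch. It assumes a Supabase-style client and a hypothetical matches table with a club_id column; it isn't Capo's real schema, and in practice you'd also enforce the rule in the database with row-level security rather than trusting application code alone.

```typescript
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

// Hypothetical example: every read is filtered by the caller's club.
// Forgetting the .eq('club_id', ...) on even one query is exactly the
// kind of silent leak multi-tenancy has to rule out.
async function listMatches(clubId: string) {
  const { data, error } = await supabase
    .from('matches')
    .select('*')
    .eq('club_id', clubId); // tenant scoping lives here

  if (error) throw error;
  return data;
}
```

The stronger version pushes the same rule into the database, so a missed filter fails closed instead of quietly returning another club's rows.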

Payments are another set of promises. "Add Stripe" sounds like a feature. In reality it's a long conversation with failure states: webhook retries, money moving but the UI not updating, late cancellations, refund failures, multiple people trying to grab the last slot.

Capo ended up handling proper platform-style payment flows with webhook integrity and recovery logic because the "nice" version of payments is the version that breaks first.
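
For illustration, here's a rough TypeScript sketch of what "webhook integrity" tends to mean in practice, using the Stripe Node library. The processed-events store and the checkout handling are hypothetical placeholders, not Capo's actual handler.

```typescript
import Stripe from 'stripe';

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!);

// Placeholder for a real store (e.g. a processed_events table);
// in-memory here only to keep the sketch self-contained.
const processedEvents = new Set<string>();

export async function handleWebhook(rawBody: string, signature: string) {
  // 1. Verify the payload really came from Stripe.
  const event = stripe.webhooks.constructEvent(
    rawBody,
    signature,
    process.env.STRIPE_WEBHOOK_SECRET!
  );

  // 2. Stripe retries webhooks, so the same event can arrive more than once.
  //    Idempotency means acting on it exactly once.
  if (processedEvents.has(event.id)) return;

  switch (event.type) {
    case 'checkout.session.completed':
      // ...update the booking/payment state here, ideally in the same
      // transaction that records the event as processed...
      break;
    // ...other event types...
  }

  processedEvents.add(event.id);
}
```

The point of the sketch is the shape: verify the signature, refuse to process the same event twice, and only then touch money or state.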

And then there's auth.

Auth was the gift that kept on giving — if that gift was a dog turd that randomly reappears around the house. You think you've dealt with it, then you step on it again: white screens, missing tenant IDs, something subtle and infuriating.

This is the pattern: the UI is the fun third. The other two-thirds are a slow process of removing ambiguity from the system.

Yes, anyone can build anything now (and that's the point)

I do genuinely believe the "anyone can build now" idea.

I'm not saying that to be provocative. I'm saying it because I did it: a non-coder, no hand-written code, and yet a multi-tenant SaaS with payments and native apps exists at the end of it.

But "possible" and "easy" are very different things.

Right now, what's made it possible for me isn't secret prompts. It's the unglamorous discipline of debating behaviour properly, revising it relentlessly, and maintaining enough structured context that the AI doesn't have to guess.

Which leads to the other part of my view — the part that's easy to misstate.

We're in a temporary period where you still have to do that manual structuring yourself. The tools haven't abstracted it away yet. But they're improving quickly enough that a lot of this scaffolding is likely to become less visible over time.

At the moment, you still need a Project Brain. You still need a council. You still need to do the behavioural thinking explicitly.

I don't think that will remain true in this exact form.

The timing point, properly stated

This is the nuance I care about.

My argument isn't "ship faster" or "don't overthink prompts".

It's that we're living in a short window where building is newly accessible but still requires manual structure and stamina. If you're willing to do that structured work now, you can build things that would historically have required a team.

That advantage won't stay in its current form forever.

The build side will continue to be democratised. More one-click paths will appear. More of the annoying plumbing will be packaged up. Some of the Markdown-and-council scaffolding will disappear behind interfaces.

So if you have an idea, the reason to build now isn't because it's effortless. It's because you need to reach distribution while the playing field is still uneven.

And yes, distribution itself will increasingly be assisted and automated over time. Not in a magic way — just in the same direction of travel. Waiting for a perfect moment doesn't really work, because the ground keeps shifting under you.

So the strategy, for me, is not dramatic. It's fairly plain: build now, accept the manual discipline, use structure to survive it, and push towards real users while the barrier is still moving.

Practical takeaways (without turning into a listicle)

If someone asked what actually helped, it would be roughly this.

I realised prompts weren't really the craft. They were just the interface. The craft was deciding behaviour clearly.

I debated features with a model in plain English before formalising them.

I made models review each other because a single model is too comfortable settling.

I kept a living Project Brain because without written "why", you end up paying for the same decision twice.

And I accepted that the boring backend work is most of the build — especially if you want "works for anyone" instead of "works for me".

That's it. No magic words. Just a lot of slightly tedious clarity.

Closing

I didn't build Capo because I became a programmer overnight. I built it because the gate is gone.

But the disappearance of the gate doesn't remove the need for judgment. If anything, it makes your decisions matter more. The machine will happily accelerate in any direction you point it. It won't ask whether you've thought through the junction.

Prompting is the accelerator pedal. Judgment is the job.

Appendix: The Receipts

Where the time actually went

  • ~60% spec writing and revision
  • ~5% AI generating code
  • ~35% testing, tidying, edge cases, aligning implementation with behaviour

Scale (end state)

  • ~106k–125k lines of generated code (TypeScript-heavy)
  • 149 API routes
  • 60+ database tables
  • 6 background job processors
  • Native iOS + Android builds (single codebase)
  • 74+ specs / docs (~45k–51k lines)
  • 4 major architectural pivots
  • Hand-written lines of code: 0

The "grown-up software" bits that consumed time

  • Multi-tenancy guarantees (isolation across clubs)
  • Payment platform complexity (Stripe Connect-style flows)
  • Webhook integrity (idempotent processing)
  • Concurrency / race-condition handling
  • Recovery paths (refund failures, retries, background processing)
  • Performance work (eliminating request storms)
  • Staging and realistic webhook testing

Core tools I relied on

  • Cursor as the coding environment
  • Vercel, Supabase and Render as the stack
  • Stripe for payments

About the project

Capo started as a way to organise a weekly football game and grew into the app that does it properly: stats your mates actually care about, RSVPs and payments that don't chase you, and AI-balanced teams so the game stays fair. If you run a kickabout and want a bit less chaos and a bit more banter, see how Capo works.