AI Cost Discipline - Playbook

The headline

The entire AI bill to run this product, from the first API call to today, is about $27. This month it is around $13. Roughly 43 percent of the input tokens are served from cache. A little under half of all the spend goes to one feature.

I know all of that to the cent because every AI call this product has ever made is logged, with its tokens and its computed cost, and there is an admin page that adds it up. That sentence is the whole module: you cannot manage what you do not measure, so the first move is to instrument everything.

Why this is a feature, not an afterthought

AI features have a quiet failure mode: they work great in the demo and then the bill arrives. The naive way to ship AI (call the biggest model, send the whole context every time, do it live on every page load) can cost ten times the disciplined way for output a user cannot tell apart. On this project the difference is roughly thirty cents versus three dollars per user per month. At the scale of a solo founder paying out of pocket, that gap is the difference between a feature that is sustainable and one that quietly bankrupts the free tier.

So cost is not something you optimize later. It is a design constraint you build in from the first call, the same way you would build in auth.

The four rules that keep it cheap

1. Cheap model by default. Use the small, fast model for everything: scoring, drafting, classification, extraction. Upgrade to a bigger model only for the specific job where you can measure that the small one is worse, and write down the comparison when you do. Reaching for the biggest model out of habit is the single most common way people burn their limits by mid-afternoon. Most work does not need it.

2. Cache the system prompt on every single call. The instructions are identical across calls; only the per-item data changes. Marking the system block as cacheable means you pay full price for it once and about a tenth of the price after that. This is the biggest lever there is, and it is why 43 percent of this product's input tokens cost almost nothing. There is no call that should skip it.

3. Batch anything the user will not see for five minutes. Nightly re-scoring, the daily digest, bulk enrichment: none of it is urgent, so none of it runs live. The batch path is about half the price of real-time. Real-time is reserved for the things a user is actually waiting on.

4. Do not call the model when nothing changed. Hash the input context; if it matches the last call, skip it entirely and reuse the cached result. Never score a sample contact, a contact with no activity in two months, or a deal that is already lost. The cheapest API call is the one you do not make.

The safety net: assume the rules will be broken

Discipline is necessary but not sufficient, because a bug or a runaway loop will eventually ignore your rules. So there is a hard floor under the whole thing:

A database-backed spend pool with a hard monthly cap on free-tier AI spend. Every free-tier call checks it before running.
A probabilistic degrade: past 80 percent of the cap, calls start falling back to the rule-based path at random, so you glide into the limit instead of slamming into it.
A killswitch environment variable that disables all free-tier AI instantly, with no deploy.
Per-feature, per-workspace monthly caps (for example, a fixed number of on-demand scores).
A daily spend monitor that alerts at one threshold and automatically flips the killswitch at a higher one.

The point of the net is that the worst case is bounded. On this product the free-tier pool has used five cents of its hundred-dollar ceiling, which means the net has never been close to firing, which is exactly how you want it: built, tested, and idle.

Knowing where the money goes

Because every call is logged, the spend breaks down by feature, and the breakdown is itself a lesson. On this product, deal scoring is about half of all spend; AI drafts are the next chunk; and the marquee features people get most excited about (the photo and voice features) cost literal pennies. By tier, essentially all of it is paid-plan usage; free-tier spend is a rounding error.

That visibility is what lets you optimize the right thing. If you wanted to cut the bill, you would look at scoring first, not at the feature that sounds expensive. Without the log you would be guessing, and you would guess wrong (see the "read the evidence" habit from the verification module).

The other half of cost: flag the real-world bill at the decision

Not all cost is tokens. The most expensive choices are infrastructure choices, and they do not show up in the code. The rule here is to flag any real-world cost over about $100 in the same breath as the decision that creates it, with the alternatives, so it is a conscious call and not a surprise on a statement.

The clearest example on this project: choosing where to store walkthrough audio and photos. One option was about $11 a month. The obvious-looking default was about $1,188 a month at the same scale, because of egress pricing. Same feature, a hundred-fold cost difference, decided by a line nobody would notice in a diff. Catching that at decision time saved more than a thousand dollars a month, forever. The same discipline applies to restricted OAuth scopes that trigger paid security audits, to hosting-tier upgrades, and to any third-party API with a real monthly minimum.

The honest caveats

This matters because you are paying. At a funded company someone else eats the bill and this discipline quietly atrophies. Solo, it is existential, which is why it lives in the architecture and not in a wiki nobody reads.
The numbers are small because the usage is small. The point is not that $27 is impressive on its own. The point is that the discipline is what lets the bill stay proportional as usage grows, instead of exploding. Cheap-at-small that becomes expensive-at-scale is a trap; this is built to stay cheap.

Exercise

Instrument first. Log every AI call with its token counts and computed cost before you ship a second feature. Build the one page that sums it.
Default to the cheapest model and cache the system prompt on every call. Verify the cache is actually hitting (you will see it in your logs).
Put a hard cap and a killswitch behind an env var before you open the feature to anyone. Test that the killswitch works.
Read your per-feature breakdown after a week. Optimize the biggest line, not the scariest-sounding one.
Write down the one infrastructure decision in your build that could have cost more than $100 a month, and what you chose instead.