The operating model · Chapter 4

Build-Pass Is Not Ship-Pass


The rule

A green build is not a working feature. The type-checker passing, the tests passing, and CI showing a checkmark tell you the code is internally consistent. They do not tell you the thing works when a real person clicks the button. Those are different claims, and the gap between them is where most of my production incidents lived.

Say it as a habit you never break: I do not call a feature done until I have used it myself, signed in, as a real user, on the device the user is on.

Why the gap exists: the model builds blind

The AI writes code it can never see running. It cannot sign in. It cannot see your Vercel dashboard. It cannot tap the button on a phone. It reasons about the code, and its reasoning is usually correct, which is exactly what makes this dangerous: the failure modes are the ones that look fine on paper.

Three flavors of "passed the build, broke in production," all from this project:

1. The hooks-ordering crash. A client component had an early return null placed before a useEffect. React requires hooks to run in the same order every render; the early return changed the hook count between renders and crashed every workspace page in production. The type-checker did not catch it. The unit tests did not catch it. The build was green. Signing in and looking at the page caught it, in one second.

2. "Saved" while saving nothing. The public quote-signing page called an API route that was not on the public allowlist, so the auth middleware returned a 200 with a redirect HTML body instead of running the handler. The page showed the customer "Signed. Thank you." while the signature persisted nowhere. Every automated check was green. Only signing a real quote and then looking for it in the database revealed that nothing had happened.

3. Defined, but silently doing nothing. Two automation step types (add_tag and start_sequence) existed in the schema and the editor UI but were no-ops at runtime: you could build the automation, it would report success, and nothing happened. The code compiled. The feature did not exist.
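The third failure can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the `StepResult` shape, the `send_email` step, and the `findSilentSteps` helper are all assumptions made up for the example. Only the `add_tag` and `start_sequence` names come from the incident.

```typescript
// Hypothetical step types. add_tag and start_sequence are named in the text;
// send_email is invented here for contrast.
type StepType = "send_email" | "add_tag" | "start_sequence";

// Assumed result shape: a step reports success and, separately, what it did.
type StepResult = { ok: boolean; effect: string | null };
type Executor = (payload: Record<string, string>) => StepResult;

// Record<StepType, Executor> turns a missing key into a compile error,
// but the compiler cannot tell a real executor from an empty one.
const executors: Record<StepType, Executor> = {
  send_email: (p) => ({ ok: true, effect: `email queued for ${p["to"]}` }),
  add_tag: (p) => ({ ok: true, effect: `tag ${p["tag"]} added` }),
  // This stub type-checks and reports success while doing nothing.
  start_sequence: () => ({ ok: true, effect: null }),
};

// Runtime check: any step that claims success must also name its effect.
function findSilentSteps(
  registry: Record<StepType, Executor>,
  samples: Record<StepType, Record<string, string>>,
): StepType[] {
  return (Object.keys(registry) as StepType[]).filter((type) => {
    const result = registry[type](samples[type]);
    return result.ok && result.effect === null;
  });
}

const silent = findSilentSteps(executors, {
  send_email: { to: "a@example.com" },
  add_tag: { tag: "vip" },
  start_sequence: { sequenceId: "s1" },
});
// silent is ["start_sequence"]
```

The design point is that the type system guarantees an executor exists for every step type, and a unit test over the registry has to guarantee that each one does something.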

The pattern across all three: the code was consistent, so every machine check passed, and the behavior was absent or wrong, which only a human exercising the real path could see.
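For the second failure, one guard that would have turned the silent success into a loud error is to refuse any response that is not the JSON you expected, even when the status is 200. A minimal sketch follows. The `SignResult` shape and the `signatureId` field are hypothetical, since the source does not describe the real payload.

```typescript
// Hypothetical success payload for a signing endpoint.
interface SignResult {
  signatureId: string;
}

// Takes plain values so the check can run without a network.
function parseSignResponse(
  status: number,
  contentType: string | null,
  bodyText: string,
): SignResult {
  if (status < 200 || status >= 300) {
    throw new Error(`HTTP ${status}`);
  }
  // A redirect-to-login page arrives as HTML, often with a 200 status.
  if (!(contentType ?? "").includes("application/json")) {
    throw new Error(`Expected JSON, got "${contentType ?? "no content-type"}"`);
  }
  const body = JSON.parse(bodyText) as { signatureId?: unknown } | null;
  if (body === null || typeof body.signatureId !== "string") {
    throw new Error("Response is missing signatureId");
  }
  return { signatureId: body.signatureId };
}
```

In a page this would be called with `res.status`, `res.headers.get("content-type")`, and `await res.text()` from a `fetch` response, and the "Signed. Thank you." message would render only after it returns. A guard like this narrows the gap but does not replace looking for the row in the database.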

The real-device tax

A whole class of bugs only appears on an actual phone, which is most of where this product's users are:

  • An env var pasted into a dashboard field carried a trailing newline, which became %0A in a storage upload URL and failed the request. "Load failed," on the phone, on the first real walkthrough.
  • A required API key was simply never set in the production environment. The pass-off said it was required; nobody verified it was set. "Transcription failed."
  • A controlled money input reformatted on every keystroke, so typing $4500 came back as $4.00 because the cents-to-dollars round-trip rounded a half-typed string.
  • A storage bucket's CORS allowed only the production domain, so photo uploads from every preview deploy were silently blocked.
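The first two bullets are cheap to guard against at startup. The sketch below is illustrative only: the bucket value, the `storage.example.com` host, the `readEnv` helper, and the variable names are assumptions, and the source does not say how the real URL was built. It shows how a trailing newline becomes `%0A` when a value is URL-encoded, and a reader that rejects missing or whitespace-padded values.

```typescript
// Reproduce the symptom: a pasted value with a trailing newline.
const pasted = "my-bucket\n"; // hypothetical value
const brokenUrl = `https://storage.example.com/${encodeURIComponent(pasted)}/upload`;
// brokenUrl is "https://storage.example.com/my-bucket%0A/upload"

// Fail loudly at startup instead of failing quietly on a phone.
function readEnv(env: Record<string, string | undefined>, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required env var: ${name}`);
  }
  if (value !== value.trim()) {
    throw new Error(`Env var ${name} has leading or trailing whitespace`);
  }
  return value;
}
```

In a Node process this would be called as `readEnv(process.env, "SOME_KEY")` for each required variable. It only proves the value is present and clean in the environment where it runs, so it has to run in production, not just on a laptop.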

None of these are code-logic bugs the model could have caught. They are real-world, real-device, real-environment bugs. The lesson: verify on the device class your users actually use, at the width they use it (375px), as the tier they are on.
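The money-input bug is easy to reproduce in isolation once you know to look for it. This is a simplified stand-in, not the project's code: it uses `toFixed(2)` on every change where the original used a cents-to-dollars round-trip, and it assumes the caret sits at the end of the field. The function names are hypothetical.

```typescript
// Buggy pattern: reformat the value to two decimals on every change.
function reformatOnChange(raw: string): string {
  return Number(raw).toFixed(2);
}

// Simulate typing one character at a time with the caret at the end.
function typeInto(onChange: (raw: string) => string, keys: string): string {
  let value = "";
  for (const key of keys.split("")) {
    value = onChange(value + key);
  }
  return value;
}

// typeInto(reformatOnChange, "4500") returns "4.00":
// "4" becomes "4.00", then "4.005", "4.000", "4.000" each collapse back to "4.00".

// Safer pattern: keep the raw string while typing and convert once,
// on blur or submit, without going through floating point.
function dollarsToCents(raw: string): number | null {
  const match = /^\$?(\d+)(?:\.(\d{1,2}))?$/.exec(raw.trim());
  if (!match) return null;
  const dollars = Number(match[1]);
  const cents = Number(((match[2] ?? "") + "00").slice(0, 2));
  return dollars * 100 + cents;
}
```

A unit test on `dollarsToCents` covers the conversion. It does not cover what the field feels like under a thumb, which is why the phone pass still matters.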

The verification ladder

Match the rigor to what a mistake costs (this is the approval ladder from the governance module, applied to checking your own work):

  1. Type-check and build. Necessary, fast, catches a real class of errors. Never sufficient.
  2. Unit tests where they earn it. The business-logic core: the cost budget, the scoring fallbacks, the automation step executors, the MCP tool handlers, the permission checks. Not React components, not copy. A permission bug is silent and cross-tenant; that gets a test. A button color does not.
  3. Manual smoke, signed in, on a preview, on a phone. This is the step that catches all three failures above. It is not optional, and it is the one the model cannot do for you.
  4. Ground-truth tools when something is wrong: Sentry traces, deploy logs, live network capture, read-only database queries. Not reasoning from the code.

The traps that fooled me

  • Trusting the wrong green light. A "Vercel Preview Comments" status showed success while the actual production deploy had failed an em-dash gate, so two PRs sat merged in main but undeployed while production silently ran old code. Lesson: trust the deployment status, not a comment that happens to have a checkmark.
  • Testing the wrong URL. More than once I tried a fix, concluded it failed, and was wrong, because I was looking at a preview URL while thinking it was production, or the reverse. Know which URL you are on before you judge a result.
  • Tests that do not run in the build. The unit tests are not part of next build, so three red budget tests sat unnoticed because the build was green. A test you do not run is a comment.
  • The harness cannot see mobile. The build environment could not narrow the viewport, so mobile could not be click-verified in-session. That is precisely why a real-device pass by a human mattered, repeatedly.
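The "tests that do not run in the build" trap has a mechanical fix: make one script the gate, and have it run every check in order and stop at the first failure. A minimal sketch follows, with the command runner injected so the logic can be tested without spawning anything. The gate list is an assumption; substitute whatever scripts the project actually uses.

```typescript
type Gate = { name: string; command: string };
// Runs a shell command and returns its exit code.
type RunCommand = (command: string) => number;

// Hypothetical gate list for a Next.js project.
const gates: Gate[] = [
  { name: "type-check", command: "npx tsc --noEmit" },
  { name: "unit tests", command: "npm test" },
  { name: "build", command: "npx next build" },
];

interface GateReport {
  passed: boolean;
  failedAt: string | null;
  ran: string[];
}

// Run gates in order; a non-zero exit code stops the run.
function runGates(list: Gate[], run: RunCommand): GateReport {
  const ran: string[] = [];
  for (const gate of list) {
    ran.push(gate.name);
    if (run(gate.command) !== 0) {
      return { passed: false, failedAt: gate.name, ran };
    }
  }
  return { passed: true, failedAt: null, ran };
}
```

In CI, `run` could wrap Node's `spawnSync(command, { shell: true, stdio: "inherit" })` and return its `status`, treating `null` as a failure. With this in place, red tests fail the pipeline before the build step runs.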

Read the evidence, do not guess

Several of these incidents were misdiagnosed two or three times before the real cause surfaced. The quote-signing failure got blamed on Gmail, on a duplicate-quote bug, and on a Clerk route, all wrong, before anyone read the network response and saw the 200-HTML-instead-of-JSON. A console error that looked alarming turned out to be browser-extension noise that never appeared in Sentry.

The discipline: when something breaks, go to the evidence first. Sentry, logs, the actual network request, a read-only query against the real data. Reasoning from the code is how you generate three confident wrong answers in a row. And when the evidence is visual, screenshot it and hand it to the model rather than describing it (the multimodal habit from the build-loop module): a picture of the broken thing is worth three rounds of "it looks like the cost is cut off."

The honest limit

You cannot manually smoke everything, and you should not try. The scalable half of verification is the automated gates (the CI checks that fail the build). The manual half is reserved for the visual and interactive class that machines cannot see: does the page render, does the button do the thing, does it work at 375px, did the data actually persist. Spend your manual attention where a mistake is irreversible or user-facing, and let the gates carry the rest.

Exercise

Ship something with a passing build that breaks when you use it, on purpose, so you feel the gap once and never forget it:

  1. Take any small feature. Get the build green.
  2. Before merging, open the preview, sign in, and walk the actual user path at 375px.
  3. Write down what you find that the green build did not. There is almost always something.
  4. Then make one piece of it unforgettable: add a test for the logic, or a CI gate for the class of bug, so the next person does not rediscover it the hard way (the path from convention, to miss, to gate).

This is one chapter of the operating playbook.
