What went wrong, what it cost, and what systems were built to prevent it from happening again.
Part of the ForkIt! Case Study.
Bugs that shipped. Builds that broke. Things that should have been caught before anyone saw them.
Apple rejected the v2 submission for missing subscription compliance. Auto-renewal disclosure text was missing. There was no Restore Purchases button. Terms of Service and Privacy Policy links were not accessible from the paywall screen. No LLM tutorial, no human tutorial, no documentation of any kind mentioned these requirements.
What it cost: Two days of rework. A full resubmission cycle. The realization that the LLM had confidently built an IAP flow that would never pass review.
What fixed it: Added all three compliance elements. Created a pre-submission checklist. The 31-review suite now includes Review 16 (IAP/Subscription Compliance), which checks for every requirement Apple and Google enforce.
During a code cleanup, Claude renamed API endpoints for consistency. Clean code. Better naming. Deployed to Vercel. The problem: every app already installed on users' phones was still calling the old endpoint names. Live users started getting errors. The old URLs returned 404s.
What it cost: Live users hit errors until backward-compatibility rewrites were added. Those rewrites still exist in the Vercel config today as tech debt (#115).
What fixed it: Added rewrites in vercel.json to map old paths to new endpoints. Established a rule: never rename a live API endpoint without a redirect. The review suite now includes Review 21 (API Endpoint Hygiene) and Review 30 (Migration and Upgrade Safety).
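The compatibility shim takes only a few lines in vercel.json. The paths below are invented placeholders to show the shape, not the app's actual endpoint names:

```json
{
  "rewrites": [
    { "source": "/api/getRestaurants", "destination": "/api/restaurants/search" },
    { "source": "/api/saveFavorite", "destination": "/api/favorites/save" }
  ]
}
```

Old installed clients keep hitting the old paths; Vercel silently serves the new handlers. Cheap to add, but it is permanent tech debt as long as any old client remains in the wild.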
Android users with non-Chrome default browsers (DuckDuckGo, Firefox) experienced hangs during Clerk SSO sign-in. The app froze completely. Chrome Custom Tabs opened in the wrong browser, which couldn't handle the redirect back. The main thread blocked. Android killed the app.
What it cost: Hours of debugging a platform-level limitation that has no real fix. Users with non-Chrome defaults still see friction.
What fixed it: Added a hint toast on Android sign-in failure explaining the Chrome requirement. Documented it as a known limitation in the project guide. This is a platform constraint, not something app code can solve. The fix was transparency, not engineering.
The stakeholder engagement was right. The preparation was wrong.
Work trip to Annapolis. Three colleagues who had heard about the app wanted to try it. Perfect opportunity: real users, real context, at a restaurant, ready to pick a place for dinner. Opened the app. It didn't work.
The backend had been updated during development without a matching app build. The dev build on the phone was calling endpoints that had changed. The app couldn't fetch restaurants. At a restaurant. In front of three potential users.
What it cost: Three users. Not hypothetical users. Three real people who were interested, present, and ready to try it. They watched it fail and moved on. They never came back to it.
Every demo is a stakeholder touchpoint. This one failed because the build wasn't tested on the device, in the environment, within the hour. The engagement instinct was right. The preparation discipline didn't exist yet.
What fixed it: A demo prep protocol. Before any stakeholder touchpoint, verify the build is production (not dev), test on the actual device, confirm the backend matches, and have a fallback plan if it crashes. This became part of the project's pre-push checklist.
This is the centerpiece. Six failures with the same root cause: the LLM gave a confident answer, I trusted it, and the answer was wrong. Each time, the cost compounded because the wrong answer became the basis for the next decision.
Asked Claude how many EAS builds were included on the Starter plan. Answer: "Unlimited!" Planned accordingly. Weeks later, asked again. Claude read its own previous memory file. Same answer: "Unlimited!" Felt confirmed.
Then an email arrived from Expo: 80% of build credits consumed with two weeks left in the billing cycle.
The LLM had been wrong the first time. Then it cited itself as a source the second time. A hallucination, reinforced by its own memory, presented as verified information.
What it cost: Scrambled to cancel unnecessary builds. Established a rule to cancel superseded builds immediately. Nearly burned through a paid plan's allocation on builds that were never used.
What fixed it: A new rule: never trust LLM pricing knowledge. Always verify vendor pricing pages directly before making cost-driven decisions. Added to project memory as a permanent instruction.
Claude stated that Google Maps Platform included a $200/month free credit. Architecture decisions, cost projections, and the free tier design were all built on this assumption. The entire "as free as possible" pricing model assumed that credit existed.
It didn't. Google had eliminated the $200 monthly credit over a year before the project started. The LLM's training data was stale. Every cost calculation based on that number was wrong.
What it cost: Rearchitecting the API call strategy. Adding pool caching, client-side filtering, and aggressive deduplication to reduce API spend. The free tier limits (20 searches/month) exist partly because the expected credit never materialized.
What fixed it: Verified Google's actual pricing page. Rebuilt cost projections from real numbers. Added "verify vendor pricing" as a permanent memory instruction. Every cost-driven decision now requires a primary source, not an LLM assertion.
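The pool-caching idea can be sketched in a few lines. Everything here is illustrative (the `Place` shape, the fetcher, area keys, the 15-minute TTL), not ForkIt!'s actual code; the point is that repeated spins reuse one paid Places API call instead of issuing a new one each time:

```typescript
// Illustrative sketch of pool caching + deduplication to cut API spend.
interface Place {
  id: string;
  name: string;
}

type Fetcher = (areaKey: string) => Promise<Place[]>;

class PoolCache {
  private pools = new Map<string, { places: Place[]; fetchedAt: number }>();

  constructor(private fetcher: Fetcher, private ttlMs = 15 * 60 * 1000) {}

  // Return a deduplicated pool for the area, paying for an API call only
  // when the cached pool is missing or stale.
  async getPool(areaKey: string, now = Date.now()): Promise<Place[]> {
    const cached = this.pools.get(areaKey);
    if (cached && now - cached.fetchedAt < this.ttlMs) {
      return cached.places;
    }
    const raw = await this.fetcher(areaKey);
    const seen = new Set<string>();
    const places: Place[] = [];
    for (const p of raw) {
      if (!seen.has(p.id)) {
        seen.add(p.id); // drop duplicate place IDs from overlapping searches
        places.push(p);
      }
    }
    this.pools.set(areaKey, { places, fetchedAt: now });
    return places;
  }
}
```

Client-side filtering (cuisine, exclude list) then runs against the cached pool, which costs nothing.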
The backend API had no rate limiting or origin checking for weeks. No user data was at risk (the search endpoint doesn't store or transmit personal data), but anyone who discovered the URL could have run up the Google Places API bill on my account.
The LLM built it. The LLM didn't flag the gap. I didn't know to check. It was discovered during a code review, not because anyone exploited it.
What it cost: Weeks of financial exposure. Nobody found it. The risk was to my billing account, not to user data.
What fixed it: Added rate limiting (30 req/min per IP), origin checking, and security middleware. The review suite includes Review 7 (Security), Review 18 (Operational Readiness), and Review 21 (API Endpoint Hygiene). Every endpoint now validates before processing.
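The two checks can be sketched as a single gate that runs before any Places API call. This is an illustration, not the actual middleware: the allowed origin is an assumed placeholder, and a fixed window stands in for whatever algorithm the real limiter uses:

```typescript
// Illustrative gate: fixed-window limit of 30 requests/minute per IP,
// plus an origin allowlist for browser traffic.
interface ApiRequest {
  ip: string;
  origin?: string; // native app requests may carry no Origin header
}

const WINDOW_MS = 60_000;
const MAX_PER_WINDOW = 30;
const ALLOWED_ORIGINS = new Set(["https://forkit.example.com"]); // assumed value

const hits = new Map<string, { windowStart: number; count: number }>();

function checkRequest(req: ApiRequest, now = Date.now()): { ok: boolean; status?: number } {
  // Browser requests from unknown origins are rejected outright.
  if (req.origin && !ALLOWED_ORIGINS.has(req.origin)) {
    return { ok: false, status: 403 };
  }
  // Fixed-window counter per IP; the window resets after a minute.
  const entry = hits.get(req.ip);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    hits.set(req.ip, { windowStart: now, count: 1 });
    return { ok: true };
  }
  if (entry.count >= MAX_PER_WINDOW) {
    return { ok: false, status: 429 }; // Too Many Requests
  }
  entry.count += 1;
  return { ok: true };
}
```

Even this naive version would have capped the worst-case bill from a discovered URL at 30 paid calls per minute per attacker IP instead of unlimited.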
Separate from the billing exposure: API endpoints that touch user data (favorites, history, sync) accepted a user ID in the request body but never verified it belonged to the person making the request. Someone could have read or written another user's data by guessing their ID.
What it cost: Discovered during a manual code review session. Patched immediately. No exploitation occurred. But the gap was real: user data endpoints without identity verification.
What fixed it: Added Clerk JWT verification on all protected endpoints. Established Review 19 (Auth and Identity) in the review suite. Auth is now verified server-side on every request that touches user data.
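The shape of the fix is worth spelling out. The token verifier below is a stand-in for Clerk's server-side JWT verification, not its real API; the part that closed the hole is the comparison at the end:

```typescript
// Illustrative ownership check for endpoints that touch user data.
type TokenVerifier = (token: string) => { userId: string } | null;

interface ProtectedRequest {
  authorization?: string; // e.g. "Bearer <session token>"
  body: { userId: string };
}

function authorize(
  req: ProtectedRequest,
  verify: TokenVerifier
): { ok: boolean; status?: number } {
  const token = req.authorization?.replace(/^Bearer /, "");
  if (!token) {
    return { ok: false, status: 401 }; // no credentials at all
  }
  const session = verify(token);
  if (!session) {
    return { ok: false, status: 401 }; // invalid or expired token
  }
  // The actual fix: a client-supplied user ID is honored only when it
  // matches the identity proven by the verified token.
  if (session.userId !== req.body.userId) {
    return { ok: false, status: 403 };
  }
  return { ok: true };
}
```

Before the patch, the equivalent code skipped straight from "a user ID is present in the body" to reading or writing that user's data.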
Claude picked up stray files from a different project (HabitCoach, a blue and gold themed app) and built the entire onboarding tour in the wrong brand colors. The tour was fully functional, well-structured, and completely off-brand. Blue and gold instead of orange and teal.
What it cost: Multiple design reviews missed it. I didn't catch it until later, when the gold "felt off." The LLM cross-contaminated two projects, and the human reviewing it didn't notice because the tour worked. Functional correctness masked visual wrongness.
What fixed it: Rebuilt the tour in the correct brand colors. Established the color theory doc (orange = problem, teal = solution) and the brand style guide in CLAUDE.md so the LLM has an authoritative reference. Review 22 (Brand Continuity) now checks every screen for theme consistency.
LLMs will say "I can't do that" or "you'll need to do this manually" about things they absolutely can do. The first time it happened, I accepted it and spent time on manual work that wasn't necessary. It kept happening. Each time, the cost was small — a few minutes here, a workaround there. But it accumulated.
What it cost: No dramatic failure. No outage. Just accumulated time lost to unnecessary manual work, spread across dozens of interactions. The kind of cost that doesn't show up in a postmortem because no single instance was big enough to flag.
What fixed it: A permanent instruction in the project memory: don't say "I can't" without trying workarounds first. Push back, ask to try, or rephrase. The lesson: don't accept the first "no" from an LLM. The default is surrender, not effort.
The pattern across all six: the LLM gave a confident output. I had no framework to evaluate it. The output became infrastructure. When it turned out to be wrong, the fix was expensive because decisions had been built on top of it.
Not code bugs. Judgment bugs. Decisions that cost time because the framework for evaluating them didn't exist yet.
Weeks of design work on features driven by feedback from a mom group on social media. The feedback was enthusiastic. The problem: the people giving it weren't users. They were repeating what sounded interesting, filtered through their own context, with no connection to the actual use case.
The features they described (social check-ins, recipe integration, gamification) sounded reasonable in isolation. They were completely wrong for the product. The app picks a random restaurant. It doesn't need a social feed.
What it cost: Weeks of design iteration on the wrong mental model. Time that could have been spent on features actual users needed (like the exclude filter, or walk mode).
What fixed it: Built a feedback taxonomy with five types (behavioral, voiced pain, feature requests, ambient noise, market signal). The mom group feedback was genuine — the mistake was in how I weighted it. The signal-to-noise rubric now requires decomposing every request into the underlying need before evaluation.
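The taxonomy fits in a small data structure. The numeric weights below are invented for illustration, not the rubric's actual numbers; the rule that does come from the rubric is that an undecomposed request cannot be scored at all:

```typescript
// Sketch of the five-type feedback taxonomy as data.
type FeedbackType =
  | "behavioral"
  | "voiced-pain"
  | "feature-request"
  | "ambient-noise"
  | "market-signal";

interface Feedback {
  source: string;
  type: FeedbackType;
  request: string;
  underlyingNeed?: string; // filled in only after decomposition
}

// Illustrative weighting: observed behavior counts most,
// secondhand enthusiasm counts least.
const WEIGHT: Record<FeedbackType, number> = {
  "behavioral": 1.0,
  "voiced-pain": 0.8,
  "feature-request": 0.5,
  "market-signal": 0.3,
  "ambient-noise": 0.1,
};

function score(f: Feedback): number {
  if (!f.underlyingNeed) {
    return 0; // not yet decomposed: inadmissible, not merely low-signal
  }
  return WEIGHT[f.type];
}
```

Under this lens the mom group's "add a social feed" scores zero until someone articulates the need behind it, while a hostel tester's "I never want to see that place again" decomposes directly into the exclude filter.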
Built a promo code system for early adopters and testers. Had codes ready to distribute. The problem: the highest-value testers were already gone. The hostel strangers who gave the exclude-filter feedback, the walker who surfaced the walk-mode gap, the colleague who couldn't demo to a friend. No contact information was captured during those interactions.
What it cost: Lost the ability to close the loop with the people who shaped the product most. Could not reward the testers who mattered. Could not get follow-up feedback from the users whose input was most valuable.
What fixed it: Created STAKEHOLDERS.md in the repo root: a lightweight register tracking who gave feedback, what channel, what type, what it became, and whether the loop was closed. Includes a capture checklist for future sessions. Not a formal tool; just a file that gets updated. The hostel testers are still lost, but the next round won't be.
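As a sketch of the register (entries reconstructed from the incidents above, not the real file's contents), it might look like:

```markdown
# STAKEHOLDERS.md

| Who            | Channel   | Type        | What it became | Loop closed? |
| -------------- | --------- | ----------- | -------------- | ------------ |
| Hostel testers | In person | Voiced pain | Exclude filter | No (lost)    |
| Walker         | In person | Behavioral  | Walk mode      | No (lost)    |
| Annapolis trio | In person | Demo        | Demo protocol  | No           |
```

The capture checklist at the top of the file would prompt for a name and contact channel before the session ends — the step that was missing every time above.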
Early failures were code. Middle failures were LLM trust. Late failures were process. The progression tells the real story: what you don't know shifts as you learn.
For every category of failure, a system was built to prevent recurrence. The failures are orange. The responses are teal.
31-review suite with automated checks (ESLint, Prettier, Secretlint, Knip, npm audit) plus 26 manual deep-dive reviews. Runs before every deploy.
Demo prep protocol. Before any stakeholder touchpoint: verify the build is production, test on the actual device, confirm the backend matches, have a fallback.
Verify-before-trust protocol. Never trust LLM assertions about pricing, platform limits, or compliance requirements. Always check the primary source.
Feedback taxonomy and stakeholder map. Classify every piece of feedback before acting on it. Capture contact info during engagement.
Most of these failures would have been caught in five minutes by someone with the right experience. The iOS rejection, the billing exposure, the endpoint rename, the pricing assumption. None of them were hard problems. They were gaps in knowledge that a mentor, a code reviewer, or a senior engineer would have flagged instantly.
A solo builder doesn't have that person. An LLM is not that person. The LLM will confidently fill the gap with wrong answers that feel right.
The fix was not "be smarter." The fix was building systems that catch what I can't see. The 31-review suite, the verify-before-trust rule, the demo prep protocol. These are the scar tissue. They exist because something broke.