You're hungry. You open the app. It picks one restaurant for you — not fifty — and explains why in a sentence you'd actually want to read. This case study is about giving the AI only the jobs it's good at (reading what you typed, writing the friendly explanation) and keeping every real decision somewhere the user can see. Plus the hardest design problem of single-pick recommenders: making the user feel heard even when the algorithm picks the same place again.
Existing apps treat hunger like a research project. Open Yelp, Google Maps, OpenTable — every one shows a 50-item list sorted by something opaque, and transfers the work of deciding from app to user. The actual job-to-be-done is "hand me a single confident pick, with the reasoning I can audit, in the time it takes me to put my shoes on." The product builds outward from that one act.
Showing more transfers cognitive load onto the user. A confident product picks one. The only way to do that without being wrong a lot is to have a reasoning system the user can interrogate when they disagree.
"We picked this because of X, Y, Z" beats "trending now." Every reason is sourced (rating · weather · your taste · distance · time · budget) and shown explicitly. Showing the work is what makes one-pick viable.
The same person at noon on Wednesday and 9pm on Friday wants different things. The recommender reads time, weather, learned price tier, recently-presented set, and a single user-chosen mood. The AI's job is to make that read feel human, not algorithmic.
This is the central Product AI question on every product I design: what is the model allowed to decide? In Bitez the answer is narrow on purpose. The LLM (Apple Foundation Models, on-device) gets two jobs — both about turning messy human language into something deterministic code can work with, or rendering deterministic code's output in human language. Everything that affects the actual recommendation runs through plain Swift.
DishIntent the recommender can use.
[ MODEL READS · CODE DECIDES · MODEL NARRATES ]
Apple's on-device LLM has two narrow, high-leverage jobs in Bitez. Below are real frames from the TestFlight build — the model isn't simulated, the screenshots aren't mockups. Each one shows exactly what the LLM contributes and what's still being decided by deterministic code underneath.
The calibration screen replays everything the user just said. The last row — "Locking in Spicy lover" — is the Foundation Models output. The user didn't pick "Spicy" from a list; they typed it (or "curry-something hearty", or "fish that doesn't smell"), and Apple's on-device LLM extracted it into a typed DishIntent the recommender knows how to use.
{ dish: "curry", cuisine: "Indian", isHearty: true }
The model never picks the restaurant. It only converts messy phrasing into structured fields. A 110-entry keyword dictionary acts as the safety net: it answers instantly on common cases, and the LLM runs only when the dictionary can't pin a cuisine. The result feels like the app understood you.
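A minimal sketch of that dictionary-first flow, in plain Swift. The `DishIntent` field names come from the example above. Everything else is hypothetical: the three dictionary entries, the `IntentModel` protocol that stands in for the on-device model call, and the "hearty" check. It illustrates the pattern and is not the shipped parser.

```swift
import Foundation

// Typed intent handed to the deterministic recommender.
// Field names follow the example above; the optionality is an assumption.
struct DishIntent: Equatable {
    var dish: String?
    var cuisine: String?
    var isHearty: Bool
}

// Hypothetical stand-in for the on-device model call.
protocol IntentModel {
    func extractIntent(from text: String) -> DishIntent?
}

// Hypothetical slice of the keyword dictionary (the real one has about 110 entries).
let cuisineKeywords: [String: String] = [
    "curry": "Indian",
    "ramen": "Japanese",
    "taco": "Mexican",
]

func parseIntent(_ text: String, fallback: IntentModel?) -> DishIntent? {
    let lowered = text.lowercased()
    // Split on anything that is not a letter, so "curry-something" yields "curry".
    let words = lowered.split(whereSeparator: { !$0.isLetter }).map(String.init)

    // Fast path: the first word the dictionary can pin to a cuisine wins.
    for word in words {
        if let cuisine = cuisineKeywords[word] {
            return DishIntent(dish: word,
                              cuisine: cuisine,
                              isHearty: lowered.contains("hearty"))
        }
    }

    // Slow path: the model is consulted only when the dictionary found nothing.
    return fallback?.extractIntent(from: text)
}
```

With these entries, "curry-something hearty" resolves on the fast path and never reaches the model, while "fish that doesn't smell" falls through to the fallback.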
The line above the restaurant name — "Jestaz, Out of everything nearby, Nami Nori West Village is the one. Trust me." — is the model speaking. But it's not free-styling. It's reading the deterministic reasoning facts (rating, distance, walking time, mood, weather, recently-presented penalty, the budget-honesty fact below) and rewriting them in one sentence.
Underneath, the deterministic reasoning is still visible — every fact is sourced (BUDGET in this frame), strength-ranked, and auditable. The AI adds warmth; it doesn't replace the math. When a user changes their budget and the same place wins, the reasoning row labelled "Still your best match" is sticky — AI narration can't overwrite the honest fact underneath.
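One way to model sourced, strength-ranked facts with a sticky row is sketched below. The type names, the `strength` scale, and the prompt wording are all assumptions made for illustration. The point is that the narration prompt is assembled from the facts, and the sticky row is produced by code rather than by the model.

```swift
import Foundation

// Sources named in the case study.
enum FactSource: String {
    case rating, weather, taste, distance, time, budget
}

struct ReasoningFact {
    let source: FactSource
    let text: String
    let strength: Double      // hypothetical scale: higher means more influence
    var isSticky = false      // sticky rows always lead the list
}

// Display order: sticky facts first, then strongest first.
func displayOrder(_ facts: [ReasoningFact]) -> [ReasoningFact] {
    facts.sorted { a, b in
        if a.isSticky != b.isSticky { return a.isSticky }
        return a.strength > b.strength
    }
}

// Emits the sticky row when a budget change re-ranks to the same place.
func holdFact(previousPick: String?, newPick: String, budgetChanged: Bool) -> ReasoningFact? {
    guard budgetChanged, previousPick == newPick else { return nil }
    return ReasoningFact(source: .budget,
                         text: "Still your best match",
                         strength: 0,
                         isSticky: true)
}

// The model only ever sees facts that code has already decided on.
func narrationPrompt(name: String, restaurant: String, facts: [ReasoningFact]) -> String {
    let header = "Rewrite these facts as one warm sentence for \(name) about \(restaurant). Do not add facts."
    let lines = displayOrder(facts).map { "- [\($0.source.rawValue.uppercased())] \($0.text)" }
    return ([header] + lines).joined(separator: "\n")
}
```

Because the fact rows are rendered from `displayOrder` and not from the model's sentence, whatever the narration says cannot remove or reword the row underneath it.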
Two bounded jobs. Both about language — reading user phrasing, writing the explanation. Neither one ever decides anything that touches money, distance, or the actual pick.
A user opens Settings, drags budget from $$ to $$$$, taps Save & refresh. The recommender re-ranks. The same restaurant wins. Mathematically the algorithm is right: a 4.5★ place 12 minutes away beats every $$$$ option in the user's radius on signal weight. Perceptually the app is broken. The user changed the input. They expect the output to change. When it doesn't, the algorithm has three seconds to defend itself out loud — or it loses the user.
This is the design problem most "smart" recommenders lose on. Yelp shows fifty options so the user feels in control. A single-pick product can't do that — and that's exactly the problem worth solving. This is the signature moment of the case study because it's the test of whether transparent reasoning is a real design discipline or just a buzzword.
The feedback came from a real tester. Direct quote, not paraphrased: "my friend keep saying after you change the budget range, it's giving you the same restaurant." One line of feedback, five layered design moves in response. The same place wins — but everything around the place changes loudly enough that the user understands the algorithm did run, and did hear them.
−25 score penalty against any restaurant the user has been shown in the recent past. Doesn't reshape the whole ranking, but gives the algorithm a small variety nudge so cold relaunches don't always land on yesterday's pick. Pure code, invisible to the user — but the user feels the variety.
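A sketch of how a flat penalty like this can be applied during ranking. The 25-point value comes from the text. The `Candidate` type, the base scores, and the function name are hypothetical.

```swift
import Foundation

struct Candidate {
    let id: String
    let baseScore: Double   // hypothetical: the score before any variety nudge
}

// Flat penalty from the case study.
let recentlyPresentedPenalty = 25.0

// Returns candidates sorted best-first after subtracting the penalty
// from anything the user has been shown recently.
func rank(_ candidates: [Candidate],
          recentlyPresented: Set<String>) -> [(id: String, score: Double)] {
    candidates
        .map { candidate -> (id: String, score: Double) in
            let penalty = recentlyPresented.contains(candidate.id) ? recentlyPresentedPenalty : 0
            return (id: candidate.id, score: candidate.baseScore - penalty)
        }
        .sorted { $0.score > $1.score }
}
```

A penalized restaurant with a base score of 90 drops to 65 and loses to an 80. One with a base score of 120 drops to 95 and still wins, which is why the same place can legitimately hold the top spot.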
ALGORITHM NUDGE
The recommendation didn't change. Everything around it did. Five design moves, ranging from a single inline phrase to a recommender-level score penalty, all in service of one outcome: the user understands the algorithm heard them, even when the result lands in the same place. The friend who'd called it broken never raised that complaint again.
[ ALGORITHM HOLDS · UI DEFENDS · USER UNDERSTANDS ]
Apple's Foundation Models run free on the user's phone. Google Places (New) — the data the recommender grounds every reasoning fact against — does not. A naive build hits the API on every screen open and burns the unit economics inside a week. Three small caching choices, each protecting the AI's accuracy without paying for it more than once a day.
Eight LRU pools by query signature: toggling Italian → Korean → Italian costs nothing on the return trip, because the Italian pool is still cached.
Fetch the weekly schedule once, compute open / closed locally against the device clock. The pool stays valid 24 hours, not 30 minutes.
Severe-weather alerts at the user's coordinate enter the cache signature → automatic miss → live refetch right when local hours actually shift.
[ NAIVE: 1 API CALL / APP OPEN → STAGED: ~1 / DAY / USER ]
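The staging above can be sketched as a small LRU cache keyed by a query signature. This reads "eight LRU pools" as an eight-slot cache, which is an assumption, as are the signature fields and type names. The 24-hour lifetime and the severe-weather flag in the key follow the text.

```swift
import Foundation

// Hypothetical cache key: any field that changes produces a different key.
struct QuerySignature: Hashable {
    let cuisine: String
    let gridCell: String           // coarse location bucket (assumed)
    let severeWeatherAlert: Bool   // flipping this forces a miss and a refetch
}

struct PoolCache<Value> {
    private var storage: [QuerySignature: (value: Value, storedAt: Date)] = [:]
    private var order: [QuerySignature] = []   // least recently used first
    let capacity: Int
    let timeToLive: TimeInterval

    init(capacity: Int = 8, timeToLive: TimeInterval = 24 * 60 * 60) {
        self.capacity = capacity
        self.timeToLive = timeToLive
    }

    // Returns a cached pool, or nil when the entry is missing or expired.
    mutating func value(for key: QuerySignature, now: Date = Date()) -> Value? {
        guard let entry = storage[key] else { return nil }
        guard now.timeIntervalSince(entry.storedAt) < timeToLive else {
            storage[key] = nil
            order.removeAll { $0 == key }
            return nil
        }
        touch(key)
        return entry.value
    }

    // Stores a pool and evicts the least recently used entry when over capacity.
    mutating func insert(_ value: Value, for key: QuerySignature, now: Date = Date()) {
        storage[key] = (value: value, storedAt: now)
        touch(key)
        if order.count > capacity {
            let evicted = order.removeFirst()
            storage[evicted] = nil
        }
    }

    // Marks a key as most recently used.
    private mutating func touch(_ key: QuerySignature) {
        order.removeAll { $0 == key }
        order.append(key)
    }
}
```

A caller would check `value(for:)` first and hit the network only on `nil`. Putting the alert flag in the key means no separate invalidation code is needed: a new alert is simply a different signature.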
Iteration is the proof that the product is being validated by real people, not assumed from the inside. Each release below pairs an actual quote from a TestFlight tester with the design / engineering response that shipped in answer. The point isn't speed — it's that the loop is closed: feedback in, fix out, ship, repeat. Nothing reverse-engineered for the case study.
openNow; keep unknown hours in the pool but bias them lower. Pool got smaller, pool got more honest. Three days later this got smarter: Stage 2 computes "open now" locally from the cached weekly schedule.
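A sketch of computing "open now" locally from a cached weekly schedule. The representation (periods as minutes since Sunday 00:00 local time) and the names are assumptions, not the Google Places response shape. The `unknown` case mirrors the choice to keep places with unknown hours in the pool so they can be ranked lower.

```swift
import Foundation

// One opening period, in minutes since Sunday 00:00 local time (assumed representation).
// A period that runs past Saturday midnight has end < start.
struct OpenPeriod {
    let start: Int
    let end: Int
}

enum OpenStatus {
    case open, closed, unknown
}

// Converts a date to minutes since Sunday 00:00 in the calendar's time zone.
func minuteOfWeek(for date: Date, calendar: Calendar) -> Int {
    let parts = calendar.dateComponents([.weekday, .hour, .minute], from: date)
    let day = (parts.weekday ?? 1) - 1   // Calendar weekdays run 1...7, Sunday is 1
    return (day * 24 + (parts.hour ?? 0)) * 60 + (parts.minute ?? 0)
}

// No network call: only the cached schedule and the device clock.
func openStatus(schedule: [OpenPeriod]?,
                at date: Date,
                calendar: Calendar = .current) -> OpenStatus {
    guard let schedule = schedule else { return .unknown }
    let now = minuteOfWeek(for: date, calendar: calendar)
    for period in schedule {
        if period.start <= period.end {
            if now >= period.start && now < period.end { return .open }
        } else if now >= period.start || now < period.end {
            // The period wraps around the end of the week.
            return .open
        }
    }
    return .closed
}
```

Because the schedule changes far less often than the clock, the cached pool can stay valid much longer than a cached "open now" flag would.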
$$ badge, a "Still your top pick" confirmation chip when re-ranking legitimately picks the same place, and a -25 recently-presented penalty so cold relaunches don't always land on yesterday's pick.
markdownSafe extension, balance-check ** / _ counts at the AttributedString.markdown() entry, and sanitize AI-generated reasons before they reach the parser. Same crash had recurred — this stop-shipped it.
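A sketch of what a balance check like this might look like. Only the `markdownSafe` name comes from the text. The implementation is a guess at the idea: if `**` or `_` markers are unbalanced, strip the emphasis markers so the Markdown parser only ever sees plain text. The helper `occurrences(of:)` is hypothetical.

```swift
import Foundation

extension String {
    // Counts non-overlapping occurrences of a token (hypothetical helper).
    func occurrences(of token: String) -> Int {
        components(separatedBy: token).count - 1
    }

    // Returns the string unchanged when emphasis markers are balanced,
    // otherwise returns it with all ** and _ markers removed.
    var markdownSafe: String {
        let boldMarkers = occurrences(of: "**")
        let underscores = filter { $0 == "_" }.count
        if boldMarkers % 2 == 0 && underscores % 2 == 0 {
            return self
        }
        return replacingOccurrences(of: "**", with: "")
            .replacingOccurrences(of: "_", with: "")
    }
}
```

Running AI-generated reasons through a property like this before parsing trades a little lost formatting for a string that is always safe to hand to the parser.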
Reading this top-to-bottom is the design discipline: every fix paid down a specific human moment. Not a sprint plan, not a roadmap — feedback in, fix out, ship.
Day-zero hypothesis: pick a mood, get a place. Day-ten reality: testers wanted to say what they wanted. Removing "I'm hungry" and "Comfort food" cut the mood taxonomy in half; adding the free-text input gave the LLM its first job (PARSE) and let users phrase real cravings — "ramyon", "fish that doesn't smell", "curry-something hearty".
The "Try 'curry' or 'something spicy'" placeholder is a hint and a contract: speak normally, the app understands. Apple Foundation Models parses; the keyword dictionary is the safety net.
Everything above explains why Bitez is built the way it is. What follows is what it actually looks like: the real screens, the interaction patterns, and how to try it yourself. Because the app is unreleased, this section is gated behind a password I share with hiring managers and collaborators. Email me for access; I usually reply within a few hours.
Detailed walkthrough notes: the mood gate logic, the calibration animation with live restaurant count, the reasoning chip behavior on a budget mismatch, the recently-presented penalty for variety, the offline / mock-data banner system, the Apple Maps fallback for food deserts, and the in-app implicit history learning loop.
Product AI isn't about adding more model. It's about deciding precisely where the model adds value the rest of the system can't. In Bitez the model parses messy human input into typed data and rewrites deterministic facts in human voice — both irreducibly language tasks. Everything else (which restaurant, which budget, which hours, which signals matter most tonight) is plain Swift the user can audit. That separation is what made the AI feel present without ever being in charge.
A single confident pick. Two AI moments — read intent, write warmth. Reasoning the user can audit. Infrastructure that doesn't cost more than it earns. That's the product. The case study is a tour of the design decisions holding those four things together.