[ iOS APP · PRODUCT AI · TESTFLIGHT BETA · 2026 ]

Bitez · on-device AI for what to eat.

You're hungry. You open the app. It picks one restaurant for you — not fifty — and explains why in a sentence you'd actually want to read. This case study is about giving the AI only the jobs it's good at (reading what you typed, writing the friendly explanation) and keeping every real decision somewhere the user can see. Plus the hardest design problem of single-pick recommenders: making the user feel heard even when the algorithm picks the same place again.

→ CORE NEED: "I'm hungry. Tell me one place to go right now, and why." Replace endless feeds with a single confident pick the user can audit, change their mind on, and trust again next time.
→ STATUS: TestFlight beta · iterating with internal + external testers · every build paired to a specific piece of tester feedback before shipping.
→ MY ROLE: Co-founder · Product AI designer + iOS engineer. I designed and built the entire app: prompt design, on-device LLM integration, recommender, hallucination guards, sticky reasoning, transparent UI for non-obvious algorithm decisions, brand and visual system.
→ TEAM: Co-founder Akhil Tadiparthi runs Duo Craft (the publishing entity on the App Store listing), holds the Apple Developer account, and handles incorporation, legal, finance, and distribution. The split lets me focus 100% on product and build.
→ STACK: SwiftUI · iOS 17/26 · Apple Foundation Models (on-device LLM) · Google Places (New) · WeatherKit · MetricKit · 100% on-device, no backend.
🔒
App screenshots throughout this case study are gated. The product isn't public yet, so the visuals are blurred until you enter the password. Scroll to the bottom to unlock → Don't have the password? Email me.
2 · Apple Foundation Models touchpoints · narrowly scoped
100% · picks ship with user-auditable reasoning
0 · hallucinated narrations · model output rejected if unanchored
1/day · typical Google Places call per user, after caching kicks in
01 / THE CORE NEED

"Just tell me where to eat" is a decision request — not a discovery request.

Existing apps treat hunger like a research project. Open Yelp, Google Maps, OpenTable — every one shows a 50-item list sorted by something opaque, and transfers the work of deciding from app to user. The actual job-to-be-done is "hand me a single confident pick, with the reasoning I can audit, in the time it takes me to put my shoes on." The product builds outward from that one act.

→ DECISION FATIGUE

50 options ≠ help.

Showing more transfers cognitive load onto the user. A confident product picks one. The only way to do that without being wrong a lot is to have a reasoning system the user can interrogate when they disagree.

→ TRUST DEFICIT

Recommendations need receipts.

"We picked this because of X, Y, Z" beats "trending now." Every reason is sourced (rating · weather · your taste · distance · time · budget) and shown explicitly. Showing the work is what makes one-pick viable.

→ MOMENT-AWARENESS

Mood + weather + budget + history all matter.

The same person at noon on Wednesday and 9pm on Friday wants different things. The recommender reads time, weather, learned price tier, recently-presented set, and a single user-chosen mood. The AI's job is to make that read feel human, not algorithmic.

02 / AI ARCHITECTURE

The model translates. Code decides. Both layers are visible to the user.

This is the central Product AI question on every product I design: what is the model allowed to decide? In Bitez the answer is narrow on purpose. The LLM (Apple Foundation Models, on-device) gets two jobs — both about turning messy human language into something deterministic code can work with, or rendering deterministic code's output in human language. Everything that affects the actual recommendation runs through plain Swift.

→ LLM SIDE · Apple Foundation Models

Reads intent. Writes warmth.

  • Parse free-text dish requests. "Curry-something hearty" / "ramyon" / "fish that doesn't smell." Returns a typed DishIntent the recommender can use.
  • Narrate the pick in friend-voice. One warm sentence + 3 short reasons, grounded only in the deterministic reasoning facts. Tone the user actually wants to read.
  • Nothing the user can't override. Both outputs are advisory; the underlying signals are always visible if AI fails or hallucinates.
[ HANDOFF · JSON · SCORING ]
→ CODE SIDE · deterministic recommender

Picks the place. Justifies the pick.

  • Weighted scoring across rating, distance, walkability, cuisine match, mood bias, budget tier, learned history, and a recently-presented penalty for variety.
  • The reasoning engine mines facts from sourced signals — Ratings, Weather, Your taste, Distance, Time, Your budget — each carrying a strength score the UI ranks.
  • Honesty fact for budget mismatch: if a $$ wins for a $$$$ user, the engine surfaces a dedicated fact explaining why — turned a perceived bug into the most-loved feature in testing.
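A minimal sketch of what a weighted scorer with a recently-presented penalty might look like. Only the −25 penalty comes from this case study; the `Candidate` fields, the `Weights` values, and the function names are hypothetical placeholders, and the real scorer also covers walkability, mood bias, and learned history.

```swift
import Foundation

// Hypothetical candidate shape; field names are illustrative, not Bitez's real model.
struct Candidate {
    let name: String
    let rating: Double        // 0...5 stars
    let walkMinutes: Double
    let cuisineMatches: Bool
    let priceTier: Int        // 1 = $, 4 = $$$$
    let recentlyShown: Bool
}

// Placeholder weights, chosen only so the arithmetic is easy to follow.
struct Weights {
    var perStar = 20.0
    var perWalkMinute = -1.0
    var cuisineMatch = 15.0
    var perTierOfBudgetGap = -5.0
    var recentlyPresented = -25.0   // the one value stated in the case study
}

// Sums each signal's contribution into a single comparable score.
func score(_ c: Candidate, budgetTier: Int, weights w: Weights = Weights()) -> Double {
    var total = c.rating * w.perStar
    total += c.walkMinutes * w.perWalkMinute
    if c.cuisineMatches { total += w.cuisineMatch }
    total += Double(abs(c.priceTier - budgetTier)) * w.perTierOfBudgetGap
    if c.recentlyShown { total += w.recentlyPresented }
    return total
}

// The single pick is simply the highest-scoring candidate.
func pick(from candidates: [Candidate], budgetTier: Int) -> Candidate? {
    candidates.max { score($0, budgetTier: budgetTier) < score($1, budgetTier: budgetTier) }
}
```

With these placeholder weights, a 4.5★ $$ place 12 minutes away that matches the cuisine can still outscore a closer 4.0★ $$$$ place for a $$$$ user. That is the situation the budget honesty fact exists to explain.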

[ MODEL READS · CODE DECIDES · MODEL NARRATES ]

03 / TWO AI MOMENTS, SHOWN

Both AI touchpoints, captured in the live app.

Apple's on-device LLM has two narrow, high-leverage jobs in Bitez. Below are real frames from the TestFlight build — the model isn't simulated, the screenshots aren't mockups. Each one shows exactly what the LLM contributes and what's still being decided by deterministic code underneath.

[ SCREENSHOT · Bitez calibration screen: Found 20 spots near 113564, Filtering by $$$$ budget, Matching French, Locking in Spicy lover · password-gated ]
Moment 01 · PARSE

Free-text dish intent → structured cuisine, mood, and flags.

The calibration screen replays everything the user just said. The last row — "Locking in Spicy lover" — is the Foundation Models output. The user didn't pick "Spicy" from a list; they typed it (or "curry-something hearty", or "fish that doesn't smell"), and Apple's on-device LLM extracted it into a typed DishIntent the recommender knows how to use.

User input: "curry-something hearty"
LLM output: { dish: "curry", cuisine: "Indian", isHearty: true }

The model never picks the restaurant. It only converts messy phrasing into structured fields. A keyword dictionary of 110+ entries acts as the safety net: it resolves common cases instantly, and the LLM only runs when the dictionary can't pin a cuisine. The result feels like the app understood you.
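The routing described above (dictionary first, model only on a miss) can be sketched roughly as follows. This is an illustration under assumptions: `DishIntent` here carries only the three fields from the example output, the three-entry dictionary stands in for the real one, and `llmParse` is a hypothetical injected closure rather than the actual Foundation Models call.

```swift
import Foundation

// Mirrors the example output shown above; the real DishIntent likely has more fields.
struct DishIntent: Equatable {
    var dish: String?
    var cuisine: String?
    var isHearty = false
}

// A three-entry stand-in for the 110+ entry dictionary.
let cuisineKeywords: [String: String] = [
    "curry": "Indian",
    "ramen": "Japanese",
    "taco": "Mexican",
]

/// Tries the keyword dictionary; returns nil when no token pins a cuisine.
func keywordIntent(for text: String) -> DishIntent? {
    let lowered = text.lowercased()
    let tokens = lowered.split { !$0.isLetter }.map(String.init)
    for token in tokens {
        if let cuisine = cuisineKeywords[token] {
            return DishIntent(dish: token,
                              cuisine: cuisine,
                              isHearty: lowered.contains("hearty"))
        }
    }
    return nil
}

/// `llmParse` stands in for the on-device model call. Injecting it keeps the
/// routing testable without the model.
func parseIntent(_ text: String, llmParse: (String) -> DishIntent?) -> DishIntent? {
    if let hit = keywordIntent(for: text) { return hit }
    return llmParse(text)
}
```

In this sketch "curry-something hearty" never reaches the model, while "ramyon" misses the dictionary and falls through to it.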

Apple Foundation Models · @Generable schema · Keyword fallback (110+ entries) · 5s timeout race
[ SCREENSHOT · Bitez pick view: Nami Nori West Village, friend-voice narration above, budget honesty fact below explaining why $$ won when user said $$$$ · password-gated ]
Moment 02 · NARRATE

Sourced reasoning facts → one warm sentence the user actually wants to read.

The line above the restaurant name — "Jestaz, Out of everything nearby, Nami Nori West Village is the one. Trust me." — is the model speaking. But it's not free-styling. It's reading the deterministic reasoning facts (rating, distance, walking time, mood, weather, recently-presented penalty, the budget-honesty fact below) and rewriting them in one sentence.

Hallucination guard: the line is rejected if it doesn't mention the restaurant name, cuisine, or user's named dish. If it fails, the template version takes over and the user never knows.
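One way such a guard could be written, as a sketch only. It assumes "anchored" means the line mentions at least one of the three anchors, compared case-insensitively; the `PickFacts` type, the template wording, and the cuisine value used in the example are hypothetical.

```swift
import Foundation

// Hypothetical container for the facts the narration must stay anchored to.
struct PickFacts {
    let restaurantName: String
    let cuisine: String
    let namedDish: String?   // nil when the user didn't type a dish
}

/// True when the line mentions at least one anchor, ignoring case.
func isAnchored(_ line: String, to facts: PickFacts) -> Bool {
    let anchors = [facts.restaurantName, facts.cuisine, facts.namedDish ?? ""]
        .filter { !$0.isEmpty }
    return anchors.contains { line.range(of: $0, options: .caseInsensitive) != nil }
}

/// Returns the model's line only if it passes the guard; otherwise a
/// deterministic template built from the same facts.
func narration(modelLine: String?, facts: PickFacts) -> String {
    let template = "\(facts.restaurantName) is your best \(facts.cuisine) pick nearby."
    guard let line = modelLine, isAnchored(line, to: facts) else { return template }
    return line
}
```

Because the fallback is built from the same facts, a rejected line and a missing line (model unavailable, timed out) take the same path.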

Underneath, the deterministic reasoning is still visible — every fact is sourced (BUDGET in this frame), strength-ranked, and auditable. The AI adds warmth; it doesn't replace the math. When a user changes their budget and the same place wins, the reasoning row labelled "Still your best match" is sticky — AI narration can't overwrite the honest fact underneath.

@Generable AINarrationSchema · Hallucination guard · Grounded on facts · Sticky reasoning

Two bounded jobs. Both about language — reading user phrasing, writing the explanation. Neither one ever decides anything that touches money, distance, or the actual pick.

04 / THE TRANSPARENCY PROBLEM

When the user moves but the recommendation doesn't.

A user opens Settings, drags budget from $$ to $$$$, taps Save & refresh. The recommender re-ranks. The same restaurant wins. Mathematically the algorithm is right: a 4.5★ place 12 minutes away beats every $$$$ option in the user's radius on signal weight. Perceptually the app is broken. The user changed the input. They expect the output to change. When it doesn't, the algorithm has three seconds to defend itself out loud — or it loses the user.

This is the design problem most "smart" recommenders lose on. Yelp shows fifty options so the user feels in control. A single-pick product can't do that — and that's exactly the problem worth solving. This is the signature moment of the case study because it's the test of whether transparent reasoning is a real design discipline or just a buzzword.

[ SCREENSHOT · Pick view defending an off-tier choice with five visible design moves · password-gated ]
The fix · five visible layers

Same pick. Visibly defended on five fronts, in one screen.

The feedback came from a real tester. Direct quote, not paraphrased: "my friend keep saying after you change the budget range, it's giving you the same restaurant." One line of feedback, five layered design moves in response. The same place wins — but everything around the place changes loudly enough that the user understands the algorithm did run, and did hear them.

→ 01 The price-tier badge gets visibly louder. "$$ below your $$$$" rendered inline under the rating row. The single most direct answer to "did my $$$$ change even take effect?": yes, we read $$$$, and yes, we picked $$ anyway. The user knows the signal landed before they read another line. [ UI EMPHASIS ]
→ 02 The budget honesty fact. A sticky reasoning row, surfacing every time pick.priceLevel != profile.budget: "You said $$$$ but $$ won — it beats your higher-tier options here on rating, distance, or what you actually keep eating." Sticky means AI narration cannot overwrite it. The reasoning shows its work, in plain language, every time. [ NARRATIVE ]
→ 03 "Still your strongest match" chip. When the user-driven refresh re-ranks to the same restaurant, a small confirmation chip surfaces: "This is my strongest match for you right now — trust it." Tells the user explicitly the algorithm ran AND landed here again, not "nothing happened". [ RUN CONFIRMATION ]
→ 04 Recently-presented penalty. Inside the recommender: a −25 score penalty against any restaurant the user has been shown in the recent past. Doesn't reshape the whole ranking, but gives the algorithm a small variety nudge so cold relaunches don't always land on yesterday's pick. Pure code, invisible to the user, but the user feels the variety. [ ALGORITHM NUDGE ]
→ 05 Post-save confirmation toast. "Updated — showing your top $$$$ picks" appears briefly after Save & refresh, before the user has read anything else on screen. The change is acknowledged the moment it happens, not retroactively inferred from a different pick that never came. [ ACK ]
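Move 02 is the most code-shaped of the five, so here is a rough sketch of it. The trigger condition (pick tier differs from budget tier) and the sticky rule come from the text above; the `ReasoningFact` type, the shortened copy, and the idea that model rewrites arrive keyed by source are assumptions made for illustration.

```swift
import Foundation

// Hypothetical reasoning-row model.
struct ReasoningFact: Equatable {
    let source: String      // e.g. "BUDGET", "RATINGS"
    let text: String
    let strength: Double
    let isSticky: Bool
}

// Renders a price tier as dollar signs: 2 becomes "$$".
func tierLabel(_ tier: Int) -> String {
    String(repeating: "$", count: tier)
}

/// Produces the honesty fact only when the pick is off-tier.
func budgetHonestyFact(pickTier: Int, budgetTier: Int) -> ReasoningFact? {
    guard pickTier != budgetTier else { return nil }
    return ReasoningFact(
        source: "BUDGET",
        text: "You said \(tierLabel(budgetTier)) but \(tierLabel(pickTier)) won.",
        strength: 1.0,
        isSticky: true
    )
}

/// Applies model rewrites to non-sticky rows only. Sticky rows always keep
/// their deterministic text.
func applyNarration(_ rewrites: [String: String], to facts: [ReasoningFact]) -> [ReasoningFact] {
    facts.map { fact -> ReasoningFact in
        guard !fact.isSticky, let rewritten = rewrites[fact.source] else { return fact }
        return ReasoningFact(source: fact.source, text: rewritten,
                             strength: fact.strength, isSticky: false)
    }
}
```

Keeping stickiness as a property of the fact, rather than a rule in the prompt, means the guarantee holds no matter what the model returns.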

The recommendation didn't change. Everything around it did. Five visible design moves, ranging from a single inline phrase to a recommender-level score penalty, all in service of one outcome: the user understands the algorithm heard them, even when the result lands in the same place. The friend who'd called it broken never raised that complaint again.

[ ALGORITHM HOLDS  ·  UI DEFENDS  ·  USER UNDERSTANDS ]

05 / FEEDING THE AI WITHOUT PAYING TWICE

On-device AI is free. The ground truth it anchors to isn't.

Apple's Foundation Models run free on the user's phone. Google Places (New), the data the recommender grounds every reasoning fact against, does not. A naive build hits the API on every screen open and burns the unit economics inside a week. Three small caching choices fix that, each protecting the AI's accuracy without paying for the same data more than once a day.

01 · POOLS

Multi-pool cache.

Eight LRU pools keyed by query signature. Toggling Italian → Korean → Italian costs two fetches, not three: the return to Italian is served from cache.
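A minimal LRU pool cache, sketched under assumptions: the capacity of eight comes from the text, while the string signature, the array-based recency list, and the type name `PoolCache` are illustrative choices, not the shipped implementation.

```swift
import Foundation

/// A small least-recently-used cache keyed by query signature.
struct PoolCache<Value> {
    private var storage: [String: Value] = [:]
    private var order: [String] = []   // least recently used first
    let capacity: Int

    init(capacity: Int = 8) {
        self.capacity = capacity
    }

    /// Returns the cached pool and marks it as most recently used.
    mutating func value(for signature: String) -> Value? {
        guard let hit = storage[signature] else { return nil }
        touch(signature)
        return hit
    }

    /// Stores a pool, evicting the least recently used one when over capacity.
    mutating func insert(_ value: Value, for signature: String) {
        storage[signature] = value
        touch(signature)
        if order.count > capacity {
            let evicted = order.removeFirst()
            storage[evicted] = nil
        }
    }

    private mutating func touch(_ signature: String) {
        order.removeAll { $0 == signature }
        order.append(signature)
    }
}
```

An array scan is fine at eight entries; a larger cache would usually pair the dictionary with a linked list instead.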

02 · TIME

Local "open now" math.

Fetch the weekly schedule once, compute open / closed locally against the device clock. The pool stays valid 24 hours, not 30 minutes.
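The local computation might look something like this. It is a sketch, not the app's code: `OpenPeriod` is a hypothetical simplification that stores each period as minutes since Sunday 00:00, which is not how Google Places returns opening hours.

```swift
import Foundation

// Hypothetical simplified period: minutes since Sunday 00:00.
struct OpenPeriod {
    let open: Int
    let close: Int   // smaller than `open` when the period wraps past Saturday midnight
}

/// Converts a date to minutes since Sunday 00:00 in the given calendar.
func minuteOfWeek(for date: Date, calendar: Calendar) -> Int {
    let parts = calendar.dateComponents([.weekday, .hour, .minute], from: date)
    // Calendar weekdays run from 1 (Sunday) to 7 (Saturday).
    let day = (parts.weekday ?? 1) - 1
    return day * 1440 + (parts.hour ?? 0) * 60 + (parts.minute ?? 0)
}

/// Checks the cached weekly schedule against the device clock, with no network call.
func isOpen(_ periods: [OpenPeriod], at date: Date, calendar: Calendar = .current) -> Bool {
    let now = minuteOfWeek(for: date, calendar: calendar)
    return periods.contains { p in
        p.open <= p.close
            ? (p.open <= now && now < p.close)
            : (now >= p.open || now < p.close)
    }
}
```

Since the schedule itself changes rarely, only the clock comparison needs to run on each app open, which is what lets the pool outlive a 30-minute freshness window.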

03 · EMERGENCY

Weather-triggered refetch.

Severe-weather alerts at the user's coordinate enter the cache signature → automatic miss → live refetch right when local hours actually shift.
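The trick of folding an alert into the cache key can be shown in a few lines. This sketch deliberately leaves WeatherKit out: `severeAlert` is assumed to be a string the app has already obtained, and `QuerySignature` and `PoolStore` are hypothetical names.

```swift
import Foundation

// Hypothetical cache key. Any change to a field produces a different key.
struct QuerySignature: Hashable {
    let cuisine: String
    let budgetTier: Int
    let severeAlert: String?   // nil in normal weather
}

final class PoolStore {
    private var pools: [QuerySignature: [String]] = [:]
    private(set) var liveFetches = 0

    /// Returns the cached pool for this signature, or fetches and stores one.
    func pool(for signature: QuerySignature, fetch: () -> [String]) -> [String] {
        if let cached = pools[signature] { return cached }
        liveFetches += 1
        let fresh = fetch()
        pools[signature] = fresh
        return fresh
    }
}
```

No explicit invalidation code is needed: when an alert appears the key changes, the lookup misses, and the next read refetches. When the alert clears, the key reverts to the calm-weather signature.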

[ NAIVE: 1 API CALL / APP OPEN   →   STAGED: ~1 / DAY / USER ]

05 / TESTER-DRIVEN ITERATION

Every TestFlight build is paired to a specific tester observation.

Iteration is the proof that the product is being validated by real people, not assumed from the inside. Each release below pairs an actual quote from a TestFlight tester with the design / engineering response that shipped in answer. The point isn't speed — it's that the loop is closed: feedback in, fix out, ship, repeat. Nothing reverse-engineered for the case study.

01 Hard-filter rule: drop only confirmed-closed places from Google's openNow; keep unknown hours in the pool but bias them lower. Pool got smaller, pool got more honest. Three days later this got smarter: Stage 2 computes "open now" locally from the cached weekly schedule.
02 Three-layer venue filter: primary-type denylist (night_club, casino, adult_entertainment), secondary-type denylist (any matching tag), and a name-token blocklist for places whose primary type lies but whose name doesn't. Plus excludedTypes pushed to Google at fetch time.
03 Four moves at once: heavier price-tier weight in the scorer, a much larger visible $$ badge, a "Still your top pick" confirmation chip when re-ranking legitimately picks the same place, and a -25 recently-presented penalty so cold relaunches don't always land on yesterday's pick.
04 The budget honesty fact: a dedicated reasoning row appears when the pick is off-tier — "You said $$$$ but $$ won — it beats your higher-tier options on rating, distance, or what you actually keep eating." Sticky (AI narration can't replace it), visible the entire time the pick stays on screen.
05 Two-layer freeze defense: 5-second timeout race around every AI call (TaskGroup against a sleep), and a 2.5-second LoaderOverlay kill-switch that force-hides any stuck spinner. The unhappy path now has the same care as the happy one.
06 Triple-layer markdown defense: escape every dynamic field with a markdownSafe extension, balance-check ** / _ counts at the AttributedString(markdown:) entry point, and sanitize AI-generated reasons before they reach the parser. The same crash had recurred, and this fix stopped it from shipping again.
07 Cost optimization Stages 1 + 2: multi-pool cache (8 LRU signatures) so toggling cuisines is free after the first visit, plus 24-hour cache backed by locally-computed open/closed math. Combined cumulative API savings: ~60-80% depending on session pattern.
08 Stage 3 — WeatherKit emergency refetch. Severe / extreme weather alerts at the user's coordinate enter the cache signature, automatically busting the 24-hour cache exactly when local schedules become unreliable. Home banner: "Winter Storm Warning — hours may vary." Honest signal, free API.
09 Crash-crumb pattern. UserDefaults flag set before each AI call, cleared on normal return. Next launch sees it stuck → previous run died inside Apple's framework → disable AI for this session, escalate to permanent after two strikes. App never crashes inside the same Apple bug twice in a row.
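The crash-crumb pattern in item 09 could be sketched like this. The key names, the `AIState` enum, and the choice to count strikes cumulatively (rather than consecutively) are assumptions; the text above only specifies the flag, the session disable, and the two-strike escalation.

```swift
import Foundation

final class CrashCrumb {
    enum AIState: Equatable {
        case enabled, disabledThisSession, disabledPermanently
    }

    private let defaults: UserDefaults
    private let inFlightKey = "ai.callInFlight"   // hypothetical key names
    private let strikesKey = "ai.crashStrikes"

    init(defaults: UserDefaults = .standard) {
        self.defaults = defaults
    }

    /// Call once at launch, before any AI call is made.
    func stateAtLaunch() -> AIState {
        // A flag that is still set means the previous run died mid-call.
        let crashedLastRun = defaults.bool(forKey: inFlightKey)
        var strikes = defaults.integer(forKey: strikesKey)
        if crashedLastRun {
            strikes += 1
            defaults.set(strikes, forKey: strikesKey)
            defaults.set(false, forKey: inFlightKey)
        }
        if strikes >= 2 { return .disabledPermanently }
        return crashedLastRun ? .disabledThisSession : .enabled
    }

    /// Wraps an AI call: set the flag before, clear it on any normal exit.
    /// A thrown error also clears the flag, since a throw is not a crash.
    func run<T>(_ aiCall: () throws -> T) rethrows -> T {
        defaults.set(true, forKey: inFlightKey)
        defer { defaults.set(false, forKey: inFlightKey) }
        return try aiCall()
    }
}
```

The flag has to be written before the call because a crash inside the framework gives the app no chance to record anything afterwards.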

Reading this top-to-bottom is the design discipline: every fix paid down a specific human moment. Not a sprint plan, not a roadmap — feedback in, fix out, ship.
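The timeout race from item 05 (a TaskGroup against a sleep) is a common Swift concurrency idiom. A generic sketch, with `withTimeout`, `TimedOut`, and `narrate` as hypothetical names:

```swift
import Foundation

struct TimedOut: Error {}

/// Races `operation` against a sleep. Whichever child finishes first decides
/// the result, and the other child is cancelled.
func withTimeout<T: Sendable>(
    seconds: Double,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    try await withThrowingTaskGroup(of: T.self) { group in
        group.addTask { try await operation() }
        group.addTask {
            try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
            throw TimedOut()
        }
        defer { group.cancelAll() }
        guard let first = try await group.next() else { throw TimedOut() }
        return first
    }
}

/// Example use: fall back to the template when the model is slow or fails.
func narrate(
    template: String,
    modelCall: @escaping @Sendable () async throws -> String
) async -> String {
    (try? await withTimeout(seconds: 5, operation: modelCall)) ?? template
}
```

One caveat with this idiom: a task group waits for all of its children before returning, so an operation that ignores cancellation can still hold the caller past the deadline. That may be part of why a separate overlay kill-switch is a useful second layer.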

[ SCREENSHOT · Mood Gate v1: 4 mood cards, no AI input · password-gated ]
EARLY BUILD · 4 moods, no AI

The mood gate went from a 4-card menu to a 2-card + free-text input.

Day-zero hypothesis: pick a mood, get a place. Day-ten reality: testers wanted to say what they wanted. Removing "I'm hungry" and "Comfort food" cut the mood taxonomy in half; adding the free-text input gave the LLM its first job (PARSE) and let users phrase real cravings — "ramyon", "fish that doesn't smell", "curry-something hearty".

↓ AI ENTERED HERE ↓

The "Try 'curry' or 'something spicy'" placeholder is a hint and a contract: speak normally, the app understands. Apple Foundation Models parses, the keyword dictionary is the safety net.

[ SCREENSHOT · Mood Gate v3: 2 mood cards plus an AI free-text input · password-gated ]
CURRENT BUILD · 2 moods + AI input
06 / APP WALKTHROUGH

The actual product, the screens, the TestFlight build.

Everything above explains why Bitez is built the way it is. What follows is what it actually looks like: the real screens, the interaction patterns, and how to try it yourself. Because the app is unreleased, this section is gated behind a password I share with hiring managers and collaborators. Email me for access; I usually reply within a few hours.

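[ SCREENSHOT · Bitez welcome screen: I'll pick. You eat. · password-gated ]
[ SCREENSHOT · Bitez onboarding step 3: cuisine preferences with Surprise me button · password-gated ]
[ SCREENSHOT · Bitez calibration screen showing live Google data · password-gated ]
[ SCREENSHOT · Bitez pick view with budget honesty fact · password-gated ]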
Try Bitez on your iPhone
TestFlight invite · iOS 17+ · 20+ testers so far.

Detailed walkthrough notes: the mood gate logic, the calibration animation with live restaurant count, the reasoning chip behavior on a budget mismatch, the recently-presented penalty for variety, the offline / mock-data banner system, the Apple Maps fallback for food deserts, and the in-app implicit history learning loop.

07 / WHAT I LEARNED

Product AI isn't about adding more model. It's about deciding what the model isn't allowed to do.

The flip side is deciding precisely where the model adds value the rest of the system can't. In Bitez the model parses messy human input into typed data and rewrites deterministic facts in human voice, both irreducibly language tasks. Everything else (which restaurant, which budget, which hours, which signals matter most tonight) is plain Swift the user can audit. That separation is what made the AI feel present without ever being in charge.

A single confident pick. Two AI moments — read intent, write warmth. Reasoning the user can audit. Infrastructure that doesn't cost more than it earns. That's the product. The case study is a tour of the design decisions holding those four things together.