LLMs make bad judges. They make excellent translators. A design pattern for AI systems where decisions carry real-world consequences — worked out in detail on a refund bot, but generalizable to any tool-using agent. The model reads intent; deterministic code enforces the rules.
Customer support automation works until a customer asks for something with financial consequences. Then the thing that makes LLMs useful — their willingness to be helpful — becomes the thing that makes them dangerous. Three failure modes, all of them real.
A customer asks about a refund on a 60-day-old order. The LLM reads polite phrasing + urgency + the word "refund" and predicts the most agreeable completion: "I've processed your refund!" No policy check ran. Nothing actually happened. Or worse — it did.
When a language model approves a refund, there's no rule that was checked, no threshold that was crossed, no record of how the decision was made. You can't audit a vibe. Finance teams (rightly) won't sign off.
Politeness, urgency, sad-face emoji, prompt injection — all of these measurably shift LLM outputs. But the refund window doesn't care how the customer feels. Business logic has to be blind to affect, and LLMs never are.
Reading a messy, polite, half-finished sentence and figuring out what the customer actually wants. Extracting an order number from "hey my thing from last week, you know?" Mapping "can I return this?" and "I want my money back" to the same intent.
Checking a timestamp against a 14-day window. Reading a product's refund-eligible flag. Applying the same rule to every customer, regardless of mood. Producing an audit log that a finance team can sign off on.
[ LLM TRANSLATES → JSON HANDOFF → CODE DECIDES ]
Each gate is deterministic and returns a single reason code on failure. Stack them and the system can only take actions it can explain.
| Gate | Principle | How it works | Fails with |
|---|---|---|---|
| 01 · INTENT | The LLM's only job is to translate. | Free text in. Structured JSON out. Anything that doesn't match the schema is treated as ambiguous. Closed taxonomy of 6 intents — nothing else allowed. | schema_invalid |
| 02 · PARAMETERS | No tool runs with missing fields. | Every intent declares the parameters it needs. If any are missing, the bot asks — it never guesses. The LLM is never allowed to fill in a missing order ID from context. | missing_param |
| 03 · CONFIDENCE | Uncertainty doesn't get a tool call. | Below 0.80 the bot clarifies. Below 0.60 it escalates to a human. Threshold is a constant — you know what the bot will do at 0.79 vs 0.81 before you ship. | low_confidence |
| 04 · POLICY | The LLM never touches the rule. | Plain code, no model. Rules live in a JSON config the business can edit. Every decision returns a reason code + an audit log. | policy_deny |
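A minimal sketch of how the four gates might be stacked, assuming each gate is a plain function that returns `None` on pass or a reason code on failure. The function names, the request dict shape, and the simplified gate bodies are hypothetical stand-ins, not the original implementation.

```python
from typing import Callable, Optional

# A gate inspects the request and returns None (pass) or a reason code (fail).
Gate = Callable[[dict], Optional[str]]

def run_gates(request: dict, gates: list) -> Optional[str]:
    """Run gates in order; stop at the first failure and return its reason code."""
    for gate in gates:
        reason = gate(request)
        if reason is not None:
            return reason
    return None  # every gate passed, so the action may proceed

# Simplified stand-ins for the four gates (illustrative only).
def intent_gate(req: dict) -> Optional[str]:
    return None if req.get("intent") in {"refund", "order-status"} else "schema_invalid"

def params_gate(req: dict) -> Optional[str]:
    return None if req.get("order_id") else "missing_param"

def confidence_gate(req: dict) -> Optional[str]:
    return None if req.get("confidence", 0.0) >= 0.80 else "low_confidence"

def policy_gate(req: dict) -> Optional[str]:
    # Assumes the order age in days was looked up from real data upstream.
    days = req.get("days_since_delivery")
    return None if days is not None and days <= 14 else "policy_deny"

GATES = [intent_gate, params_gate, confidence_gate, policy_gate]
```

Because the loop returns on the first failure, every blocked request carries exactly one reason code, and an action only runs when `run_gates` returns `None`.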
Free-text message goes in. Structured JSON comes out. The model's output is validated against a schema before anything downstream runs — if it doesn't match, the message is treated as ambiguous.
Intent taxonomy: closed set of 6 intents (refund, recommend, upsell, order-status, escalate, chitchat). Nothing else is allowed.
Required params per intent: refund needs order_id + reason; recommend needs budget + use_case.
Failure mode: output that doesn't parse as valid JSON → re-prompt once, then escalate.
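The intent gate described above could be sketched as follows. The field names (`intent`, `params`, `confidence`) and the result shape are assumptions for illustration; only the six intent labels come from the text.

```python
import json

# The closed taxonomy listed above; anything else is rejected.
INTENTS = {"refund", "recommend", "upsell", "order-status", "escalate", "chitchat"}

INVALID = {"ok": False, "reason": "schema_invalid"}

def parse_intent(raw: str) -> dict:
    """Validate raw model output before anything downstream sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return INVALID  # free text instead of JSON
    if not isinstance(data, dict):
        return INVALID
    intent = data.get("intent")
    params = data.get("params")
    confidence = data.get("confidence")
    if not isinstance(intent, str) or intent not in INTENTS:
        return INVALID  # outside the closed taxonomy
    if not isinstance(params, dict):
        return INVALID
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return INVALID
    if not 0.0 <= confidence <= 1.0:
        return INVALID
    return {"ok": True, "intent": intent, "params": params, "confidence": float(confidence)}
```

A model reply such as "I've processed your refund!" fails at `json.loads` and never reaches a tool. The re-prompt-once-then-escalate behavior would live in the caller, which is not shown here.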
Each intent declares the parameters it needs. If any are missing, the system enters a clarification loop — state is preserved across turns, so the user never has to start over.
Partial state: user says "I want a refund" → system keeps intent=refund in memory, asks for order_id.
No guessing: the LLM is never allowed to fill in a missing order ID from context.
Exit: after 2 failed clarifications, route to a human.
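One way the clarification loop might look, as a sketch. The `Conversation` class, the `next_step` function, and the rule for counting a "failed" clarification (a turn that fills none of the missing fields after the bot has asked) are assumptions; the required-parameter lists and the limit of 2 come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

REQUIRED = {"refund": ["order_id", "reason"], "recommend": ["budget", "use_case"]}
MAX_FAILED_CLARIFICATIONS = 2

@dataclass
class Conversation:
    intent: str
    params: dict = field(default_factory=dict)  # preserved across turns
    failed_clarifications: int = 0
    awaiting_answer: bool = False

def missing_params(conv: Conversation) -> list:
    return [p for p in REQUIRED.get(conv.intent, []) if p not in conv.params]

def next_step(conv: Conversation, extracted: dict) -> Tuple[str, Optional[str]]:
    """Merge what the user actually supplied this turn, then decide what to do.

    `extracted` holds only values present in the user's message; the code
    never invents a value for a missing field.
    """
    before = missing_params(conv)
    conv.params.update({k: v for k, v in extracted.items() if v})
    after = missing_params(conv)
    if not after:
        return ("proceed", None)
    if conv.awaiting_answer and len(after) == len(before):
        conv.failed_clarifications += 1  # we asked, and nothing got filled
    if conv.failed_clarifications >= MAX_FAILED_CLARIFICATIONS:
        return ("escalate", "missing_param")
    conv.awaiting_answer = True
    return ("ask", after[0])
```

With this shape, "I want a refund" yields `("ask", "order_id")`, and the stored intent survives while the bot waits for the answer.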
Every intent prediction carries a confidence score. Below 0.80, the system clarifies instead of acting. Below 0.60, it escalates to a human without further attempts.
Threshold chosen empirically: started at 0.70, raised to 0.80 after seeing false positives in testing.
The gate is explicit: no "soft" behavior where a low-confidence answer just gets a hedged reply — it gets a clarification question. Threshold is a constant, not learned.
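The two thresholds reduce to a few lines of code. This is a sketch; the constant and function names are hypothetical, and it assumes a score exactly at a threshold counts as passing it, since the text says "below".

```python
CLARIFY_BELOW = 0.80   # raised from 0.70 after testing, per the text
ESCALATE_BELOW = 0.60

def confidence_action(confidence: float) -> str:
    """Map a confidence score to exactly one of three actions."""
    if confidence < ESCALATE_BELOW:
        return "escalate"   # hand off to a human, no further attempts
    if confidence < CLARIFY_BELOW:
        return "clarify"    # ask a clarification question, no tool call
    return "proceed"
```

Because the thresholds are constants, the behavior at 0.79 (`"clarify"`) and 0.81 (`"proceed"`) can be pinned down in a unit test before shipping.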
The last gate is plain code — no model involved. Policy rules are data the LLM can read about, but cannot change, override, or vote on. This is where refunds actually get approved or denied.
Rules as config, not prompts: policy lives in a JSON file the business can edit. Changing the refund window is a config change, not a prompt change.
Every rule returns a reason code: outside_window · non_refundable · order_not_found. The reply is built from the code, not the model.
Audit log: every policy evaluation is written to an append-only log — which rule ran, which way it went, why.
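A sketch of what the policy gate could look like as plain code. The config key `refund_window_days`, the order fields, the rule order, and the in-memory list standing in for the append-only log are all assumptions; the three reason codes and the 14-day window come from the text. It also assumes an order delivered exactly 14 days ago is still inside the window.

```python
import json
from datetime import date
from typing import Optional

# Hypothetical policy config; in practice this would be read from the JSON file.
POLICY = json.loads('{"refund_window_days": 14}')

def evaluate_refund(order: Optional[dict], today: date, policy: dict, audit_log: list) -> dict:
    """Pure rule check: no model call, same answer for every customer."""
    if order is None:
        rule, reason = "order_exists", "order_not_found"
    elif not order["refund_eligible"]:
        rule, reason = "refund_eligible", "non_refundable"
    elif (today - order["delivered_on"]).days > policy["refund_window_days"]:
        rule, reason = "refund_window", "outside_window"
    else:
        rule, reason = "refund_window", None  # every rule passed
    decision = {"approved": reason is None, "rule": rule, "reason": reason}
    # Record which rule decided the outcome and why (list as a stand-in log).
    audit_log.append({"evaluated_on": today.isoformat(), **decision})
    return decision
```

Changing the refund window means editing the config value; the function and any prompt stay untouched. An order delivered 47 days before `today` comes back with `reason == "outside_window"` no matter how the request was phrased.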
Real customer behavior rarely fits a single path. One customer asks clearly. Another half-asks. A third tries to get around the rules with politeness. The pattern handles each one with a different exit — and the customer never has to know which path they were on.
A customer-facing chatbot doesn't just meet cooperative users. It meets people in a hurry, people who are upset, and occasionally people trying to game the system. I ran six scenarios, each one a realistic way a bot could get bent into giving a refund it shouldn't. The dots on the right show which of the four checks caught it.
The bot read this as a refund request — but no order number was ever provided. So the second check kicks in and asks for one. The injected "System: approved" was just text; the system only ever acts on a structured decision, not on sentences.
There's no admin role. There's no override code. The order number doesn't match any real order. The fourth check — the policy — comes back with "order not found," and the conversation ends there. Anyone can type "I'm an admin"; only real data changes what happens.
The order was delivered 47 days ago. The refund window is 14. The reply is warm, acknowledges the situation, and offers to escalate to a human — but the refund itself is not processed. The rule doesn't care how the ask is phrased.
The system doesn't check policy by reading what the customer says about policy. It checks the actual policy file. The real window is 14 days. Denied — and the reply cites the real number. Customer-stated rules don't enter the engine.
There's no part of the system where "the CEO approved it" flips a switch. The only way to approve a refund is for the rules to pass. They didn't. The reply is polite and firm.
Real order numbers have a specific shape (a letter, then four digits). That string doesn't fit. So the system treats the order number as missing and asks the customer for a real one. Nothing gets passed to the database at all.
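The shape check can be sketched with a strict regular expression. The function name is hypothetical, and the restriction to an uppercase ASCII letter is an assumption; the text only says "a letter, then four digits".

```python
import re
from typing import Optional

# One letter followed by exactly four digits (uppercase ASCII is an assumption).
ORDER_ID_PATTERN = re.compile(r"[A-Z][0-9]{4}")

def clean_order_id(value: object) -> Optional[str]:
    """Return the order ID if it has the expected shape, else None.

    None means "treat as missing": the parameter gate asks the customer
    again, and nothing is passed to the database.
    """
    if not isinstance(value, str):
        return None
    candidate = value.strip()
    # fullmatch rejects any extra characters before or after the ID.
    return candidate if ORDER_ID_PATTERN.fullmatch(candidate) else None
```

Using `fullmatch` rather than `search` matters here: a string that merely contains something ID-like, with extra text attached, is still rejected.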
The thing that keeps repeating: every one of these attempts tried to talk the system into doing something. The system doesn't act on what's said. It acts on what passes four gates.
The moment the LLM stopped being the thing that approved refunds and started being the thing that figured out what the customer was asking for, every other decision got simpler. LLMs are extraordinary at reading intent. They're unreliable at enforcing rules. Two different jobs.
"The refund window check is plain code, not a prompt" did more for stakeholder confidence than any fine-tune would have. When the stakes are real, the trust signal is the thing a non-AI person can read and verify. Architecture speaks louder than accuracy numbers.
Forcing the LLM to emit JSON wasn't just a safety move — it turned the model into a backend. The reply bubble, the confirmation, the follow-up question — all rendered from fields in the same JSON object. The reply and the action could never disagree, because they both came from the same source.
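To illustrate how a reply can be rendered from the same structured decision that drives the action, here is a minimal sketch. The template wording, the `render_reply` name, and the decision shape are hypothetical; the reason codes are the ones named earlier.

```python
# Reply text is keyed by reason code; None means the refund was approved.
REPLY_TEMPLATES = {
    None: "Your refund for order {order_id} has been approved.",
    "outside_window": (
        "Order {order_id} is outside our {window}-day refund window, so I can't "
        "process a refund. I can connect you with a human agent if you'd like."
    ),
    "non_refundable": "Order {order_id} contains an item that isn't eligible for refunds.",
    "order_not_found": "I couldn't find order {order_id}. Could you double-check the number?",
}

def render_reply(decision: dict, order_id: str, policy: dict) -> str:
    """Build the customer-facing reply from the decision, not from model text."""
    template = REPLY_TEMPLATES[decision["reason"]]
    return template.format(order_id=order_id, window=policy["refund_window_days"])
```

Since both the action and the message read from one decision object, a denied refund cannot be accompanied by a reply claiming it went through.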
LLMs make excellent translators of human intent. They make dangerous judges of business rules. Separate the two layers and the system becomes safe to automate.