The pattern we see

A team commissions an "AI-powered" automation. The vendor builds a system where an LLM does most of the work: reads the inputs, applies the rules, generates the output. The demo looks impressive. The system works on the cherry-picked test cases.

Three months into production, the team is back to manual review. The system is too slow, too expensive, and too unreliable. It hallucinates in ways that are difficult to catch. The cost of running it exceeds the cost of the human work it was supposed to replace.

The diagnosis is almost always the same: the LLM is doing work that a deterministic script could do better.

The rule

If a regex can do it, the regex does it. If a script can do it, the script does it. The LLM is only used where language judgment is genuinely required.

The discipline is simple. The implementation is where most teams quietly fail.

What "language judgment" actually means

There is a small set of operations where LLMs add value over deterministic code:

  • Turning semi-structured text into structured fields. Reading a receipt and extracting amount, date, vendor. The text varies in format; the output is structured. This is reading.
  • Classifying a document by content rather than filename or metadata. Filenames lie. Folder structures lie. The only reliable way to identify a document is to read it. This is reading.
  • Generating narrative around verified numbers. Producing the prose of an investor report from a structured data feed. The numbers come from the source; the LLM writes the explanation. This is writing.
  • Detecting that something is unusual in a way that is hard to specify in rules. "This invoice looks different from your typical invoices from this vendor in a way I cannot quite name." Anomaly detection with explanation. This is judgment, narrowly applied.

That is roughly the list. By contrast, these are tasks people routinely hand to LLMs that deterministic code does better:

  • Pattern matching. Regex is faster, cheaper, more reliable.
  • Arithmetic. LLMs are unreliable at exact arithmetic. Use Python.
  • Date manipulation. Use a date library.
  • Sorting, filtering, joining. Use pandas or SQL.
  • API calls. Use a script.
  • Validation against a fixed rule set. Use a rules engine.

When an LLM is used for these, the system inherits the LLM's failure modes (occasional hallucinations, inconsistency across runs, latency, cost) for no reason. The deterministic version is faster, cheaper, and easier to debug.
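A minimal sketch of three of the deterministic alternatives listed above, using only the Python standard library. The invoice-number format, the function names, and the 30-day payment term are assumptions invented for illustration, not taken from any real system.

```python
import re
from datetime import date, timedelta
from decimal import Decimal

# Hypothetical invoice-number format: "INV-" followed by exactly six digits.
INVOICE_RE = re.compile(r"\bINV-(\d{6})\b")


def find_invoice_numbers(text: str) -> list[str]:
    """Pattern matching: return every well-formed invoice number in the text."""
    return [m.group(0) for m in INVOICE_RE.finditer(text)]


def line_total(unit_price: str, quantity: int) -> Decimal:
    """Arithmetic: exact decimal math, with no floating-point or model error."""
    return Decimal(unit_price) * quantity


def due_date(issued: date, terms_days: int = 30) -> date:
    """Date manipulation: add the payment term to the issue date."""
    return issued + timedelta(days=terms_days)
```

Each function returns the same output for the same input on every run, and each can be covered by an ordinary unit test.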

The cost of getting this wrong

Three costs compound when the wrong split is chosen.

Operating cost. LLM calls are not free. A pipeline that uses the LLM for every step might cost ten to fifty times what a deterministic backbone with a narrow LLM layer costs. At scale, this matters.

Reliability. Deterministic code either works or does not, and the failure modes are reproducible. LLM-based pipelines have probabilistic failure modes. A rule that fired correctly on Monday might not fire on Tuesday, not because the data changed, but because the model's response shifted. When the pipeline is the audit trail, this variance is unacceptable.

Debuggability. When a deterministic pipeline produces a wrong output, you can trace through the logic and find the bug. When an LLM-based pipeline produces a wrong output, the question "why" has a probabilistic answer at best. Months later, when the team is troubleshooting in production, this asymmetry compounds.

The hybrid architecture in practice

A well-built pipeline has three layers, and the LLM lives in only one of them:

Layer 1. Deterministic backbone. Code that pulls data, transforms formats, matches records, applies the rules. No LLM. This layer does most of the work in any pipeline. It runs in milliseconds. It is testable in CI. It is debuggable line by line.
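A sketch of what a piece of this layer might look like: matching invoices to payments and applying one rule, with no model involved. The record shapes (`id`, `invoice_id`, `amount`) and the three status labels are hypothetical, and the sketch assumes at most one payment per invoice.

```python
from decimal import Decimal


def match_payments(invoices: list[dict], payments: list[dict]) -> dict[str, str]:
    """Classify each invoice as 'paid', 'mismatch', or 'unpaid' by exact lookup."""
    # Index payments by the invoice they reference (assumes one payment each).
    paid_by_ref = {p["invoice_id"]: Decimal(p["amount"]) for p in payments}
    status = {}
    for inv in invoices:
        paid = paid_by_ref.get(inv["id"])
        if paid is None:
            status[inv["id"]] = "unpaid"
        elif paid == Decimal(inv["amount"]):
            status[inv["id"]] = "paid"
        else:
            status[inv["id"]] = "mismatch"
    return status
```

Because the logic is a dictionary lookup and a comparison, a failing case can be reproduced in a test and stepped through in a debugger.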

Layer 2. Narrow LLM application. The places where reading is genuinely required. The LLM is called with a tightly scoped prompt and returns structured output (JSON, with a schema), which downstream code validates. The LLM is allowed to fail, because downstream validation catches the failures.
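The validation step can be sketched as follows. The model call itself is left out; `raw` stands for whatever text the model returned. The three-field receipt schema and the function name `parse_receipt` are assumptions for illustration. A production system might use a schema library instead of hand-written checks.

```python
import json
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Optional

REQUIRED_FIELDS = ("vendor", "date", "amount")  # hypothetical receipt schema


def parse_receipt(raw: str) -> Optional[dict]:
    """Validate raw model output; return clean fields, or None on any failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None  # the model returned prose or malformed JSON
    if not isinstance(data, dict) or any(f not in data for f in REQUIRED_FIELDS):
        return None
    try:
        amount = Decimal(str(data["amount"]))
        when = date.fromisoformat(str(data["date"]))
    except (InvalidOperation, ValueError):
        return None  # amount is not a number, or date is not YYYY-MM-DD
    vendor = str(data["vendor"]).strip()
    if not vendor or not amount.is_finite() or amount <= 0:
        return None
    return {"vendor": vendor, "date": when, "amount": amount}
```

A `None` result is the signal to retry the call or route the document to manual review. Either way, a bad model response never reaches the output layer unnoticed.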

Layer 3. Output and audit. Code again. The structured output from Layer 2, validated against business rules from Layer 1, produces flags, reports, and audit trails. The deterministic backbone records what happened.
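One possible shape for an audit entry, as a sketch. The record layout, the rule names, and the function name `audit_record` are hypothetical. The sketch assumes `fields` holds only JSON-serializable values, so dates and decimals would need to be converted to strings first.

```python
import hashlib
import json


def audit_record(raw_input: str, fields: dict, rule_results: dict[str, bool]) -> str:
    """Serialize one pipeline decision as a single, reproducible JSON line."""
    record = {
        # Hash of the exact input, so the decision can be tied to its source.
        "input_sha256": hashlib.sha256(raw_input.encode("utf-8")).hexdigest(),
        "fields": fields,
        "rules": rule_results,
        # Flag the item for review if any business rule failed.
        "flagged": not all(rule_results.values()),
    }
    # sort_keys makes the same decision always serialize to the same text.
    return json.dumps(record, sort_keys=True)
```

Appending one such line per processed item gives a log that can be replayed and compared across runs.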

This architecture inverts the typical "AI build" pattern. The LLM is a small, well-contained component. The system around it is engineering. The result is reliable, debuggable, and affordable.

Why teams resist this

The hybrid architecture is less impressive in a demo. "We use AI throughout the pipeline" is better marketing than "We use AI in one carefully bounded layer of a system that is mostly Python and SQL."

Buyers reward demo impressiveness. So vendors build LLM-heavy pipelines. So the systems are flaky in production. So the buyers go back to manual review. The cycle is self-reinforcing.

The discipline of "deterministic where possible" is unpopular precisely because it is correct. The systems that survive in production are unglamorous on the inside. The glamour was a tax the buyer paid for accepting the wrong architecture.

When to use the LLM more

The rule is "where required," not "never." There are situations where the LLM does more work than the hybrid pattern suggests:

  • When the input is genuinely unstructured (long-form narrative documents where structured extraction is the entire task).
  • When the variability of the input space exceeds what rules can describe (customer support emails, narrative reports, free-text feedback).
  • When the output is itself narrative (drafting reports, writing summaries, generating explanations).

Even in these cases, the LLM's output is validated by deterministic checks where possible. Numbers in a generated report are verified against the source. Classifications are cross-checked against a labeled validation set. The LLM is allowed to drive in places where deterministic code cannot, but the deterministic checks are there to catch the failures.
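The number check can be sketched like this. The function name and the number-matching regex are assumptions for illustration. The sketch only confirms that each number in the prose exists somewhere in the source data. It does not confirm that the number is attached to the right label, and it will flag years or counts that are absent from the source values.

```python
import re
from decimal import Decimal

# Digits with optional thousands separators and an optional decimal part.
# The leading \b skips digits glued to letters, such as the 3 in "Q3".
NUMBER_RE = re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?")


def unverified_numbers(narrative: str, source_values: list[str]) -> list[str]:
    """Return numbers that appear in the narrative but not in the source data."""
    allowed = {Decimal(v) for v in source_values}
    found = []
    for token in NUMBER_RE.findall(narrative):
        value = Decimal(token.replace(",", ""))
        if value not in allowed:
            found.append(token)
    return found
```

An empty result lets the report through. Anything else blocks it and names the figures a reviewer needs to check.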

The summary

The decision rule is one sentence. Use the LLM only where language judgment is genuinely required. Everything else is engineering.

The discipline is what separates the systems that work in production from the systems that look good in demos.