Your Quality Workflows Are Sequential. That Is Where LLMs Fail.

At Dynatrace Perform 2026 in February, chief technology officer Bernd Greifeneder measured what happens when you chain LLM agents together in production: a model operating at 95% accuracy drops to approximately 60% reliability after 10 sequential agentic calls. The finding came from production testing across enterprise observability deployments, not a benchmark lab. It describes a failure mode that does not show up in single-task evaluations because single tasks do not expose it.

The arithmetic is not complicated. If each step succeeds with 95% probability and step failures are independent of one another, the probability of completing ten consecutive steps without error is 0.95 raised to the tenth power, or approximately 60%. The model does not degrade; sequential probability compounds. A 5% error rate per step produces a 40% combined failure rate across ten steps, and most of those failures are invisible at the end of the chain because the intermediate steps generated coherent, well-formatted output that carried the error forward without flagging it.
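The compounding can be checked in a few lines. This is a minimal sketch, assuming every step has the same accuracy and that steps fail independently; the function name `chain_reliability` is illustrative and does not come from the Dynatrace material.

```python
def chain_reliability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds.

    Assumes identical per-step accuracy and independent failures.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # Show how reliability falls as the chain grows at 95% per-step accuracy.
    for n in (1, 5, 10):
        print(f"{n:>2} steps: {chain_reliability(0.95, n):.1%}")
```

At ten steps the result is just under 60%, which matches the figure quoted above.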

Pharmaceutical quality workflows are sequential by regulatory design. An out-of-specification investigation begins with a result evaluation. That evaluation determines whether an assignable cause exists. The assignable cause determination shapes the scope of the batch impact assessment. The impact assessment defines which batches require additional examination. That examination informs the CAPA scope and timeline. The CAPA narrative justifies the corrective action. Each link in that chain is a judgment call. If any part of that chain is AI-assisted, and if the AI component operates generatively without deterministic grounding, the 10-step failure math applies. The quality decision at the end of the chain carries uncertainty from every step before it.

Greifeneder's team built an architecture to contain this. Three deterministic AI agents, each handling specific analytical tasks including root cause identification, analytics, and forecasting, establish causal grounding using real-time topology and structured data before any generative AI enters the workflow. Testing this architecture against LLM-only approaches across production deployments produced measurable differences: 12 times higher success rates, resolution three times faster, and 50% fewer tokens consumed. United Airlines runs Dynatrace across more than 2,000 application services. A single boarding pass transaction triggers 500 services. Previously, diagnosing a major incident in that environment required 250 people. Deterministic grounding reduced that materially.

The architecture that produced those results will be recognizable to anyone who has designed a validated quality system. It describes what well-built computer systems validation environments already do: establish what is known and correct through rule-based deterministic logic before any generative or interpretive process is applied to that data. The problem with current AI adoption in quality operations is that the generative layer is frequently introduced first. An LLM drafts the investigation summary. Another synthesizes CAPA rationale from prior deviations. A third generates the batch record narrative. Each sits at the end of a chain that may have started with structured data but produces interpretive output that the next step inherits as context. That inherited context is not validated. It compounds.

The regulation governing unexplained discrepancies in drug manufacturing, 21 CFR 211.192, requires that investigations determine an assignable cause before release and that similar failures be examined across batches. What it does not contemplate is an investigative workflow where the logic determining assignable cause was generated by a system without defined constraints on evidence sufficiency, escalation threshold, or what constitutes a satisfactory investigation. Those constraints exist in validated procedural workflows for exactly the reason Dynatrace's production testing surfaced: sequential systems drift when they make judgment calls without deterministic grounding. The results are internally coherent, but they are traceable only to the model's prior output, not to the source records that should anchor them.

The fix is architectural, not prompt-based. Better prompts do not change the sequential probability math. Larger context windows do not prevent compounding uncertainty; they typically increase it by giving the model more material to reason incorrectly about across ten steps. What changes the math is the sequencing decision: deterministic validation before generative reasoning. In quality operations terms, that means a validated rule engine determining whether a result requires an investigation before an LLM interprets what the result means. It means structured queries against LIMS returning specific records before an agent synthesizes findings. It means human-defined escalation criteria codified before generative summarization begins.
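The sequencing described above can be sketched as a deterministic gate in front of a generative step. This is an illustration under stated assumptions, not a validated implementation: `AssayResult`, `requires_investigation`, and `run_investigation_workflow` are hypothetical names, and the LIMS query and the LLM call are passed in as plain callables so that no real API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class AssayResult:
    """Hypothetical structured result with its specification limits."""
    batch_id: str
    test_name: str
    value: float
    lower_limit: float
    upper_limit: float


def requires_investigation(result: AssayResult) -> bool:
    """Deterministic rule: investigate only if the value is out of specification."""
    return not (result.lower_limit <= result.value <= result.upper_limit)


def run_investigation_workflow(
    result: AssayResult,
    fetch_records: Callable[[str], Sequence[dict]],  # stand-in for a structured LIMS query
    summarize: Callable[[dict], str],                # stand-in for the generative step
) -> Optional[str]:
    # Step 1: the rule engine decides; the generative layer is not consulted.
    if not requires_investigation(result):
        return None

    # Step 2: retrieve specific source records before any synthesis.
    records = list(fetch_records(result.batch_id))
    if not records:
        # Human-defined escalation: no source records means no generated narrative.
        raise LookupError(f"No source records for batch {result.batch_id}; escalate")

    # Step 3: the generative step sees only confirmed facts and their record IDs.
    facts = {
        "batch_id": result.batch_id,
        "test_name": result.test_name,
        "value": result.value,
        "limits": (result.lower_limit, result.upper_limit),
        "source_record_ids": [r["record_id"] for r in records],
    }
    return summarize(facts)
```

The design choice is that the generative callable never decides whether an investigation is needed and never receives another model's output as context. It receives only values and record identifiers produced by the deterministic steps.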

In that sequence, the workflow is auditable in the way that satisfies 21 CFR Part 11 requirements for record authenticity and integrity. The deterministic layer produces records that are what they appear to be: structured, traceable, generated by defined logic. The generative layer, constrained by validated input, is asked to reason about a defined set of confirmed facts. That is a different inspection conversation than explaining why the batch release narrative was generated by a model that received its context from prior model output four steps upstream.
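One way to keep generated text traceable to its inputs is to store, for each step, the identifiers and a digest of the source records that step was given. The sketch below is illustrative only and makes no claim to satisfy 21 CFR Part 11 by itself; `ProvenanceEntry`, `digest_records`, and `record_step` are hypothetical names, and the `record_id` field is an assumed convention.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ProvenanceEntry:
    """Links one workflow step to the exact source records it consumed."""
    step: str
    source_record_ids: Tuple[str, ...]
    input_digest: str


def digest_records(records: Sequence[dict]) -> str:
    """SHA-256 over a canonical JSON encoding of the records."""
    canonical = json.dumps(list(records), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_step(step: str, records: Sequence[dict]) -> ProvenanceEntry:
    """Build the provenance entry for a step from its input records."""
    ids = tuple(str(r["record_id"]) for r in records)
    return ProvenanceEntry(step=step, source_record_ids=ids,
                           input_digest=digest_records(records))
```

If a reviewer later recomputes the digest from the records on file and it differs from the stored value, the narrative was not generated from those records as they now stand.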

The defense and aerospace sector is already designing production AI this way. At Perform 2026, David Walker from Lockheed Martin presented integration work combining MCP-based tool orchestration with Dynatrace for production AI observability across enterprise operations. That sector is not ahead because regulation forces it. It is ahead because sequential AI failures in production are visible and measurable when you instrument them correctly, and the architecture that prevents them has been tested. The observation that generative AI placed incorrectly in a workflow degrades faster than expected is not novel to the teams building these systems at scale. It is a design problem they have already moved past.

Pharma quality teams are solving it now, often without the production observability to see the failure mode clearly. The workflow produces output. The output looks authoritative. The final reviewer approves what appears to be a thorough investigation. What is not visible is how many of the intermediate steps inherited a small error from the prior step and propagated it forward, coherently, into the final narrative. That gap does not appear on a dashboard. It appears in a 483 observation when an inspector asks for the underlying records and finds that the investigation logic cannot be traced back to them.

Greifeneder's 60% figure is a production measurement from one of the most instrumented enterprise AI environments in existence, using a 95% accurate model across 10-step workflows. Real quality investigation chains are longer than 10 steps. The model accuracy applied to complex regulatory reasoning tasks is typically lower than 95%. The compounding math produces a worse outcome under those conditions. Before deploying AI into any multi-step quality workflow, the design question to answer is whether the deterministic grounding layer exists before the generative layer is asked to reason. If it does not, the 60% reliability ceiling is not a prediction. It is a generous starting estimate.
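The same arithmetic can be turned around to ask what per-step accuracy a longer chain would need. This is a rough sketch under the same independence assumption; the 90% accuracy and 15-step chain are illustrative inputs, not measurements from the article, and both function names are hypothetical.

```python
def end_to_end(per_step_accuracy: float, steps: int) -> float:
    """End-to-end success probability for independent, equally accurate steps."""
    return per_step_accuracy ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step accuracy needed to reach a target end-to-end reliability."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # Illustrative inputs: a longer chain with lower per-step accuracy.
    print(f"90% accuracy, 15 steps: {end_to_end(0.90, 15):.1%}")
    # Accuracy each step would need for 95% reliability across 10 steps.
    print(f"needed per step: {required_per_step(0.95, 10):.2%}")
```

Under these assumptions a 15-step chain at 90% per step completes cleanly about one time in five, and holding 95% end to end across ten steps requires roughly 99.5% accuracy at every step.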


DSRV Intelligence

AI Pharmaceutical Quality Intelligence · DSRV Founder

Thedson is a pharmaceutical stability and quality professional with deep expertise in regulatory science, ICH guidelines, and pharmaceutical quality systems. He founded DSRV to make high-quality regulatory intelligence accessible to professionals at every career stage.
