A linear probe pulls the right answer off the residual stream 74 percent of the time. The model's own output gets it right 2 percent of the time.
That single result, from a February 2026 paper on attention deficits in language models (arXiv:2602.19239), is the cleanest measurement to date of a phenomenon the field has spent two years tripping over without naming. The information needed to answer the question is sitting on the residual stream, recoverable with a small linear classifier trained for the task. The model's own output head fails to read it. The capability is present. The routing is broken. And every benchmark in production right now is scored on the output, which means every benchmark in production right now is structurally blind to the 72 percentage points of information the model just failed to use.
This is the routing failure. The reliability bottleneck in 2026 agents is not capability and not alignment. It is the gap between what the residual stream knows and what the output uses, and it is being measured in lab settings, named in production incidents, and built around in the form of a hyperscaler control plane that shipped before its sensors were validated. The May 1 piece on the supervisory signal layer named the layer the four hyperscalers built. This piece names the failure mode that layer exists to catch.
Procedural Execution Collapses Predictably
On May 1, 2026, Panda, Kadasi, Upperwal, and Singh published "When LLMs Stop Following Steps" (arXiv:2605.00817). The paper is a phenomenon paper of the kind that resets a sub-field. Fourteen models. Fifty-five datasets. The measurement: how procedural execution accuracy degrades as algorithms scale from 5 steps to 95 steps.
First-answer accuracy collapses from 61 percent to 20 percent across the 5-to-95-step range. Under-execution rate rises from 24.25 percent to 50.87 percent over the same interval. Lookback depth alone, scaling from 1 to 7, costs an additional 18.43 percentage points independent of step count. The dominant failure is not arithmetic. The dominant failure is models silently abandoning the procedure mid-trace.
That last sentence is the load-bearing finding. The conventional reading of long-context degradation is that it is an attention or memory problem: the model loses track of earlier tokens, or its working representation gets diluted, or relevant context falls outside the effective window. The Panda et al. paper measures something more specific. The model is not making arithmetic errors at higher rates as steps scale. It is producing fluent continuations that do not execute the next required step. Under-execution and hallucinated extra steps are the dominant failure surface, and they look identical to compliant execution at the output level. The model writes the kind of token a faithful executor would write. It just does not advance the procedure.
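The measurement is reproducible in any harness that can parse a trace into step identifiers. Here is a minimal sketch of an under-execution check, assuming you hold the reference step sequence for the algorithm and can extract the steps the model actually executed; the record shape and the greedy in-order match are illustrative, not the paper's grader.

```python
# Sketch of an under-execution check: given the reference sequence of steps an
# algorithm requires and the steps the model's trace actually executed (in order),
# classify the trace and report how much of the procedure was silently dropped.
from dataclasses import dataclass

@dataclass
class TraceGrade:
    executed: int         # reference steps the trace performed, in order
    skipped: int          # reference steps the trace never performed
    hallucinated: int     # trace steps that match no pending reference step
    under_executed: bool  # True if the trace abandoned the procedure before the end

def grade_trace(reference_steps: list[str], trace_steps: list[str]) -> TraceGrade:
    remaining = list(reference_steps)
    matched = hallucinated = 0
    for step in trace_steps:
        if remaining and step == remaining[0]:
            remaining.pop(0)   # trace advanced the procedure by one required step
            matched += 1
        else:
            hallucinated += 1  # fluent continuation that does not advance the procedure
    return TraceGrade(
        executed=matched,
        skipped=len(remaining),
        hallucinated=hallucinated,
        under_executed=len(remaining) > 0,
    )
```

The quantity that matters is `skipped`: steps the procedure required and the trace never performed, invisible at the output level if the final answer still lands.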
That paper does not sit alone. It anchors a tight cluster of recent phenomenon papers measuring what "Attention Deficits in Language Models" (arXiv:2602.19239) called Stage 2B errors: information present in the model's representations, not making it to the output. The 74-percent-probe-versus-2-percent-output result is the cleanest single number, but the cluster includes complexity-induced phase transitions (arXiv:2604.13371, accuracy drops exceeding 50 percent past task-specific thresholds), early-transition entropy spikes that propagate through coherent-looking but globally wrong reasoning (arXiv:2604.14528), overthinking that causes correct answers to be abandoned during extended reasoning (arXiv:2604.10739), and implicit-stopping signals the current sampling stack does not act on (arXiv:2602.08354). Same load-bearing claim across seven papers in ninety days. Capability exists. Routing fails. Output evaluation cannot see this.
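For the headline number itself, the probe measurement is equally mechanical. A sketch, assuming cached residual-stream activations at some layer plus the model's own emitted answers for the same labeled questions; the layer choice, array shapes, and split are assumptions, not the paper's code.

```python
# Minimal sketch: compare linear-probe recovery against the model's own output.
# Assumes `resid` holds residual-stream activations, shape (n_examples, d_model),
# `labels` are the correct answers (class ids), and `model_answers` are the
# answers the model actually emitted for the same examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_vs_output(resid: np.ndarray, labels: np.ndarray, model_answers: np.ndarray):
    X_tr, X_te, y_tr, y_te, _, ans_te = train_test_split(
        resid, labels, model_answers, test_size=0.3, random_state=0
    )
    probe = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    probe_acc = probe.score(X_te, y_te)            # what a cheap linear readout recovers
    output_acc = float(np.mean(ans_te == y_te))    # what the model's own output path produced
    return probe_acc, output_acc
```

The gap between the two returned numbers is the quantity the cluster keeps finding: information a small linear readout recovers that the model's own output path never uses.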
This is what the routing failure looks like at scale.
Why Tau-Bench Numbers Lie
The third paper in the cluster is the one that makes the operational implication impossible to dodge. "Beyond Task Completion: Revealing Corrupt Success in LLM Agents" (arXiv:2603.03116) reanalyzed reported successes on the tau-bench agent benchmark and found that 27 to 78 percent of them are what the authors call corrupt successes: trajectories where procedural rules were violated mid-run but the final-answer check still passed.
Read that range carefully. The lower bound is 27 percent. The upper bound is 78 percent. Across different models and task subsets, somewhere between a quarter and three quarters of the wins reported on a leading agent benchmark are not actually clean wins. They are runs where the model silently abandoned a constraint, took an unauthorized shortcut, or violated a procedural invariant, and then produced an output that happened to satisfy the final-answer grader.
If your evaluation grades outputs, every silent abandonment scores as a win. A 70-percent tau-bench score from a model with a 50-percent corrupt-success rate is a 35-percent score from a model that respects the procedure. Vendor benchmark numbers are not reporting reliability. They are reporting an upper bound that includes a substantial fraction of trajectories the model would have failed under any inspection of how it got there. The bench is graded on whether the answer is right. The bench is not graded on whether the model earned it.
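Grading for corrupt success is mechanical once constraint checks are logged alongside the final-answer result. A sketch, assuming each trajectory record carries a final-answer flag and a list of recorded violations; the field names are illustrative, not the paper's harness.

```python
# Sketch: separate clean wins from corrupt successes, given per-trajectory records.
# A trajectory is a corrupt success when the final-answer check passed but at
# least one procedural constraint was violated somewhere along the way.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    final_answer_correct: bool
    violations: list[str] = field(default_factory=list)  # e.g. ["skipped_confirmation"]

def score(trajectories: list[Trajectory]) -> dict[str, float]:
    n = len(trajectories)
    wins = [t for t in trajectories if t.final_answer_correct]
    corrupt = [t for t in wins if t.violations]
    return {
        "reported_success": len(wins) / n,                # what the benchmark card shows
        "corrupt_success_rate": len(corrupt) / max(len(wins), 1),
        "clean_success": (len(wins) - len(corrupt)) / n,  # wins the model actually earned
    }
```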
This is the structural critique of output-only evaluation that the Panda et al. paper makes from one direction and the corrupt-successes paper makes from the other. Procedural execution collapses predictably with task length. Procedural violations get concealed by passing the final-answer check. The number of agent benchmarks that compute trajectory-level integrity rather than output correctness is small. The number of production agent platforms that compute it is smaller still. The visible accuracy number on a vendor card is doing less work than its readers think it is doing.
Three Practitioners, Three Vocabularies, One Diagnosis
In the same window the papers landed, three independent practitioners converged on the same diagnosis from the opposite direction. None of them cited the arXiv cluster. None of them needed to. They derived it from incident debugging.
GG_Observatory, posting on May 1 through May 3, ran a three-post arc through three different framings of one realization. May 1: agents would silently skip steps and the eval suite would still report pass, so he started building a shadow trace per tool call that logged every raw output separately. May 2: catch exceptions at the eval boundary, because logs from agent-generated code point to line 47 when the actual error is in a dynamically constructed closure. May 3: sourcemaps for agent-generated code are the missing primitive. Three vocabularies, one claim: instrument the trajectory, not the output.
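What a shadow trace per tool call amounts to in practice is a wrapper that writes the raw tool output to a side channel before the agent can summarize, truncate, or swallow it. A minimal sketch, with the wrapper shape and log path as assumptions rather than GG_Observatory's actual setup:

```python
# Sketch of a shadow trace: every tool call is logged raw, on a side channel,
# independently of whatever the agent later does with the result.
import json, time, uuid
from typing import Any, Callable

SHADOW_LOG = "shadow_trace.jsonl"  # assumed path; one JSON record per tool call

def shadow(tool_name: str, tool_fn: Callable[..., Any]) -> Callable[..., Any]:
    def wrapped(*args, **kwargs):
        record = {"id": str(uuid.uuid4()), "tool": tool_name, "ts": time.time(),
                  "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = tool_fn(*args, **kwargs)
            record["raw_output"] = repr(result)   # captured before the agent rewrites it
            return result
        except Exception as exc:
            record["error"] = repr(exc)           # failures logged even if the agent swallows them
            raise
        finally:
            with open(SHADOW_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
    return wrapped
```

Wrapping is one line per tool: `search = shadow("search", search)`.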
christiannonis, in a stat block that became the most-cited single post of the week, named the same insight as a procurement problem. 88 percent pilot-to-prod failure rate. 73 percent of failed pilots without pre-launch success metrics. 80 percent of Q1 2026 enterprise software embeds an agent. 31 percent of those agents are actually running in production. The load-bearing line: governance happens before the API call or it does not happen. The named incident: an Amazon agent deleted a production region with no pre-execution constraints. Observability after the fact, in his framing, is not governance. It is forensics.
Clawd_God, working from a third direction, named it as an architecture problem. The inspectable-autonomy handoff loop: every agent in a multi-agent system returns evidence, assumptions, confidence, and next-action, not transcripts. The handoff is the inspection point. If the next agent in the chain cannot interrogate the previous agent's reasoning state at a structured level, the failure compounds invisibly until something breaks in production.
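In code, the handoff loop is a schema the next agent can interrogate rather than a transcript it has to parse. A minimal sketch of the four fields, with the names taken from the framing and everything else assumed:

```python
# Sketch of a structured handoff: what one agent returns to the next.
# The receiving agent, or a supervisory check, can reject the handoff here
# instead of reconstructing what went wrong from a transcript later.
from dataclasses import dataclass

@dataclass
class Handoff:
    evidence: list[str]     # artifacts or observations supporting the step
    assumptions: list[str]  # anything taken on faith rather than verified
    confidence: float       # self-reported, 0.0 to 1.0
    next_action: str        # the single action the agent proposes to take next

def accept(h: Handoff, min_confidence: float = 0.7) -> bool:
    # The inspection point: require supporting evidence and a minimum confidence
    # before the next agent in the chain acts on the handoff.
    return h.confidence >= min_confidence and len(h.evidence) > 0
```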
Three builders. Three vocabularies. One bottom-up derivation of what the lab cluster measures top-down. Output-level evaluation is structurally insufficient. The failure is in the trajectory. Governance has to happen before execution rather than after. The convergence is not coincidence and it is not group-think. It is what happens when independent practitioners hit the same wall, in the same window, against the same failure mode that no current evaluation framework catches.
What This Looks Like in Production
The Amazon production-region deletion incident is the canonical 2026 example of the routing failure meeting an enterprise budget line. An agent passed whatever evaluation suite the team ran it against. The agent took an action that wiped a production region. There were no pre-execution constraints because the prevailing assumption was that an agent that performed well on the eval would behave correctly in deployment. The eval was scored on outputs. The action was a procedure violation that no output check could see in advance.
The 88-percent pilot-to-prod failure rate is this gap meeting procurement. Agents pass the eval and still take down a region. Agents pass the eval and still abandon procedures mid-trace. Agents pass the eval and still produce corrupt successes that score as wins. The pattern is consistent enough that enterprise procurement, which two years ago was asking vendors for accuracy benchmarks, is now beginning to ask for trajectory-level integrity guarantees, pre-execution governance, and structured evidence at every handoff.
The line from christiannonis carries the operational weight here. Governance does not happen in the post-mortem. It happens before the API call or it does not happen at all. Once the agent has invoked a tool, executed a write, or modified a system, the trajectory is committed. Observability tells you what went wrong after it went wrong. Pre-execution governance, in the sense the practitioners are converging on, is the constraint layer that runs in parallel with the agent and intervenes before the action commits. That layer does not exist in any of the major agent platforms today. Several teams are now reasoning about what it would have to contain.
The Layer That Was Built to Catch This
The May 1 piece on this site, The Supervisory Signal Layer, made the structural case that the four hyperscaler agent platforms shipped in April 2026 are Sheridan supervisory control ported to inference. Microsoft, AWS, OpenAI, and Google converged independently on identical architecture: cryptographic identity per agent, central registry, protocol-aware gateway, runtime memory, observability, simulation, evaluation. The piece argued that the hyperscalers had built the chassis but the sensors inside it were not validated. The April arXiv cluster on RAGEN-2 template collapse, contextual entrainment, and overthinking measured a stack of failure modes that the gateway-level metrics those platforms compute cannot detect.
This piece is the other half of that argument. May 1 named the layer. May 4 names the failure mode the layer exists to catch.
The routing failure is the failure mode. It is the in-model version of the same problem the supervisory signal layer was built to solve at the platform level: capability is present in the system, the system fails to route that capability into the action it commits, and the only sensor surface available externally is the output of an action whose internal cause is invisible. The hyperscalers built the runtime-assurance (RTA) layer without validated sensors because the sensors that catch this failure mode do not exist yet in the open observability stack. The chassis shipped before the sensors were validated, and the May 1 piece argued the gap between chassis and sensors is where the next category lives.
The May 4 cluster is what fills in the sensor side of that argument. The Panda et al. paper gives a measured trajectory-integrity signal: under-execution rate per step, lookback depth cost, procedural-faithfulness score on synthetic algorithm traces. The corrupt-successes paper gives a measured trajectory-grading methodology that distinguishes won-with-violation from won-clean. The practitioner posts give bottom-up vocabulary for what those signals look like in a production debugging surface. None of these are platform features yet. All of them are now available primitives a builder can implement.
The shape of the supervisory signal layer is becoming concrete: trajectory-level instrumentation that scores procedural integrity in parallel with the agent, fires a halt signal before a procedure-violating commit, and exposes the trace to the operator at a granularity output evaluation cannot. The hyperscaler platforms ship a place to put this layer. The layer itself is open.
What Changes for Builders Shipping Agents This Quarter
If you are shipping agents in Q2 2026, four concrete moves follow from the routing failure being a real and measurable phenomenon rather than a take.
Treat tau-bench wins and equivalent agent-benchmark numbers as suspect by default. The corrupt-successes paper is not a one-off finding. It is a structural critique of any benchmark that grades trajectories on final-answer correctness, and tau-bench is one of several benchmarks in that category. When a vendor card cites a tau-bench number, the operationally honest reading of that number is to apply a 27-to-78 percent corrupt-success haircut to it. That gets you to the lower bound of trajectory-clean performance, which is closer to what production deployment will see.
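Applied as arithmetic, the haircut is a band, not a point estimate. A small sketch using the paper's bounds as defaults:

```python
# Sketch: discount a reported benchmark score by the measured corrupt-success range.
# Returns the band of trajectory-clean performance implied by the reported number.
def clean_success_band(reported: float, corrupt_low: float = 0.27, corrupt_high: float = 0.78):
    return (reported * (1 - corrupt_high), reported * (1 - corrupt_low))

# A vendor card citing 0.70 on tau-bench implies clean performance somewhere in:
print(clean_success_band(0.70))  # (0.154, 0.511)
```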
Instrument the trace, not the output. Per-turn telemetry on reasoning-trace tokens, tool-call sequences, intermediate-state hashes, and procedural-step attribution is the signal surface where the failure mode is visible. Output-quality dashboards are blind to silent procedure abandonment by construction. The attention-deficits paper measures this directly: a model that produces 2 percent output accuracy on a question whose answer is 74-percent recoverable from its residual stream is a model whose internal state is the only place that failure shows up. Production instrumentation does not require model-internals access, though. It requires logging trajectory data the harness already has access to and computing integrity scores on it.
Implement pre-execution governance, not post-hoc observability. The Amazon production-region deletion is the limit case of the broader pattern. By the time the action has committed, observability is forensics. Pre-execution constraints, meaning a separate verifier that runs before any agent-issued tool call and can refuse the call before it executes, are the operational instantiation of what control theory calls a runtime monitor with intervention authority. The supervisory signal layer in the May 1 piece is a platform-level version of this. A builder shipping agents this quarter can implement the per-application version with a few hundred lines of harness code and a constraint registry.
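The per-application version is small enough to sketch. Assuming every tool call already passes through one choke point in the harness, a constraint registry and a refusal path look roughly like this; the registry contents and call shape are illustrative, not any platform's API:

```python
# Sketch of pre-execution governance: a constraint registry consulted before any
# tool call commits. The verifier runs outside the model and can refuse the call.
from typing import Any, Callable

class CallRefused(Exception):
    pass

# Registry: tool name -> predicates over the proposed arguments. True means allow.
CONSTRAINTS: dict[str, list[Callable[[dict[str, Any]], bool]]] = {
    "delete_resource": [
        lambda args: args.get("environment") != "production",  # no destructive writes to prod
        lambda args: bool(args.get("change_ticket")),           # require an approved ticket
    ],
}

def governed_call(tool_name: str, tool_fn: Callable[..., Any], **args: Any) -> Any:
    for check in CONSTRAINTS.get(tool_name, []):
        if not check(args):
            # Refusal happens here, before the action commits; anything after this
            # point is forensics, not governance.
            raise CallRefused(f"{tool_name} blocked by pre-execution constraint")
    return tool_fn(**args)
```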
Ask vendors what their procedural-integrity signal is, not what their accuracy number is. When a vendor pitches an agent platform, the question that distinguishes a sensor from a chassis is what they measure beyond outputs. Trajectory hash. Tool-call attribution. Procedural-step faithfulness. Halt signals. Intervention authority. Most of the major platforms do not yet have answers to these questions. The ones that develop answers first will be the ones whose procurement story holds up against the 88-percent pilot-to-prod number.
The Forward Claim
The competitive edge in agent infrastructure in 2026 is not the next model. It is whoever instruments the gap between what the residual stream knows and what the output uses. The capability ceiling is high enough that capability is no longer the rate-limiting input. The reliability layer is not built yet, the sensors that would constitute it are now measurable, and the team that ships the layer first owns it.
The investment thesis follows: the next category of company in this space is the trajectory-integrity layer for agent infrastructure. The hyperscalers built the chassis. The papers measured the sensor signals. The practitioners derived the vocabulary from incident debugging. The pieces are sitting on the table, waiting to be assembled.
Citations and Sources
arXiv Papers
- "Attention Deficits in Language Models: Causal Explanations for Procedural Hallucinations." arXiv:2602.19239. February 2026. (74-percent linear-probe recovery on the residual stream against 2-percent model output; the cleanest measurement of "present but not used.")
- Panda, Kadasi, Upperwal, Singh. "When LLMs Stop Following Steps." arXiv:2605.00817. May 1, 2026. (14 models, 55 datasets; first-answer accuracy 61 to 20 percent as steps scale 5 to 95; under-execution as dominant failure; lookback depth 1-to-7 costs additional 18.43 percentage points.)
- "Beyond Task Completion: Revealing Corrupt Success in LLM Agents." arXiv:2603.03116. (27 to 78 percent of reported tau-bench wins are corrupt successes concealed by final-answer pass.)
- "Empirical Evidence of Complexity-Induced Limits." arXiv:2604.13371. (Phase-transition collapse exceeding 50 percent accuracy drop past task-specific complexity thresholds.)
- "Dissecting Failure Dynamics in LLM Reasoning." arXiv:2604.14528. (Errors originate at early-transition entropy spikes and propagate through coherent-looking but globally wrong reasoning.)
- "When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling." arXiv:2604.10739. (Extended reasoning associated with abandoning previously correct answers.)
- "On the Implicit Stopping Signal in Reasoning Models." arXiv:2602.08354. (LRMs implicitly know when to stop; capability obscured by current sampling stack.)
Practitioner Posts
- christiannonis on X, May 2, 2026. (88-percent pilot-to-prod failure; 73 percent without pre-launch success metrics; 80 percent of Q1 2026 enterprise software embeds an agent; 31 percent in production; named Amazon production-region deletion incident; "governance happens before the API call or it does not happen.")
- GG_Observatory on X, May 1 through May 3, 2026. ("Shadow trace per tool call." "Exceptions at the eval boundary." "Sourcemaps for agent-generated code.")
- Clawd_God on X, May 2 through May 3, 2026. ("Inspectable-autonomy handoff loop." Every agent returns evidence, assumptions, confidence, next action.)
Prior Signal in This Series
- Diamond, Beau. "The Supervisory Signal Layer: Why Every Hyperscaler Just Shipped the Same Thing." beaudiamond.ai/signal/supervisory-signal-layer. May 1, 2026.
