On drift, the poverty of observation, and why two systems that agree today carry no promise about tomorrow
"The map is not the territory."
— Alfred Korzybski, Science and Sanity, 1933
"It is a capital mistake to theorise before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts."
— Arthur Conan Doyle, A Scandal in Bohemia, 1891
There is a class of engineering failure that has no perpetrator. The systems involved are correctly implemented. The data flowing through them is uncorrupted. The engineers who built them acted in good faith and tested their work. And yet, at some point — sometimes months after deployment, sometimes years, sometimes on the occasion of a new operation being introduced that nobody thought to treat as a test — the systems that were supposed to agree stop agreeing, and the reconciliation process that was supposed to catch any discrepancy reports a mismatch whose origin nobody can locate. The audit begins. The logs are examined. No error is found, because there is no error to find. What has occurred is not a failure of implementation. It is a consequence of structure — specifically, a consequence of what it means to observe a computational system rather than to be inside one.
This class of failure goes by the name drift, and it is treated, almost universally, as a symptom of something that went wrong. The assumption is that drift has a cause: a misconfigured parameter, a rounding difference, a race condition, a data transformation that one system applies and another does not. The engineering response is to search for this cause — to instrument more, to log more, to reconcile more frequently, to write more detailed specifications of what each system is supposed to do. These responses are not useless. But they are, in an important sense, misdirected. They treat drift as an anomaly that better engineering would prevent. The argument of this essay is that drift, in systems of any realistic complexity, is not an anomaly. It is a structural certainty — the inevitable consequence of a property that every non-trivial computational system possesses, and that no amount of additional instrumentation, specification, or reconciliation can eliminate. The property in question is this: when you observe a computational system, you see less than is there.
To understand what this means requires being precise about what observation is. When two systems are compared — two financial ledgers, two instances of a distributed service, two deployments of a data pipeline — the comparison is never between the systems themselves. It is between the outputs the systems produce: the reports they generate, the API responses they return, the database rows they write, the totals they compute. These outputs are representations of the system's internal state, not the state itself. The internal state of a realistic financial system — the full history of every transaction, the accumulated effect of every rounding decision, the precise sequence of operations that produced each balance — is not directly visible from its outputs. The outputs are a projection of that state: a shadow cast by a high-dimensional object onto a lower-dimensional surface. The shadow is what the comparison sees.
A projection discards information. This is not a criticism; it is the definition of a projection. A report that shows a total balance discards information about the individual transactions that composed it. An API response that returns a customer's account status discards information about the internal sequence of state transitions that produced that status. A reconciliation total that matches between two systems discards information about how each system arrived at the matching number. The information is not wrong; it is simply absent. And the information that is absent is not random. It is precisely the information about how the system got to where it is — the semantic history, the internal path, the structure of the computation — rather than merely where it has arrived.
Two systems whose projected outputs match are not thereby equivalent. They are equivalent in their projections. They may be equivalent in their internal states as well — but this cannot be established by looking at the projections. There may be distinctions between the systems that the projection cannot see, distinctions that are real and consequential, that exist in the only place where computation actually happens — the interior of the system — and that the observation function, by construction, reports as identical. The reconciliation process passes. The audit signs off. The two systems are declared equivalent. And somewhere inside each of them, a difference persists that the declaration cannot reach.
This might seem like an abstract concern — a philosopher's worry rather than an engineer's problem — if it were not for the fact that systems are not static. Systems receive new data. They are extended with new operations. They are composed with other systems. They process inputs that were not anticipated when the original comparison was made. And it is precisely under further composition — under the application of new operations to states that were previously declared equivalent — that the hidden differences reveal themselves.
The mechanism is straightforward enough to state simply, and deep enough that its implications are not immediately obvious. Suppose two systems have arrived at internal states that differ in some respect invisible to observation — call them state A and state B, both of which produce the same output when the observation function is applied. Now apply a new operation to both systems — a new report, a new calculation, a new data transformation. The new operation acts on the internal state, not on the observation of the internal state. If the operation is sufficiently discriminating — if it is sensitive to the distinction between A and B that the observation function collapsed — the two systems will produce different outputs from the new operation. The hidden difference has been revealed. Drift has occurred. But it has not occurred because anything went wrong between the point of original agreement and the point of revealed divergence. It occurred because the original agreement was an agreement between projections, not between states, and the new operation looked at the states.
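The mechanism can be made concrete in a few lines. The following is a minimal sketch with invented states and an invented operation, not code from any of the systems discussed: two states that the observation function collapses to the same output, and a later operation that acts on the state itself and separates them.

```python
# Two internal states that differ in their semantic history: the same
# balance reached via different transaction sequences.
state_a = {"balance": 100.0, "history": [60.0, 40.0]}
state_b = {"balance": 100.0, "history": [100.0]}

def observe(state):
    """The observation function: a projection that reports only the balance."""
    return state["balance"]

# The reconciliation sees only the projections, which match.
assert observe(state_a) == observe(state_b)

def largest_single_transaction(state):
    """A new operation, introduced later, that acts on the internal state
    rather than on the observation of the internal state."""
    return max(state["history"])

# The new operation is sensitive to the distinction the projection collapsed,
# so the "equivalent" systems now disagree: drift has been revealed.
assert largest_single_transaction(state_a) != largest_single_transaction(state_b)
```

Nothing changed between the two assertions; the second operation simply looked at what the first could not see.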
This is the central structural fact. Call it what you will — the non-commutativity of observation and composition, the irreversibility of information loss, the latency of divergence. Its practical consequence is that drift cannot be prevented by ensuring that two systems agree at any given moment. Agreement at a moment is a statement about projections. The question that matters for future behaviour is whether the systems agree in their internal states — and this question cannot be answered by looking at the projections. The projection has already discarded the answer.
Financial systems are the clearest examples of this structure in practice, partly because their outputs are precisely defined — a balance is a number, a reconciliation total is a number, a mismatch is detectable to the penny — and partly because they are among the most heavily audited systems in existence, which means that the failure of observation-based equivalence is documented with unusual precision. Two ledger systems may process the same sequence of transactions and produce the same end-of-day balances for years. Then a new regulatory report is introduced, or an intraday calculation is required, or a currency conversion is computed in a context that had not previously arisen, and the systems disagree. The natural response is to look for the change that introduced the discrepancy. But there is no change. Both systems implemented the new operation correctly, according to their own specifications. The discrepancy arose because the two systems had been computing slightly different things all along — rounding at different points in a multi-step calculation, applying rules in subtly different orders, representing intermediate values with different precision — and the daily reconciliation had been reporting agreement because it was looking at daily totals, which were the same, rather than at intermediate states, which were not.1
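A toy reconstruction of this scenario, with invented figures: two systems apply the same rule but round at different points in a two-step calculation, so the end-of-day figure agrees while the intermediate state does not.

```python
from decimal import Decimal, ROUND_HALF_UP

def round2(x):
    """Round to two decimal places, half up."""
    return x.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

rate = Decimal("2.675")

# System A multiplies first, then rounds the intermediate value.
intermediate_a = round2(rate * 2)      # 5.35
# System B rounds the rate first, then multiplies.
intermediate_b = round2(rate) * 2      # 5.36

# The daily reconciliation compares only the final figure, which matches.
final_a = round2(intermediate_a / 2)   # 2.68
final_b = round2(intermediate_b / 2)   # 2.68
assert final_a == final_b

# A new intraday report reads the intermediate value, and the systems
# disagree -- not because anything changed, but because they always differed.
assert intermediate_a != intermediate_b
```

The daily total was never wrong in either system; it was merely a projection under which two different computations looked identical.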
The engineers who encounter this situation and search for the bug are not wrong to search. They are wrong in what they expect to find. The bug, if there is one, is not in the new operation that revealed the discrepancy. It is not in either system's implementation of any particular rule. It is in the assumption that matching outputs imply matching internal states — an assumption that the reconciliation architecture made implicit and that turned out, as it always eventually does, to be false.
Distributed services exhibit the same structure at a different scale. A service that is deployed across multiple regions, or that is in the process of being migrated from one implementation to another, will typically be validated by running both versions in parallel and comparing their outputs. If the outputs match for a representative set of inputs, the new version is declared equivalent to the old one. The comparison is between outputs — response payloads, status codes, latency distributions. The internal state of the service — its cache contents, its interpretation of ambiguous inputs, its behaviour at the edges of its specification — is not directly compared, because it cannot be directly compared: it is internal, and the comparison has access only to what is external. The new version is deployed. At some point, an input arrives for which the two implementations interpret an ambiguous specification differently. Drift is observed. The engineers discover that the original validation was, in the strictest sense, insufficient — that it established equivalence of outputs for the tested inputs, which is not the same as equivalence of behaviour for all possible inputs, which is not the same as equivalence of internal state, which is the only equivalence that actually matters for future composition.2
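The gap between the shadow-validation window and later traffic can be sketched as follows; both implementations and the ambiguous input are hypothetical.

```python
def status_v1(payload: dict) -> str:
    # v1 treats a missing field as inactive.
    return "active" if payload.get("active") else "inactive"

def status_v2(payload: dict) -> str:
    # v2 treats a missing field as an error case -- a different reading of an
    # ambiguous specification, invisible on well-formed traffic.
    if "active" not in payload:
        return "unknown"
    return "active" if payload["active"] else "inactive"

# Shadow traffic during the validation window: well-formed payloads only.
shadow_traffic = [{"active": True}, {"active": False}]
assert all(status_v1(p) == status_v2(p) for p in shadow_traffic)

# Later, a payload arrives from a code path the shadow period never exercised.
assert status_v1({}) != status_v2({})
```

The shadow comparison was correct about everything it tested; the divergence lived entirely in the inputs it did not.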
The implications of this analysis ramify in directions that are, taken together, uncomfortable for the standard assumptions of systems engineering. The first and most immediate is that reconciliation — the practice of comparing system outputs periodically to detect divergence — is not a solution to drift. It is a delayed notification of drift that has already occurred. By the time a reconciliation mismatch is detected, the systems have already diverged in their internal states; the mismatch is the moment at which a new operation happened to make the divergence visible. The divergence itself may have been accumulating for arbitrarily long before detection. Increasing the frequency of reconciliation narrows the window between divergence and detection; it does not prevent the divergence. The only thing that could prevent the divergence is ensuring that the systems agree not merely in their outputs but in their internal states — and observational reconciliation, by its nature, cannot establish this.
The second implication is stranger and more counterintuitive: improving the abstraction of a system interface — making an API cleaner, reducing the surface area of what a system exposes — tends to make drift worse, not better. This is the opposite of what the standard engineering intuition suggests. Clean APIs are held to be good engineering practice: they hide implementation details, reduce coupling, and allow implementations to evolve independently. All of these properties are genuine. But each of them is also, from the perspective of drift, a mechanism for increasing the gap between what observation can see and what internal states contain. A cleaner API exposes less information about the internal state. Less information about the internal state means more distinct states that produce the same observable output. More states that produce the same observable output means more hidden divergence is possible while observational equivalence is maintained. The very features that make a clean API good for system design are the features that make it worse for drift detection. The architecture that an engineer makes cleaner in the morning has, by that afternoon, increased the probability that two deployments of the system will disagree in ways that the monitoring infrastructure cannot see.3
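The relationship between interface coarseness and hidden divergence can be counted directly. In this sketch, with invented states and accessor names, the cleaner interface collapses every internal state to a single output, while a more detailed one leaves smaller equivalence classes (though even it cannot see the internal code path).

```python
# Internal states: every way of splitting a balance of 3 units across two
# accounts, with a flag recording which internal code path produced it.
states = [
    {"acct1": a, "acct2": 3 - a, "path": p}
    for a in range(4)
    for p in ("fast", "slow")
]

def clean_api(s):
    """The cleaner interface: exposes only the total."""
    return s["acct1"] + s["acct2"]

def detailed_api(s):
    """A less abstract interface: exposes per-account balances as well."""
    return (s["acct1"], s["acct2"])

def preimage_sizes(observe):
    """How many internal states collapse to each observable output."""
    sizes = {}
    for s in states:
        out = observe(s)
        sizes[out] = sizes.get(out, 0) + 1
    return sizes

# The clean API maps all 8 states to one output; the detailed API
# distinguishes four outputs, each still hiding the "path" flag.
assert preimage_sizes(clean_api) == {3: 8}
assert all(n == 2 for n in preimage_sizes(detailed_api).values())
```

Every state inside a preimage is free to drift from every other state in that preimage without the interface noticing.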
The third implication is perhaps the most disorienting: changing what you observe can create drift that did not previously exist. This seems paradoxical — how can improving instrumentation introduce a problem rather than reveal one? But the mechanism is straightforward. The question of whether two systems are drifting is relative to a definition of equivalence, and the definition of equivalence is constituted by the observation function — by what is being compared. If the observation function changes, the equivalence relation changes. States that were previously considered equivalent — because the old observation function could not distinguish them — may now be considered divergent, because the new observation function can. The systems have not changed. The data has not changed. The instrumentation has improved. And the improvement has revealed that what was previously called agreement was an artefact of insufficient observability rather than genuine equivalence. The drift did not begin when the instrumentation changed. It was always there. The instrumentation change reclassified it from invisible to visible, which in the accounting of a monitoring system looks identical to the drift having been introduced by the change. Engineers who instrument more carefully and then observe more mismatches have not caused the mismatches. They have simply moved the detection threshold to a point where the existing mismatches become countable.4
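A minimal sketch of this reclassification, with invented state fields: the same pair of unchanged states counts as agreement under the old observation function and as drift under the new one.

```python
state_a = {"balance": 100, "pending": 2}
state_b = {"balance": 100, "pending": 5}

def old_observe(s):
    # The coarse instrumentation in place before the change.
    return s["balance"]

def new_observe(s):
    # The improved instrumentation, which also reports pending items.
    return (s["balance"], s["pending"])

# Neither system changed between the two measurements; only the
# equivalence relation defined by the observation function did.
assert old_observe(state_a) == old_observe(state_b)
assert new_observe(state_a) != new_observe(state_b)
```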
There is a further consequence that applies specifically to machine-learning pipelines, and that is worth separating from the financial and distributed-service cases because its structure is slightly different. A trained model is the output of an observation function applied to the state of a training pipeline. The internal state of the training pipeline — the specific sequence of gradient updates, the order in which training examples were presented, the random seed used for initialisation, the numerical representation used for intermediate computations — is not captured in the model weights. The model weights are the projection: a high-dimensional summary of a process whose details have been discarded. Two training runs that produce models with similar validation metrics have produced models that agree in their projections. The internal states of the two training runs — and therefore the specific decision boundaries, the specific failure modes, the specific generalisation behaviour on inputs that were not in the validation set — may differ in ways that the validation metrics cannot see.
When a model is retrained, compared across hardware configurations, or used in a pipeline that introduces new downstream operations, the hidden differences can manifest: models that are observationally equivalent on the validation distribution may nonetheless be semantically non-equivalent. Two model versions that passed A/B testing — because the A/B test was an observation function that could not distinguish their internal character — may nonetheless behave differently on a new class of inputs, or in a new deployment environment, or when composed with a new feature extraction stage. The machine learning practitioner who treats model evaluation metrics as establishing model equivalence has made the same error as the financial engineer who treats reconciliation totals as establishing ledger equivalence. The metric is a projection. The projection discards information. The discarded information is exactly the information that future composition will depend on.5
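The same trap can be shown with hand-built stand-in "models" rather than a real training framework, so the example stays deterministic: two classifiers with identical validation metrics and different decision boundaries.

```python
def model_a(x):
    # A learned boundary that excludes zero.
    return 1 if x > 0 else 0

def model_b(x):
    # A different learned boundary that includes zero.
    return 1 if x >= 0 else 0

def accuracy(model, xs, labels):
    """The observation function: a validation metric."""
    return sum(model(x) == y for x, y in zip(xs, labels)) / len(xs)

# The validation set never probes the point where the boundaries differ.
validation_set = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

# Observationally equivalent: identical validation metrics.
assert accuracy(model_a, validation_set, labels) == 1.0
assert accuracy(model_b, validation_set, labels) == 1.0

# Semantically distinct: an input outside the validation distribution
# separates the two "equivalent" models.
assert model_a(0.0) != model_b(0.0)
```

Both models are "perfect" under the metric; the metric simply never asked the question on which they disagree.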
What should be done with this analysis? The answer is not to despair of observation or to conclude that complex systems cannot be managed. The analysis does not say that observation is useless; it says that observation is lossy, and that lossy observation establishes a weaker equivalence than the one that matters for future behaviour. The appropriate response is to be honest about this — to design systems and their governance in ways that acknowledge what observation can and cannot establish, rather than ways that assume it establishes more than it does.
The practical implication for system design is that the choice of what to observe is a design decision with consequences, not merely a monitoring convenience. An observation function that captures more of the internal state — more intermediate values, more of the semantic history of a computation, more of the path rather than merely the destination — reduces the gap between observational equivalence and semantic equivalence. It does not close the gap entirely, but it narrows it. The cost is complexity and data volume; the benefit is that reconciliation failures become more informative and drift becomes more detectable earlier. Systems that are intended to be compared, audited, or reconciled with each other should be designed to expose intermediate states as first-class outputs, not as debugging information appended as an afterthought to an architecture that was designed to hide its working.6
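One way to sketch this recommendation is an event-sourced ledger that exposes its history as a first-class output, so that reconciliation can compare paths rather than only destinations. The class below is illustrative, not a reference to any real system.

```python
class EventSourcedLedger:
    def __init__(self):
        self.events = []  # the semantic history, kept as the primary record

    def post(self, account, amount):
        self.events.append((account, amount))

    def balance(self):
        # Current state is derived from the history, not stored independently.
        return sum(amount for _, amount in self.events)

    def observe(self):
        # The observation exposes the path as well as the destination.
        return {"balance": self.balance(), "events": list(self.events)}

a, b = EventSourcedLedger(), EventSourcedLedger()
a.post("x", 60)
a.post("x", 40)
b.post("x", 100)

# A totals-only reconciliation would pass; a history-aware one does not.
assert a.balance() == b.balance()
assert a.observe() != b.observe()
```

The cost is exactly the one named above — more data retained and exposed — in exchange for an observation function whose equivalence is closer to the equivalence that matters.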
The practical implication for reconciliation processes is that a reconciliation should be explicit about what it is establishing. A reconciliation that compares daily totals is establishing that daily totals match. It is not establishing that the systems are semantically equivalent. It is not establishing that the systems will agree under further composition. It is establishing that, for the particular projection defined by the daily total calculation, the two systems currently produce the same number. This is useful information. It is not the information that is usually assumed. Systems and audit processes that confuse these — that treat agreement on a particular projection as evidence of semantic equivalence — will eventually encounter a new operation that distinguishes what the projection collapsed, and will experience that encounter as a surprising and inexplicable failure rather than as the predictable consequence it is.
There is a philosophical dimension to this analysis that is worth naming even if it cannot be resolved. The problem of observing a system without fully capturing it is not a problem unique to software. It appears in physics as the measurement problem — the question of what happens to the full state of a system when a measurement extracts only partial information from it. It appears in statistics as the problem of sufficient statistics — the question of which summaries of data preserve all the information relevant to a particular inference and which discard information that may later be needed. It appears in epistemology as the problem of underdetermination — the question of how many distinct theories are consistent with a given set of observations, and what grounds there are for choosing among them. In all of these domains, the structure is the same: a rich, high-dimensional state gives rise to a lower-dimensional observation, and the observation does not uniquely identify the state. Different states that look the same from outside behave differently under operations that the observation does not capture.
The specifically computational form of this problem has the additional feature that the systems in question are composed — built from sequences of operations, each of which takes a state as input and produces a new state as output, with the composition of these operations potentially running to many millions of steps in a realistic pipeline. At each step, the state may be richer than anything the observation function will report. At each step, distinctions may be created that the observation cannot see but that later operations will depend on. The accumulated potential for hidden divergence grows with the length and complexity of the pipeline, which means that the largest and most complex systems are precisely the ones for which observational equivalence is least reliable as a guide to semantic equivalence — and are also, inevitably, the ones for which direct inspection of internal states is most impractical and observation-based comparison is most relied upon.
The systems that are hardest to observe accurately are the systems that most need to be observed accurately. This is not a paradox that engineering can resolve. It is a consequence of the structure of composed computational systems under observation, and it will remain true regardless of how much the monitoring infrastructure improves. Improving the monitoring narrows the gap. It does not close it. The gap is the space between the map and the territory — the space where drift lives, and has always lived, and will continue to live as long as systems are understood through their outputs rather than through themselves.
1. The specific failure mode of financial reconciliation under hidden intermediate-state divergence is documented extensively in the risk management literature under the heading of "model risk" — the risk that two implementations of the same specification compute sufficiently different intermediate values that their outputs diverge under certain market conditions. The Basel Committee on Banking Supervision's guidelines on model risk management, and the parallel guidance from the UK's Prudential Regulation Authority, both acknowledge that validation through output comparison is insufficient for establishing model equivalence and require evidence of convergence in methodology — which is a requirement for reducing the gap between observational and semantic equivalence. The practical difficulty of meeting this requirement for large, complex models is a standing challenge in the industry.
2. Shadow deployment — running an old and new version of a service in parallel and comparing outputs — is a standard practice in service migration and has been described by practitioners at Google, Netflix, Twitter, and other large-scale service operators. The limitations of shadow deployment as a basis for declaring equivalence are acknowledged in engineering literature on the practice: shadow testing establishes equivalence on the traffic distribution present during the shadow period, which is not guaranteed to cover all possible inputs, and does not establish equivalence of internal state, which matters for inputs not yet encountered. The gap between "agrees on tested inputs" and "agrees on all inputs" is the gap that drift exploits.
3. The principle that cleaner abstractions create larger equivalence classes — and therefore more room for hidden divergence — is related to the general trade-off in information theory between compression and discrimination. A lossy compression scheme that achieves a high compression ratio necessarily collapses many distinct inputs to the same compressed output; a lossless compression scheme that preserves discrimination must be less compact. The application to software architecture is that the interface design choices that achieve good encapsulation — hiding implementation details behind clean APIs — are precisely the choices that reduce the discriminating power of the observation function. This does not make clean APIs bad engineering; it makes the trade-off explicit.
4. The phenomenon of instrumentation changes creating apparent drift is well-known in distributed systems operations, where improved observability tooling frequently reveals latent disagreements between service replicas that had not previously been visible. The operational difficulty is distinguishing between drift that existed before the instrumentation change (and has been revealed by it) and drift that was introduced by the change itself — for example, by the instrumentation adding overhead that changes timing behaviour. The formal structure described here — that changing the observation function changes the equivalence relation — provides a principled account of why this distinction is difficult: the new instrumentation is a different observation function, and therefore establishes a different notion of equivalence, making before-and-after comparisons of "how much drift exists" formally ill-posed.
5. The problem of two machine learning models that agree on a validation distribution but behave differently on out-of-distribution inputs is discussed in the literature on distributional shift and dataset shift. The analysis here provides a complementary perspective: the validation metric is an observation function, and the equivalence it establishes is relative to that function. Two models that are equivalent under the validation metric may differ in the internal representations they have learned — the features they attend to, the decision boundaries they have drawn, the implicit assumptions they have made about the input distribution — and those differences will manifest when the models are composed with new pipelines, fine-tuned on new data, or applied to inputs outside the validation distribution. The standard practice of comparing models by validation metrics establishes observational equivalence, not semantic equivalence, and the distinction matters precisely when the models are used in contexts that the validation set did not anticipate.
6. The idea of exposing intermediate computational states as first-class outputs — rather than treating them as implementation details hidden behind a clean interface — is related to the concept of event sourcing in software architecture, in which the full history of state-changing events is preserved as the primary record, with current state derived from the history rather than stored independently. Event-sourced systems have higher storage costs and greater complexity in their read models, but they preserve the semantic history that observation-only architectures discard. The trade-off between storage cost and semantic fidelity is precisely the trade-off between the discriminating power of the observation function and the compactness of what it records. Systems designed for contexts where future reconciliation and audit are important — financial systems, compliance-regulated platforms, safety-critical data pipelines — have independent operational reasons to prefer event sourcing over state-only storage; the analysis here provides a theoretical grounding for why this preference is well-founded.