Essay · Software & Ideas

The Mutual
Deception

On the coevolution of C and x86-64, the billion-transistor fiction at the heart of modern computing, and why the compiler was always smarter than the programmer

By Anonymous

"A programming language is low level when its programs require attention to the irrelevant."

— Alan Perlis, Epigrams on Programming, 1982

"The features that led to Meltdown and Spectre were added to let C programmers continue to believe they were programming in a low-level language, when this hasn't been the case for decades."

— David Chisnall, C Is Not a Low-Level Language, ACM Queue, 2018

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

— Donald Knuth, 1974

There is a belief, held with the fervour of religious conviction by a certain kind of systems programmer, that C is close to the metal — that writing C is tantamount to writing the instructions that the machine actually executes, and that any other language interposes an inefficiency between the programmer's intention and the processor's action. This belief is false. It was arguably false in 1978. It is certainly false now. What is true, and more interesting, is that the belief was so widespread and so commercially consequential that the processor industry spent fifty years and billions of transistors making it appear true — constructing, inside each successive generation of Intel silicon, an enormously complex machine whose sole purpose was to maintain the illusion that the simple sequential model that C programmers carry in their heads bore some relationship to what the hardware was actually doing.

The story of C and x86 is, at its most accurate, a story about mutual deception: a language and an architecture that fed each other's myths, each pretending to be what the other needed it to be, growing more elaborate in their pretence with each passing decade, until the pretence itself became the source of the most significant security vulnerabilities in the history of computing. It is also, looked at from a different angle, one of the most successful co-evolutionary partnerships in the history of technology — two organisms that shaped each other so thoroughly that neither can be understood without the other, and whose combined dominance of the computing landscape has persisted for half a century despite, or perhaps because of, being founded on a fiction.

· · ·

To understand what C actually is requires going back to what it was designed to be, which is something considerably more modest than its mythology suggests. C was designed by Dennis Ritchie at Bell Labs between 1969 and 1973, as a language for writing the Unix operating system. Its predecessor, B, was itself derived from BCPL, and the lineage is important: these were not languages designed to model computation in the abstract or to impose theoretical rigour on programming practice. They were practical tools for getting an operating system written on machines with very limited resources — the PDP-7, and then the PDP-11, whose architecture would leave its fingerprints on C in ways that only became visible decades later, when the architecture the fingerprints came from had long since ceased to exist.1

The PDP-11 was a 16-bit minicomputer with a flat address space, a small set of general-purpose registers, and a memory model in which arrays and pointers were essentially the same thing. C's abstract machine — the model of computation that C programs implicitly assume — is a direct reflection of the PDP-11's architecture. Integers have sizes reflecting PDP-11 register widths. Pointers are integers, arithmetic on them is well-defined, and arrays decay to pointers because on the PDP-11 they were pointers. The stack grows downward because the PDP-11's stack grew downward. The calling convention — arguments pushed in a certain order, return value in a register, caller cleans the stack — reflects PDP-11 conventions. C is not an abstract language that happened to be implemented on the PDP-11. C is the PDP-11's programming model, given a syntax and a compiler.

This was not a criticism in 1973. The PDP-11 was the machine. Writing C was, genuinely, writing for the metal, because the metal was simple enough that the correspondence between source and execution was direct and legible. A C programmer in 1973 who wrote a pointer dereference could form a reasonably accurate picture of what the machine would do: fetch a value from a memory address. A loop in C corresponded to a loop in the instruction stream. A function call pushed arguments and jumped to a subroutine. The abstraction gap between the C source and the executed instructions was thin enough to see through.

C is not an abstract language that happened to be implemented on the PDP-11. C is the PDP-11's programming model, given a syntax and a compiler.
· · ·

What happened next is the story that the mythology obscures. The PDP-11 was superseded. The architecture that succeeded it in the commercial mainstream was not designed with C's abstract machine in mind — it was designed for other reasons, by engineers at Intel whose primary concern was compatibility with their own previous products, and it accumulated the features and compromises of each successive generation while dragging the full weight of everything that had come before. The 8086, released in 1978, was a 16-bit processor descended from the 8-bit 8080, itself descended from the 8008. Its register set was asymmetric, its addressing modes were irregular, and its segmented memory model — in which a 20-bit address was formed from a 16-bit segment and a 16-bit offset — was a baroque solution to the problem of addressing more than 64 kilobytes of memory without redesigning the register file. The 8086 was not a fast PDP-11. It was a fast 8080 that had been extended to 16 bits in a way that preserved 8080 programs while adding new capabilities that did not compose cleanly with the old ones.2

IBM's selection of the 8088 — the 8086 with an 8-bit external data bus, cheaper to manufacture — for the IBM Personal Computer in 1981 was a contingent commercial decision that locked the x86 architecture into the dominant position it has occupied ever since. The engineers at Intel did not know, in 1978, that they were designing the processor that would run the world's software for the next half century. But the IBM PC's success made the x86 instruction set the target that all subsequent processor development had to maintain compatibility with, and compatibility is a ratchet: once a commitment is made, it can be extended but never withdrawn. Every mode, every quirk, every irregularity of the 8086 exists in every x86 processor sold today, not because it is useful, but because programs compiled to assume it may still exist somewhere, and Intel will not be the one to break them.

The 386, in 1985, extended the architecture to 32 bits — flat address space, registers widened to 32 bits, the segmented memory model preserved but made optional by a protected flat mode that became the universal choice of operating system designers. The 386 was the x86 architecture that C programmers actually wanted: flat memory, word-sized integers, a calling convention that felt almost like the PDP-11 model they had inherited. Unix was ported to it. Linux was written for it. The ecosystem consolidated around the 32-bit x86 model, and for a decade the correspondence between C's abstract machine and the hardware's actual behaviour was close enough that the fiction of C as a low-level language was maintained at modest cost.

· · ·

The cost began to rise in the 1990s, as processor designers discovered that the way to make programs run faster was not to execute instructions faster but to execute more instructions simultaneously — and that the instructions in question had been compiled from C, a language sequential by construction, so the processor had to analyse, reorder, and parallelise them in real time without the program's knowledge. This is the hardware feature called out-of-order execution, and its significance for the C story cannot be overstated: it is the point at which the processor stopped being a machine that executed C's sequential abstract model and became a machine that secretly did something entirely different while maintaining the appearance of executing it.

An out-of-order processor does not execute instructions in the order the compiler placed them. It analyses a window of upcoming instructions — potentially hundreds of them in a modern processor — identifies which ones are independent of each other, issues them to execution units in whatever order permits the most parallelism, and presents the results to the program in the sequential order the program expects, using a buffer called the reorder buffer to hold completed results until they can be committed in program order. The program sees a sequential machine. The machine is doing something that bears no sequential correspondence to the program. The C programmer's mental model — pointer dereference fetches a value, the next instruction sees that value — is maintained by hardware that is performing a continuous real-time analysis of the instruction stream and scheduling its execution for maximum throughput, behind a carefully maintained façade of sequential simplicity.3

Branch prediction compounds this. Modern processors do not wait to find out whether a conditional branch is taken; they guess, using elaborate prediction tables built from the history of the branch's previous behaviour, and begin executing instructions down the predicted path before the condition has been evaluated. If the prediction is correct — and for well-structured code, modern predictors are correct more than ninety-five percent of the time — the pre-executed instructions are committed and execution continues with no visible stall. If the prediction is wrong, the speculatively executed instructions are discarded, the processor restores its state to the point of the branch, and execution resumes down the correct path. From the program's perspective, the branch behaved correctly. From the hardware's perspective, instructions that were never supposed to execute were executed anyway and then undone — or almost undone. The side effects of speculative execution on the processor's internal state, including the cache, are not always fully undone. And this is precisely where Spectre and Meltdown live.

The C programmer's mental model is maintained by hardware performing a continuous real-time analysis of the instruction stream behind a carefully maintained façade of sequential simplicity.
· · ·

David Chisnall, a researcher at the University of Cambridge, made this argument precisely and publicly in a 2018 paper in ACM Queue titled "C Is Not a Low-Level Language," whose subtitle — "Your computer is not a fast PDP-11" — is the most efficient possible summary of fifty years of architectural history. Chisnall's thesis is that the Meltdown and Spectre vulnerabilities, disclosed in early 2018 and affecting virtually every modern processor, were not accidents or oversights. They were the direct consequence of building processors that had to appear to execute the C abstract machine — sequential, predictable, pointer-visible — while actually executing something radically different for performance reasons. The speculative execution that created the attack surface for Spectre was not a mistake. It was an engineering requirement: without it, the processor could not run C code fast enough to compete. The fiction that C was close to the hardware required the hardware to perform enormous amounts of hidden work, and the hidden work created the vulnerability.4

Chisnall's paper is the most pointed statement of an argument that processor architects had been quietly aware of for years: that the x86 instruction set, as an interface between software and hardware, had become an elaborate theatrical performance. The actual computation on a modern x86 processor does not happen in the form of x86 instructions. The x86 instructions — mov, add, cmp, jmp and their hundreds of companions — are decoded by the processor's front end into a different, internal instruction format: micro-operations, or µops, which are the actual instructions that the processor's execution units act on. The x86 instruction set is, in this sense, a scripting language: a human-readable notation that the processor interprets and translates into its own internal language before doing any real work. The interpreter is implemented in silicon rather than software, but the structure is the same. The source text is the x86 assembly. The bytecode is the µop stream. The runtime is the out-of-order execution engine.

· · ·

This observation — that x86 assembly is more scripting language than machine language — has consequences that the industry has been slow to absorb, because they undermine the C programmer's foundational self-image. If x86 assembly is an interpreted language with its own runtime optimiser, then writing C "close to the metal" is not writing for a machine; it is writing for an interpreter whose optimisations are not under the programmer's control and whose behaviour the programmer cannot directly observe. The gap between what the C programmer writes and what the silicon does is not the gap of a compiler optimisation pass. It is the gap between a script and the engine that runs it.

AMD's introduction of x86-64 in 2003 — the 64-bit extension to the x86 architecture, adopted by Intel under the name Intel 64 — is the point at which the scripting-language nature of x86 assembly became most apparent. x86-64 is not a new instruction set. It is the old instruction set, extended: the existing registers widened to 64 bits, eight new general-purpose registers added (bringing the total to sixteen, compared to eight in 32-bit mode), a new calling convention defined that passes arguments in registers rather than on the stack, and the flat 64-bit address space made the default mode of operation. The extensions were designed by AMD's architects to be backwards compatible with 32-bit x86 code, which meant they had to preserve all the quirks and asymmetries of the 32-bit architecture while adding enough new capability to make 64-bit programming tractable. The result is an instruction set of baroque complexity — over a thousand distinct instructions in the base set, with SIMD extensions adding hundreds more — that no human programmer writes directly and that the processor never executes directly.5

Modern x86-64 processors contain what amounts to a RISC processor at their core: a fast, regular, load-store architecture with a clean instruction format, executing µops derived from the x86-64 instruction stream. The front end — the decoder, the µop cache, the branch predictor, the instruction fetch unit — is the part of the chip dedicated to maintaining the x86-64 façade. It consumes a significant fraction of the chip's area and power budget. Intel's Skylake microarchitecture, for example, devotes substantial die area to the complex instruction decoder and µop cache that translate x86-64 instructions into the internal format; the actual execution units, which are where the arithmetic and logic happen, are a smaller fraction of the total. The processor's dominant engineering challenge, in other words, is not computation. It is translation — the continuous, high-throughput conversion of a legacy scripting language into a form the hardware can efficiently process.

· · ·

The consequence of this architecture for the question of programming language performance is more radical than most programmers have accepted. If the processor is a µop engine that accepts x86-64 assembly as a scripting language, then the performance of a program is determined not by which source language it was written in, but by the quality of the µop stream that the compiler generates from it. A Fortran compiler that generates good x86-64 will produce programs that run at the same speed as a C compiler that generates equally good x86-64. A Rust compiler that generates excellent x86-64 will produce programs that run faster than a C program compiled by a compiler that generates mediocre x86-64. The source language is irrelevant to the execution speed. The only thing that matters is the machine code — and the machine code, in turn, is only the scripting language that the actual machine runs.

This is not a theoretical proposition. It is measurable. GCC's optimiser — which handles C, C++, Fortran, Ada, Go, and D, among others — generates, from each of those languages, an intermediate representation that is language-agnostic, applies optimisations at that intermediate level, and produces x86-64 assembly that reflects the optimisations rather than the source language. A Fortran program compiled by GCC with -O3 and a C program compiled by GCC with -O3 pass through the same optimisation pipeline. The generated assembly differs only in ways that reflect the programs' different semantics, not the languages' different reputations. LLVM, the compiler infrastructure that underlies Clang, Swift, Rust, Julia, and many other languages, goes further: it defines an intermediate representation, LLVM IR, that is the actual language in which optimisations are expressed, and any language that can be compiled to LLVM IR inherits the full optimisation capability of the LLVM backend without writing a single line of assembly.6

The practical result is something that genuinely surprises programmers trained on the C-as-metal mythology: GCC and Clang, given sufficiently expressive source languages, will often produce faster code than a programmer writing C would, because the compiler has access to information that the programmer does not. A C programmer who wants to sum an array of floats writes a loop. The compiler, given appropriate flags and an architecture that supports it, will auto-vectorise the loop — recognise that the independent iterations can be computed simultaneously using the processor's SIMD instructions — and emit code that computes four or eight or sixteen values simultaneously. The C programmer who tries to match this by hand must write intrinsics: function calls that correspond directly to SIMD instructions, in a notation so arcane that it is indistinguishable from assembly. The compiler knows the architecture better than the programmer. It knows which instruction sequences can be fused by the processor's macro-op fusion unit. It knows which memory access patterns will pollute the cache. It knows which branch structures are amenable to branchless execution using conditional moves. The programmer who writes clean, idiomatic C and allows the compiler to do its work will frequently produce code faster than the programmer who writes clever, low-level C intended to control the machine directly.

The compiler knows the architecture better than the programmer. The programmer who writes clean, idiomatic code and allows the compiler to work will frequently beat the programmer who writes clever, low-level code intended to control the machine.
· · ·

The clearest demonstration of this principle is the behaviour of modern compilers when presented with undefined behaviour in C. C's standard contains a substantial category of operations whose results are, by specification, undefined — signed integer overflow, dereferencing a null pointer, reading from a variable before it is initialised. The C programmer's mental model, shaped by the PDP-11 inheritance, suggests that these operations have specific behaviours on real hardware: signed integer overflow wraps around, null dereferences crash, uninitialised variables contain whatever the register or stack happened to hold. The compiler's perspective is different: undefined behaviour is a promise from the programmer that the operation will not occur, which licenses the compiler to assume it does not occur and to optimise accordingly. A loop whose termination depends on signed integer overflow will, under GCC with optimisation, be transformed by the compiler into an infinite loop — because the compiler is permitted to assume that the overflow never happens and to eliminate the code path that would handle it. This is not a bug. It is the logical consequence of C's specification, correctly applied.

This is the point at which the myth collapses most completely. The C programmer who believes they are controlling the machine directly is, in the face of an optimising compiler, controlling nothing directly. The compiler is free to rearrange, eliminate, fuse, and transform any operation as long as the resulting code is observationally equivalent to the source under the assumption that undefined behaviour does not occur. The machine that is actually executed may bear no structural resemblance to the source that was written. The programmer who reached for C because they wanted to be close to the hardware has, through the mechanism of undefined behaviour optimisation, handed the compiler more freedom to deviate from the source's apparent meaning than any other language's specification provides. The machine is a µop engine running a scripting language generated by an optimiser that the programmer does not control and cannot inspect without reading assembly. The myth was always false. The undefined behaviour optimisations reveal, at the source level, just how false.

· · ·

The x86-64 story has a second strand that the performance discussion obscures: the strand of backwards compatibility as both constraint and miracle. The decision that AMD made in 2003 to extend x86 to 64 bits in a way that preserved the existing software ecosystem was not architecturally clean. A genuinely clean 64-bit architecture — which Intel attempted with Itanium, released in 2001 — would have broken compatibility with the existing x86 software base and required recompilation of the entire ecosystem. Itanium was architecturally interesting and commercially catastrophic: without the ability to run existing x86 software at full speed, it offered nothing to the enterprise customers who were Intel's primary market, and it was discontinued in 2019 after eighteen years of increasingly marginal deployment. AMD64, by extending the existing ISA, gave customers a migration path rather than a replacement, and its adoption was total within five years of its introduction.7

The cost of this compatibility is carried in the processor's front end: the complex decoder that handles 32-bit instructions in 64-bit mode, the legacy mode that executes 16-bit real-mode code for BIOS compatibility, the virtual-8086 mode that exists for operating systems that need to run 16-bit protected-mode code, the various compatibility modes layered on top of each other in a stack of historical obligation. A modern x86-64 processor is, architecturally, a 64-bit RISC core wrapped in fifty years of compatibility obligations, expressed as a variable-length instruction set with encodings ranging from one to fifteen bytes, decoded by a front end of extraordinary complexity into a clean internal format that the core can actually work with. The assembly language programmer who believes they are programming the hardware is programming the outermost layer of this stack — the compatibility façade — and relying on the processor's internal machinery to translate their instructions into something the actual hardware executes.

RISC-V, the open-source instruction set architecture developed at Berkeley, its base user-level specification frozen in 2014, makes the implicit explicit. RISC-V has a regular 32-bit base instruction encoding (with an optional 16-bit compressed extension), a clean and orthogonal register file, a load-store architecture, and no legacy compatibility obligations whatsoever. It is what a clean-sheet 64-bit architecture looks like when the designers are not constrained by the need to execute MS-DOS programs. It is also, in the data centres and embedded systems where it has gained significant deployment, no faster than x86-64 at executing compiled programs — because the processor's performance is determined by the quality of the out-of-order execution engine behind the instruction set, and a well-implemented x86-64 out-of-order core is a better investment than a poorly-implemented RISC-V out-of-order core, regardless of the relative cleanliness of the instruction sets being decoded. The instruction set is the scripting language. The execution engine is the runtime. The scripting language's elegance matters less than the runtime's quality.8

· · ·

The programming language that is most honest about this situation is not C. It is, unexpectedly, every language that targets LLVM IR — because LLVM IR makes the fiction explicit rather than implicit. When a Rust program is compiled, it is not compiled to x86-64 assembly. It is compiled to LLVM IR, which is then compiled to x86-64 assembly by LLVM's backend. The intermediate representation is the point at which the source language ends and the machine begins — and the machine, at that point, is not x86-64 but a platform-independent instruction set whose only job is to be a convenient target for optimisation passes before the final translation to whatever assembly the target processor accepts. The Rust programmer who believes their code is close to the metal is correct in a sense they probably do not intend: they are close to LLVM IR, which is close to µops, which are the instructions the processor actually executes. The x86-64 assembly that appears between LLVM IR and µops is an intermediate format, a scripting language in a pipeline that begins with source code and ends with silicon.

This reframing dissolves the performance hierarchy that the C mythology imposes. If all compiled languages generate LLVM IR or GCC's GIMPLE intermediate representation, and if those intermediate representations are compiled to x86-64 assembly by the same backend, and if that assembly is decoded by the same processor into the same µop stream and scheduled by the same out-of-order engine, then the performance difference between a Rust program and a C program is not a matter of Rust's distance from the metal relative to C's distance from the metal. It is a matter of how well each language's frontend generates the intermediate representation, how well the intermediate representation represents the program's semantics in a way that permits optimisation, and how well the backend's code generator maps those semantics to the target ISA. Languages with richer type systems — that know more about what the programmer intended — can in principle allow the compiler to generate better code than C, because C's type system is too impoverished to express the aliasing and mutation constraints that would permit the most aggressive optimisations. Rust's ownership system, which guarantees that no mutable reference to a value coexists with any other live reference to it, is an aliasing analysis that the Rust compiler gets for free and that the C compiler must either assume conservatively or infer approximately. The richer the language's semantics, the more information the compiler has; the more information the compiler has, the better the µop stream it can generate.9

Languages with richer type systems can in principle allow the compiler to generate better code than C, because C's type system is too impoverished to express the constraints that permit the most aggressive optimisations.
· · ·

The deepest irony of the C-x86 coevolution story is that the relationship ran in both directions, and that the direction from x86 to C is the less acknowledged one. C shaped x86, because Intel and AMD's compiler teams needed to generate fast C code and adjusted the architecture's microarchitectural features to facilitate it. But x86 also shaped C — not through any formal process, but through the accumulated assumptions of generations of C programmers who believed they were programming a PDP-11, wrote code that assumed the PDP-11 memory model, and whose assumptions became, through sheer quantity, the de facto standard for what C programs could expect. Compiler writers who discovered that their optimisations produced incorrect results on programs that C's standard permitted them to transform discovered that the programs had been written by programmers who assumed PDP-11 semantics, and the undefined behaviour that the standard permitted the compiler to exploit was the undefined behaviour that those programmers had relied on without knowing it. The standard was revised — and revised again — to accommodate the reality of deployed programs rather than the theory of the language's design. C's de facto standard is, in several respects, the union of what the PDP-11 did and what the x86 does, codified by experience rather than by intention.

Chisnall's 2018 paper argues that the exit from this situation requires acknowledging that the C abstract machine is not and has not been a description of real hardware for decades, and designing future languages around the hardware that actually exists — processors with deep pipelines, aggressive speculation, SIMD execution units, complex memory hierarchies, and parallelism at every level — rather than the hardware that C implies. The languages being built in the 2010s and 2020s — Rust, Swift, Julia, Zig — are not uniformly designed with this goal explicitly in mind, but they share a structural feature that points toward it: they give the compiler more information about the programmer's intentions than C does, and they are therefore better positioned to take advantage of what the processor actually is rather than what C's abstract machine pretends it to be.

The programmer who wants to control the hardware should learn what the hardware is. It is a µop engine with a speculative out-of-order core, a deep cache hierarchy, SIMD execution units capable of processing sixteen 32-bit integers simultaneously, a branch predictor of uncanny accuracy, and a front end that accepts x86-64 assembly as input and treats it as a scripting language to be decoded, fused, and reordered before any real work is done. Writing C and imagining a PDP-11 does not control this machine. Writing clear, well-typed code in a language that gives the compiler sufficient information to make good decisions about the µop stream does. The compiler is not the enemy of performance. The compiler is the only entity with full information about both the source program and the target machine, and it has been, for most programs most of the time, better at translating the one into the other than the programmer who believed they were close to the metal ever was.

· · ·

There is a kind of programmer for whom this essay will be unsatisfying, because it appears to argue that the details of the machine do not matter — that any language compiled by any capable compiler produces equivalent output, and that the programmer's understanding of the hardware is irrelevant. This is not the argument. The argument is narrower and more precise: that the programmer who understands the hardware in terms of x86-64 assembly is not understanding the hardware at the level that determines performance, because x86-64 assembly is not the level at which the hardware works. The programmer who understands µop throughput, cache line sizes, branch prediction patterns, and vectorisation constraints — who thinks about performance in terms of the processor's actual resources rather than the source language's apparent proximity to them — has a genuine understanding of the hardware and can use that understanding to write code, in any language with a good compiler, that the compiler can translate into an efficient µop stream. The source language is not the constraint. The programmer's understanding of the execution model is the constraint.

The C and x86-64 that inherited the world together did so not because they were honest about what computing was, but because they were useful — because the PDP-11 model, extended to 32 and then 64 bits, was good enough for most programs, and because the processors Intel and AMD built to run those programs were extraordinary engineering achievements even if their engineering was deployed largely in the service of maintaining a fiction. The fiction held for fifty years. It produced Spectre and Meltdown when it finally cracked. And it is cracking now, less dramatically but more permanently, as the compiler infrastructure that the industry built to serve C discovers that it serves every language equally — that Fortran and Rust and Julia and Ada and C all produce µops, and the µops do not know or care what language they came from. The metal was always further away than it seemed. The compiler was always smarter than the programmer believed. The machine was never a fast PDP-11. It was something far stranger and far more interesting, and the languages that acknowledge what it actually is are the ones that will use it best.

1. Dennis Ritchie's account of C's development is given in "The Development of the C Language," a paper presented at the History of Programming Languages conference in 1993 and available from Ritchie's page at Bell Labs. The derivation from B and BCPL, the role of the PDP-7 and PDP-11, and the specific ways in which the PDP-11 architecture influenced C's design decisions are documented there. Ritchie is explicit that several of C's design choices — the representation of arrays as pointers, the sizes of integer types, the behaviour of pointer arithmetic — reflect the PDP-11's architecture directly. The paper is one of the most candid accounts in existence of a programming language being designed for a specific machine rather than as an abstract computational model.

2. The 8086's architecture is documented in Intel's original 8086 Family User's Manual, 1979. The derivation from the 8080 and the segmented memory model are described there. The IBM PC's selection of the 8088 is documented in multiple histories of the personal computer industry; Paul Carroll's Big Blues: The Unmaking of IBM and Robert X. Cringely's Accidental Empires both cover the decision. Intel's own corporate histories acknowledge that the 8086's market dominance was not anticipated by Intel's leadership at the time of its design, and that the architecture's longevity has required continuous extension rather than replacement.

3. Out-of-order execution was introduced in commercial x86 processors with Intel's Pentium Pro in 1995. The architecture, known as P6, implemented a reorder buffer, a reservation station, and an out-of-order execution core that could issue instructions from a window of approximately forty instructions. Subsequent Intel architectures have widened this window substantially; modern cores can have hundreds of instructions in flight simultaneously. The microarchitectural details of Intel's Skylake, Alder Lake, and Raptor Lake architectures are documented in Agner Fog's microarchitecture guides, which are the most complete public documentation of what modern x86 processors actually do when executing programs. Fog's guides reveal that the relationship between source-level program order and execution order is, for realistic programs, essentially nonexistent.

4. David Chisnall, "C Is Not a Low-Level Language: Your computer is not a fast PDP-11," ACM Queue, Volume 16, Issue 2, April 2018; reprinted in Communications of the ACM, Volume 61, Issue 7, July 2018. The Meltdown vulnerability was disclosed by Jann Horn of Google Project Zero and independently by researchers at Graz University of Technology and Cyberus Technology in January 2018. Spectre was disclosed simultaneously, with variant 1 (bounds check bypass) and variant 2 (branch target injection) being the principal attack forms. Both vulnerabilities exploit the processor's speculative execution to allow a user-mode attacker to read memory the attacker should not be able to access. The patches — microcode updates, kernel mitigations, compiler-inserted barriers — reduced performance on certain workloads by up to thirty percent, which is a measure of how much work the processor was doing speculatively to maintain the C abstract machine at acceptable performance.

5. AMD64 was introduced with the Opteron and Athlon 64 processors in 2003. Intel adopted the architecture, renaming it EM64T and later Intel 64, with the Pentium 4 Prescott in 2004. The new calling convention for x86-64 (in the System V ABI used by Unix-like systems) — which passes the first six integer arguments in registers (RDI, RSI, RDX, RCX, R8, R9) rather than on the stack — was a significant ABI change that improved the performance of function calls substantially, because register accesses are faster than stack accesses and because the stack operations required by the 32-bit calling convention consumed execution resources that the 64-bit calling convention did not. The x86-64 ISA reference, Intel's 64 and IA-32 Architectures Software Developer's Manual, runs to approximately five thousand pages, which is a rough measure of the complexity of the scripting language that the hardware accepts.

6. LLVM was originally developed by Chris Lattner at the University of Illinois, beginning around 2000 as graduate research with Vikram Adve; it was the subject of Lattner's 2002 master's thesis. LLVM IR is described in the LLVM language reference manual. The claim that LLVM IR is a genuine intermediate language shared across source languages is demonstrated by the LLVM project's list of supported frontends, which includes Clang (C, C++, Objective-C), Rust (via rustc), Swift, Julia, Crystal, Kotlin/Native, and many others. GCC's intermediate representations — GIMPLE and RTL — serve the same purpose for C, C++, Fortran, Ada, Go, and D. The performance equivalence of programs compiled from different source languages by the same backend has been demonstrated in numerous benchmarks; the Computer Language Benchmarks Game, which measures performance across a variety of benchmark programs and languages, shows that compiled languages using GCC or LLVM backends cluster tightly in performance for compute-bound workloads, with differences reflecting algorithmic choices and aliasing constraints rather than language identity.

7. Intel's Itanium (IA-64) architecture was developed jointly with Hewlett-Packard, announced in 1994, and first shipped in the Itanium processor in 2001. The architecture used VLIW (very long instruction word) design with explicit parallelism exposed to the compiler, rather than hardware-detected ILP, under the assumption that compilers would schedule instructions more effectively than hardware. The performance of the first Itanium was disappointing relative to contemporary x86 processors, and its emulation of existing x86 code was notoriously slow. HP and Intel had assumed that the recompilation of major software to native IA-64 would occur rapidly; it occurred slowly and incompletely. Intel announced the end of the Itanium line in 2019 and shipped the last processors in 2021, having sold the architecture primarily into HP-UX and OpenVMS server markets where it had achieved lock-in. The history of Itanium is the most expensive demonstration in computing history of the proposition that backwards compatibility is worth more than architectural cleanliness.

8. RISC-V was developed at the University of California, Berkeley, by Krste Asanović and David Patterson, with the first specification published in 2011. The architecture is genuinely clean: fixed 32-bit instruction width for the base ISA (with a 16-bit compressed extension), thirty-two general-purpose registers, a load-store architecture with no memory-to-memory operations, and a modular extension system. RISC-V's performance in data centres and embedded systems reflects the quality of its implementations rather than the quality of its instruction set; SiFive's high-performance cores and Western Digital's embedded cores achieve different performance profiles for the same ISA, just as Intel and AMD achieve different profiles for x86-64. The instruction set architecture, once decoded, disappears; only the execution engine remains.

9. The aliasing analysis advantage of Rust over C is one of the best-documented sources of potential performance difference between the two languages. C's type-based aliasing rules allow the compiler to assume that pointers to sufficiently different types do not alias, and C99 added the restrict qualifier as a partial mitigation, letting the programmer assert non-aliasing explicitly; but for pointers of the same type without restrict, the compiler must assume aliasing is possible. Rust's ownership and borrowing system guarantees, at compile time, that no two live mutable references alias the same memory location, giving the compiler complete aliasing information without requiring runtime checks. This allows the Rust compiler to apply optimisations that the C compiler must conservatively decline. The practical magnitude of this difference is workload-dependent; for aliasing-sensitive code, the difference can be significant, while for code with little aliasing, it is negligible. The theoretical upper bound on the performance advantage is the cost of the aliasing-conservative code sequences that C compilers generate — which, in SIMD-heavy code, can be substantial.