On character encodings, universalist pretensions, and the honesty of what was lost
"The purpose of abstracting is not to be vague, but to create a new semantic level in which one can be absolutely precise."
— Edsger W. Dijkstra, The Humble Programmer, 1972
"There is nothing more deceptive than an obvious fact."
— Arthur Conan Doyle, The Boscombe Valley Mystery, 1891
The standard defence of UTF-8 everywhere rests on a claim that is rarely examined closely: that Unicode solved the problem of the same text producing different bytes in different systems, and that the magnitude of this achievement justified universal adoption. The claim is true in one direction and false in the other. Unicode did eliminate the failure in which "Боб" encoded in KOI8-R and "Боб" encoded in Windows-1251 produced different bytes. It then introduced, silently and without fanfare, a new failure in which café encoded as UTF-8 produces different bytes depending on which of its equally valid Unicode representations the system chose — and provides no label to tell you which one was used. The first failure was visible and self-announcing. The second is invisible and self-concealing. The argument of this essay is not that code pages were faultless. They had faults. The argument is that their faults were honest, and that a system whose faults announce themselves is, in any engineering sense that matters, preferable to one whose faults hide.
This essay proceeds through four claims: that code pages had structural properties which UTF-8 destroys; that Unicode's normalisation problem reintroduces the very failure it was built to prevent, in a more dangerous form; that the universalisation of the solution imposed full complexity costs on systems that shared none of the problem; and that alternative architectures — at the markup layer and the byte-stream layer — were viable for many of the cases where the multi-script problem actually arose, though not for all of them. That last qualification is important. The concession will be made explicitly, because overstating the case is precisely where this argument is most vulnerable and most deserving of honesty.
A code page is a bijective mapping from a set of at most 256 byte values to a set of at most 256 characters. Bijective: every byte maps to exactly one character, every character maps to exactly one byte, the mapping is total in both directions. There are no partial cases, no surrogate pairs, no reserved ranges, no byte sequences valid in one context and invalid in another. A byte is a character. This identity, exact and complete, has consequences that ramify through every subsequent operation on text.[1]
Because a byte is a character, string length in bytes and string length in characters are the same number. Indexing into a string at position n is O(1): the character at position n is the byte at offset n. Scanning backward through a string — as RTRIM must, to find the last non-whitespace character — is trivially safe: decrement the pointer, read a byte, know a character. Substrings are contiguous byte ranges. The set of valid strings is completely and statically defined. None of these properties are exotic. They are the preconditions for every elementary string operation to have a simple, correct, and obvious implementation. They are what UTF-8 destroys — not for some narrow class of operations reached on exceptional days, but for LEN, SUBSTR, LTRIM, RTRIM, INSTR: the functions a programmer reaches for before breakfast. In a code-page world, each of these is O(1) or trivially correct by construction. In a UTF-8 world, each requires careful multi-byte handling, and the handling fails silently whenever a programmer implements it without that care — which is, in most languages, what a programmer does naturally.[2]
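The divergence is directly observable. A minimal illustration in Python, using a decomposed e-acute (the string literal is chosen for illustration):

```python
# "café" written with a decomposed e-acute: 'e' followed by U+0301,
# the combining acute accent.
s = "cafe\u0301"

print(len(s))                  # 5 code points, though 4 characters are displayed
print(len(s.encode("utf-8")))  # 6 bytes
print(s[-1] == "\u0301")       # True: naive last-character indexing yields the bare accent
```

An RTRIM or SUBSTR written against this string under byte- or code-point-indexing assumptions misbehaves in exactly the silent way described above.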
The advocate of UTF-8 will respond with the O(1) counterproposal: use UTF-32. Encode every code point as a fixed four-byte value, recover O(1) indexing. The proposal is coherent and the objection is immediate: UTF-32 is profligate, inflating every Latin document to four times its natural size. The answer to this objection is variable-length encoding — shorter sequences for common characters, longer ones for rare ones. But this is the definition of compression, and compression is in provable tension with random access: a variable-length code assigns shorter sequences to common symbols at the cost of making any symbol's position computable only by scanning from the start. UTF-8 is, in its logical structure, a prefix-free variable-length code over Unicode code points. It is compression applied at the encoding layer. Code pages achieved space efficiency and O(1) access simultaneously, not by design cleverness but as a mathematical consequence: when your alphabet fits in 256 positions and your storage unit is a byte, both properties follow without effort. UTF-8 cannot have both simultaneously, and it achieves neither fully — it is less compact than general-purpose compression, and slower than a fixed-width encoding — because it is attempting to be both at once.
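The variable-width structure, and the scan it forces, can be sketched in a few lines of Python; `codepoint_at` is a hypothetical helper written here for illustration, not a library function:

```python
# UTF-8 assigns 1 to 4 bytes per code point.
for ch in ["a", "é", "€", "𝄞"]:
    print(ch, len(ch.encode("utf-8")))   # 1, 2, 3, 4 bytes respectively

def codepoint_at(buf: bytes, n: int) -> str:
    """Return the n-th code point of a UTF-8 byte string.

    O(n): the position is recoverable only by scanning from the start,
    counting lead bytes (any byte not of the form 0b10xxxxxx).
    """
    count = -1
    for i, b in enumerate(buf):
        if b & 0xC0 != 0x80:          # lead byte: a new code point starts here
            count += 1
            if count == n:
                j = i + 1             # consume this code point's continuation bytes
                while j < len(buf) and buf[j] & 0xC0 == 0x80:
                    j += 1
                return buf[i:j].decode("utf-8")
    raise IndexError(n)

print(codepoint_at("naïve".encode("utf-8"), 2))   # ï
```

In a code-page encoding the entire function body collapses to `buf[n]`.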
The structural argument about string operations, though true, is not the sharpest part of the case. The sharpest part concerns normalisation, and it strikes directly at the premise on which the argument for Unicode rests.
The premise is this: Unicode solved the problem of the same text producing different bytes in different systems. Before Unicode, "café" encoded in Windows-1252 and "café" encoded in a different code page might produce different bytes for the accented character; a system that received both without consulting the encoding label would compare them incorrectly. Unicode was presented as the solution: a single universal encoding in which "café" always produces the same bytes on every machine.
This premise is false. In UTF-8, the word café has at least two valid representations. The final character, e-acute, may be encoded as U+00E9 — a single precomposed code point, two bytes in UTF-8 — or as U+0065 followed by U+0301, the letter e followed by a combining acute accent, three bytes in UTF-8. Both sequences are valid Unicode. Both are valid UTF-8. Both render identically in every correctly implemented font renderer. A byte-level comparison of the two strings returns false. A code-point-level comparison also returns false: one string has four code points and the other has five. Only a comparison that first applies Unicode normalisation — reducing both strings to a canonical form before comparing — returns true. The word naïve has the same property. So do Ångström, résumé, fiancée, and every other word in any language that uses diacritical marks, which encompasses the ordinary vocabulary of French, German, Spanish, Portuguese, Swedish, Norwegian, and dozens of other languages spoken by hundreds of millions of people. The affected text is not contrived edge-case input. It is the routine content of any document written in any European language.[3]
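The claim is checkable in any language with a Unicode library; in Python, via the standard `unicodedata` module:

```python
import unicodedata

nfc = "caf\u00e9"    # é as one precomposed code point
nfd = "cafe\u0301"   # e followed by the combining acute accent

print(nfc == nfd)                                  # False
print(len(nfc), len(nfd))                          # 4 and 5 code points
print(unicodedata.normalize("NFC", nfd) == nfc)    # True only after normalisation
```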
The comparison with code-page failures is structural, not rhetorical. When a system received a KOI8-R string where it expected ISO 8859-1, the result was mojibake: visually corrupted text in which Cyrillic letters rendered as an unreadable run of accented Latin glyphs or boxes. Mojibake is ugly. It is also unmistakable. A developer who caused it knew, within seconds of looking at the output, that an encoding mismatch had occurred and that a conversion step was missing. The error wrote itself on the screen. A normalisation mismatch produces no such signal. The text renders correctly. The application behaves incorrectly: a search for "café" returns no results because the database stores NFD and the query arrives in NFC; a user login fails because their username was stored with a precomposed accent and their current input method produces a decomposed one; a filename created on macOS is not found from Linux because HFS+ normalised to NFD at storage time and the querying process supplies NFC. In each case, the text looks exactly right. The system is wrong in a way that is invisible until the damage — the missed search result, the failed authentication, the missing file — has already propagated through every layer that touched the string.[4]
Unicode has four normalisation forms: NFC, NFD, NFKC, and NFKD. The standard does not specify which form a conforming system must use; it defines the forms, describes their properties, and leaves the choice to the implementer. The result is that any two systems exchanging UTF-8 text — both conforming, both correct, both using the same encoding — may store and compare that text in ways that are mutually incompatible, and the encoding provides no label that exposes the incompatibility. The code-page era had labels. A document's charset declaration told you what encoding was in use; a missing or wrong declaration caused visible failure that demanded investigation. The UTF-8 era has one label — "UTF-8" — which says nothing about normalisation form, nothing about which valid representation of café was chosen, and nothing about whether string operations are performed at the byte, code-point, or grapheme-cluster level. The universalism of the encoding actively conceals the plurality of the semantics it encodes. The old encoding problem is not solved. It is re-expressed in a form that does not announce itself.
Collation — the ordering of strings — exposes a third dimension of the failure, and does so by demonstrating that the universalist ambition was not fully coherent to begin with. There is no universal answer to whether "ä" sorts before or after "z". In Swedish, "ä" is a distinct letter that follows "z". In German, it is a variant of "a" for most sorting purposes. In Danish and Norwegian, the same glyph participates in a different ordering among the Scandinavian extra letters. These are not deficiencies awaiting correction. They are the nature of collation itself: an ordering convention of a language and culture, not a property of a glyph extractable from its bit pattern.
A code page encoded one locale's character repertoire. Its collation table was therefore a table of at most 256 entries mapping byte values to sort weights — specifiable by any competent native speaker in an afternoon, producing an O(n) sort with a negligible constant. The Unicode Collation Algorithm, defined in Unicode Technical Standard #10, approaches the problem differently: it defines a Default Unicode Collation Element Table mapping every Unicode code point to a sequence of weighted collation elements, establishing a default ordering over the entire character repertoire. This default is wrong for every locale that has ordering conventions, which is every locale. The Unicode project's response is the tailoring mechanism: per-locale overrides to the default, maintained in the Common Locale Data Repository. Swedish's tailoring places "å", "ä", and "ö" after "z", in that order. German's specifies phonebook ordering. Every locale with non-trivial ordering conventions requires a tailoring, and each tailoring is, in its logical structure, exactly what the code-page collation table was: a locale-specific specification of how a locale's characters sort. The code-page table was a primary specification. The CLDR tailoring is a correction to a universal default that was wrong from the start. The destination is identical. The route is an order of magnitude more complex.[5]
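The shape of the code-page collation table is easy to reconstruct. A sketch in Python — the weight assignments are illustrative, standing in for the at-most-256-entry table a native speaker could specify in an afternoon:

```python
# Per-character sort weights in Swedish alphabetical order (... x, y, z, å, ä, ö).
SV_WEIGHTS = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyzåäö")}

def sv_key(word: str) -> list:
    """A code-page-style sort key: one table lookup per character."""
    return [SV_WEIGHTS[ch] for ch in word.lower()]

words = ["äpple", "zon", "apa", "öga", "ål"]
print(sorted(words, key=sv_key))   # Swedish order: apa, zon, ål, äpple, öga
print(sorted(words))               # raw code-point order puts ä before å — wrong for Swedish
```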
The pattern across normalisation, collation, and bidirectional text handling is consistent: the Unicode model defers locality into separate layers — separate standards, separate working groups, separate documents — where it reappears in the form it took under the code-page model, expressed now as correction to a universal default rather than as primary specification. A system that correctly implements UTF-8 must additionally implement normalisation and enforce a form consistently at ingestion, implement locale-aware collation, and handle bidirectional text control characters. Each is a separate problem requiring separate implementation. Code pages collapsed these requirements into a single artefact — the code page itself, with its character repertoire and its collation table — whose scope was declared and whose correctness was verifiable within that scope.
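The normalisation requirement means, at minimum, a normalising wrapper at every point where text enters the system. A sketch, assuming NFC is the form chosen — the choice itself being exactly the part the standard leaves open:

```python
import unicodedata

def ingest(text: str) -> str:
    # Enforce one normalisation form at the system boundary so that
    # equality, search, and indexing all operate on a single representation.
    return unicodedata.normalize("NFC", text)

stored = ingest("cafe\u0301")   # decomposed input, e.g. from an HFS+ path
query  = ingest("caf\u00e9")    # precomposed input, e.g. from a keyboard
print(stored == query)          # True: one form at the boundary
```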
Having made the case for the prosecution, the essay must now address what it cannot dismiss: the multi-script problem is real, and code pages as standardised could not address it. A document containing both Arabic and English, a database that must store user names from arbitrary locales, an operating system interface handling filenames in any writing system — none of these could be served by a single code page. The question is not whether the requirement existed. It is how widely the requirement was distributed, and whether satisfying it demanded the full complexity of a universal byte-level encoding to be imposed on every system that processes text.
For the class of systems where the multi-script problem arose most acutely in practice — documents, web pages, and email — a markup-layer approach was a viable alternative for many situations, and it was already deployed. HTML's numeric character references allowed a document declared as ISO 8859-1 to represent any Unicode character as &#N; without abandoning its code-page encoding for the bulk of its content. A French web page in ISO 8859-1 that needed to display a Greek letter or a Chinese character could express those characters as entity references; the rest of the document retained the byte-character identity, the trivial collation, and the fixed-width properties of the underlying code page. The mechanism was part of every HTML specification from 2.0 onwards and was implemented by every browser in existence. SGML, HTML's parent, formalised character set declarations in document type definitions; the infrastructure for explicit script-switching at the markup layer was not merely theoretically available when Unicode was standardised — it was universally deployed.[6]
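The mechanism survives in standard codec machinery. Python's `xmlcharrefreplace` error handler produces exactly this hybrid — code-page bytes for in-repertoire characters, numeric references for everything else:

```python
# A Latin-1 document body carrying one Greek character as an entity reference:
# 'é' is in the code page (0xE9); 'α' is escaped as &#945;.
page = "café meets α".encode("latin-1", errors="xmlcharrefreplace")
print(page)   # b'caf\xe9 meets &#945;'
```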
The limitations of this approach must be stated, because this is where the argument is most attackable and where overstating the case does the most damage. Numeric entity references are verbose; a passage in Greek rendered as &#913;&#955;&#966;&#945; is not readable source text and is burdensome to author. They do not compose naturally with full-text search, which must decode them before indexing. For text that is mixed at the character level rather than the passage level — names that interleave scripts, technical terms borrowed from another script into running prose — they impose impractical authoring overhead. And for the class of interfaces where the multi-script requirement is structurally unavoidable and no markup layer exists — operating system filenames, database column values received from unknown sources, binary wire protocol fields — they are simply not applicable. A filename cannot carry a charset declaration. A fixed-width protocol field cannot embed escape sequences. These interfaces genuinely required a universal encoding, and the argument that they did not is not available.
At the byte-stream level, ISO 2022 demonstrated that a code-switching architecture was implementable: escape sequences shift the active character set within a stream, and ISO-2022-JP used this mechanism for Japanese email for decades. The mechanism makes script boundaries explicit — a processor encounters an escape sequence and knows it is changing character-set context — which is the structural virtue code pages share and UTF-8 lacks. But ISO 2022's limitations were also genuine. A lost escape sequence, through truncation or corruption, leaves the remainder of the stream undecodable; the stateful nature of the encoding made streaming decoders difficult to implement correctly and error-prone under partial reads. ISO 2022 was a proof that the federated idea was achievable, not a demonstration that it was ready to bear the full load of international text interchange without real engineering cost.[7]
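The self-synchronising property that footnote 7 credits to UTF-8 is mechanically simple: continuation bytes all match 0b10xxxxxx, so a reader dropped at an arbitrary offset recovers the next code-point boundary in at most three steps. A sketch (`resync` is an illustrative helper, not a library function):

```python
def resync(buf: bytes, pos: int) -> int:
    """Return the first code-point boundary at or after pos by skipping
    continuation bytes (0b10xxxxxx). Nothing comparable exists in ISO 2022,
    where the active character set is carried by state, not by the bytes."""
    while pos < len(buf) and buf[pos] & 0xC0 == 0x80:
        pos += 1
    return pos

data = "naïve".encode("utf-8")   # b'na\xc3\xafve'
print(resync(data, 3))           # offset 3 lands mid-'ï'; next boundary is 4
print(resync(data, 2))           # offset 2 already begins a sequence (lead byte 0xC3)
```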
The correct framing of the alternative, then, is not that it was fully adequate and Unicode was unnecessary. It is that the multi-script problem, in the form that required a universal byte-level encoding — arbitrary character repertoires in contexts with no markup layer — described a bounded class of interfaces. For that class, Unicode was a genuine solution. For the document-centric majority of text-processing systems, where the problem arose at the passage level and a markup-layer mechanism could have addressed it, the alternatives were workable for many cases, even if not for all. The universalisation of the encoding solution imposed the full complexity cost of the universal model on every system in every class, including the ones for which the problem did not arise in the form that made the complexity necessary.
The table below sets the two models against each other across the dimensions that determine what correctness requires of a programmer implementing a text-processing system.
| Property | Code Page | UTF-8 / Unicode |
|---|---|---|
| Byte–character identity | Exact (1 byte = 1 character) | None (1–4 bytes per code point) |
| String length | Byte count = character count | Byte count ≠ code point count ≠ grapheme count |
| Elementary string ops | O(1) by construction (LEN, SUBSTR, LTRIM, RTRIM) | O(n) or requiring careful multi-byte handling |
| Space efficiency | Optimal by construction | Achieved via compression; structurally incompatible with O(1) access |
| Canonical text representation | Unique: one encoding per string | Multiple: NFC, NFD, NFKC, NFKD — all valid, all unlabelled |
| Encoding mismatch failure mode | Mojibake: visible, immediate, self-diagnosing | Silent inequality: correct rendering, wrong comparison, deferred discovery |
| Collation rules | Per-page primary specification | Universal default wrong for all locales; corrected via CLDR tailoring |
| Stability of character repertoire | Frozen at standardisation | Grows with each Unicode release; no stable interoperability target |
| Multi-script in documents | Via numeric entity references (explicit, verbose, workable for many cases) | Native in encoding (implicit, full cost paid by all systems) |
| Scope of promise | Narrow and fulfilled | Universal and partially deferred into separate layers |
Code pages were not a failed attempt at Unicode. They were a succeeded attempt at something else: an encoding whose correctness properties — bijective, complete, fixed-width, statically defined — followed directly from its declared scope, and whose failure modes were visible precisely because the scope was declared. The encoding said: within this locale, for this script, these bytes are these characters, and this is all I will handle. The failure mode, when the declaration was wrong or absent, was immediate and conspicuous. It demanded correction.
The word everywhere is doing enormous work in "UTF-8 everywhere," and it is worth examining what it was intended to achieve. The most sympathetic reading is that it was trying to eliminate encoding errors at system boundaries — the mojibake, the wrong-charset comparison, the missing conversion step. These errors were real. But the solution to encoding errors at boundaries is not to impose a universal encoding on every system regardless of its requirements; it is to enforce correct labelling and correct conversion at the boundaries where different encodings meet. A code-page document's encoding was declarable and, when declared, checkable. The problem was not the plurality of encodings. It was systems that omitted or ignored the declaration. Replacing that plurality with a single encoding that provides no label for its most consequential semantic property — normalisation form — does not eliminate the underlying class of failure. It relocates it. The failures are now quieter, more subtle, and considerably harder to diagnose, because nothing in the surface presentation of the text indicates that an encoding problem is the cause.
[1] The argument in this essay is directed at single-byte code pages: ISO 8859-1 through ISO 8859-16, the Windows-125x series, KOI8-R, KOI8-U, and similar. Double-byte character set encodings for East Asian languages — Shift-JIS, GBK, EUC-JP — were already departures from the bijective model, using lead-byte ranges to signal two-byte sequences and carrying the same variable-length problems that UTF-8 introduces at scale. They were acknowledged engineering workarounds for scripts whose character repertoires exceeded 256 positions and are not the subject of the present argument.
[2] The security implications of the byte-character confusion in UTF-8 have been documented across two decades of vulnerability disclosures. The canonical example is the IIS directory traversal vulnerability of 2001, in which ../ encoded as an overlong UTF-8 sequence bypassed path sanitisation that operated on raw bytes while filename resolution operated on decoded characters. The structural pattern — two components of the same system applying different character-boundary assumptions to the same byte stream — has recurred in numerous forms, and its root cause is always the same: the encoding does not make character boundaries self-evident, and any implementation that does not handle them explicitly is silently wrong for a non-trivial fraction of possible inputs. The fact that the fraction depends on whether the input contains non-ASCII characters — and that most test suites exercise only ASCII — is precisely why such failures routinely survive testing.
[3] The precomposed form of e-acute, U+00E9, encodes to the two-byte UTF-8 sequence 0xC3 0xA9. The decomposed form, U+0065 U+0301 (Latin small letter e followed by combining acute accent), encodes to the three-byte sequence 0x65 0xCC 0x81. Both are valid UTF-8. Both render as é. They are canonically equivalent under the Unicode standard, which defines canonical equivalence as the relationship between sequences of code points that represent the same abstract character and that must be treated as identical by conforming implementations when performing canonical comparison. The standard does not require strings to be stored in a canonical form; it only requires that comparison be normalisation-aware when equivalence is required. The gap between these two requirements — storage without normalisation, comparison without normalisation-awareness — is where the failures live.
[4] The HFS+ file system, used on macOS until 2017, normalised filenames to NFD at the point of storage. The Linux ext4 file system performs no normalisation. A file created on macOS with an accented name is stored in NFD. A Linux process supplying the NFC form of the same name — as keyboard input methods typically produce — fails to find it. The file exists; the name is correct to visual inspection; the bytes do not match. This failure affected every developer who worked across macOS and Linux during the HFS+ era. The parallel with mojibake is instructive: mojibake is an encoding mismatch made visible by rendering; the HFS+/ext4 failure is a representation mismatch concealed by correct rendering and revealed only by failed file lookups. Both are the same structural problem — same text, different bytes — dressed differently by the encoding's failure to label its semantic choices.
[5] The Unicode Collation Algorithm is specified in Unicode Technical Standard #10. The CLDR collation tailorings cover hundreds of locales and are logically necessary because the DUCET default must, by construction, establish an ordering between characters from different scripts that have no natural ordering relationship. Whatever ordering it establishes will be wrong for some locale's conventions. The tailoring mechanism exists to correct this — which is to say, it exists because the universal default was wrong. The code-page approach, which defined collation rules as primary specification within the page's locale, did not require this correction, because it did not create the universal fiction that then required undoing.
[6] Numeric character references were specified in RFC 1866 (HTML 2.0, 1995), which defined ISO 8859-1 as the document character set, and they have been part of every HTML specification since. HTML 4.01 (1999), which governed web development for a decade, widened the document character set to ISO 10646 while continuing to permit code-page transfer encodings — HTTP/1.1 made ISO-8859-1 the default charset for text content — and relied on character references for characters outside the declared encoding. SGML, HTML's parent, defined character set declarations as part of the document type definition in ISO 8879:1986. The HTML5 specification and the WHATWG living standard subsequently mandated UTF-8 as the document encoding, removing the option of declaring a code-page encoding and relying on character references for out-of-page content. The mandate was well-motivated — reducing the encoding errors that arose from incorrect or missing charset declarations — but it was a mandate at the document layer that inherited all the complexity of the Unicode model, including normalisation, without providing the normalisation label that would have made the inheritance safe.
[7] ISO 2022 is defined in ISO/IEC 2022:1994. ISO-2022-JP is specified in RFC 1468 and remains listed in the WHATWG encoding specification, though new use is discouraged. The fragility of escape-sequence-based encoding in streaming contexts is genuine: a truncated or corrupted stream that loses an escape sequence leaves all subsequent bytes attributed to the wrong character set, with no self-synchronising mechanism to recover. UTF-8's self-synchronising property — the ability to re-enter a stream at any byte and locate the next code-point boundary by inspecting at most three further bytes — is a real engineering advantage over ISO 2022. It does not, however, address the normalisation problem, which is orthogonal to synchronisation and is the stronger of the two objections to the universalist encoding model.