
UTF-8 (character encoding)

UTF-8 is a Unicode transformation format, designed by Ken Thompson and Rob Pike, that encodes Unicode characters as variable-length byte sequences; it became the dominant encoding for the web and for software internationalization.

Overview
UTF-8 is a variable-length character encoding for Unicode that represents every Unicode scalar value as a sequence of one to four bytes. It was designed to be backward compatible with ASCII: text consisting only of ASCII characters encodes to exactly the same bytes as it does in ASCII, while the broader Unicode repertoire remains fully representable. The design emphasizes simplicity, compactness for common characters, and robustness when parsing and transmitting text.
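
As a quick illustration using Python's built-in codecs (a minimal sketch; the sample characters are arbitrary), ASCII text round-trips unchanged and other characters take up to four bytes:

    ascii_text = "hello"
    assert ascii_text.encode("utf-8") == b"hello"  # ASCII bytes pass through unchanged

    # One character each at the 1-, 2-, 3-, and 4-byte sequence lengths.
    for ch in ["A", "é", "€", "𝄞"]:
        print(ch, len(ch.encode("utf-8")))  # prints 1, 2, 3, 4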

Design principles
The encoding uses a clear bit-pattern scheme that distinguishes single-byte ASCII from multi-byte sequences. Bytes beginning with 0xxxxxxx represent ASCII characters directly; bytes beginning with 110xxxxx, 1110xxxx, or 11110xxx start two-, three-, or four-byte sequences respectively, and continuation bytes all begin with 10xxxxxx. This pattern allows a decoder to identify the start of a code point and the number of following bytes without external state, supporting self-synchronization and error recovery.
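
Written out directly from these bit patterns, a decoder's lead-byte classification needs nothing beyond the byte itself. The following Python sketch (the name sequence_length is illustrative, not from any particular library) returns the total length of the sequence a byte begins, or 0 if the byte cannot begin one:

    def sequence_length(lead: int) -> int:
        """Total sequence length implied by a lead byte, or 0 if invalid."""
        if lead < 0x80:   # 0xxxxxxx: single-byte ASCII character
            return 1
        if lead < 0xC0:   # 10xxxxxx: continuation byte, never a start
            return 0
        if lead < 0xE0:   # 110xxxxx: starts a two-byte sequence
            return 2
        if lead < 0xF0:   # 1110xxxx: starts a three-byte sequence
            return 3
        if lead < 0xF8:   # 11110xxx: starts a four-byte sequence
            return 4
        return 0          # 11111xxx: no valid UTF-8 sequence begins this way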

Encoding details
UTF-8 maps Unicode scalar values into a compact byte form by distributing the code point's bits across the payload bits of the lead and continuation bytes. Code points from U+0000 to U+007F use one byte, code points up to U+07FF use two bytes, up to U+FFFF use three bytes, and up to U+10FFFF use four bytes. The encoding excludes the UTF-16 surrogate range (U+D800 to U+DFFF), restricts representable values to the Unicode range, and requires the shortest possible form, so no code point can be represented by more than one byte sequence (overlong encodings are invalid).
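
A hand-rolled encoder makes the bit carving concrete. This Python sketch (encode_utf8 is a hypothetical helper, not a standard-library function) picks the sequence length from the scalar value, rejects surrogates and out-of-range values, and emits only the shortest form by construction:

    def encode_utf8(cp: int) -> bytes:
        """Encode one Unicode scalar value into UTF-8 bytes by hand."""
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("not a Unicode scalar value")
        if cp <= 0x7F:        # 0xxxxxxx
            return bytes([cp])
        if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert encode_utf8(0x20AC) == "€".encode("utf-8")  # U+20AC EURO SIGN: E2 82 AC

Because each range uses exactly the sequence length it needs, an encoder structured this way can never produce an overlong form.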

Practical properties
The encoding is byte-oriented and independent of machine endianness, which makes it convenient for network protocols, file systems, and programming languages. Its self-synchronizing nature lets a decoder locate the next code-point boundary from any byte offset, which is valuable for streaming, searching, and recovering from corruption. Malformed input is also easy to detect locally, because lead bytes and continuation bytes use disjoint bit patterns and illegal combinations match no valid sequence, making robust parsers straightforward to implement.
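
Self-synchronization follows directly from the disjoint bit patterns: a decoder dropped at an arbitrary offset only has to skip continuation bytes to reach the next code-point boundary. A minimal Python sketch (resync is an illustrative name):

    def resync(data: bytes, offset: int) -> int:
        """Advance to the start of the next code point at or after offset."""
        # Continuation bytes match 10xxxxxx (0x80..0xBF); everything else
        # is either a lead byte or plain ASCII, i.e. a valid boundary.
        while offset < len(data) and 0x80 <= data[offset] <= 0xBF:
            offset += 1
        return offset

    data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'
    print(resync(data, 2))          # offset 2 lands mid-sequence; prints 3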

Adoption and interoperability
Because UTF-8 preserved ASCII compatibility while enabling global character coverage, it rapidly became the default encoding for the web, email, operating systems, and many programming environments. Its compactness for Latin-script text and its predictable multi-byte structure made incremental migration from legacy encodings practical. Widespread standardization and protocol guidance favoring UTF-8 reinforced its role as a universal interchange format.

Legacy and impact
UTF-8 transformed how software and networks handle text by providing a single, interoperable encoding for all written languages without sacrificing existing ASCII-based infrastructure. Its influence extends beyond data formats into developer tools, libraries, and user interfaces, shaping expectations about text processing and internationalization. The encoding's blend of elegance, practicality, and robustness remains a foundational piece of modern computing, simplifying global communication across platforms and systems.