Essay: Regular Expression Search Algorithm
Background and motivation
Thompson's 1968 paper "Regular Expression Search Algorithm," published in Communications of the ACM, addresses the practical need to run pattern searches, expressed as regular expressions, efficiently over real text. At the time, regular expressions were well understood as a mathematical formalism, but implementations tended to be slow or memory-hungry when applied to large texts or complex patterns. Thompson sought a design that was simple to implement on the computing environments of the era yet robust enough for everyday text-processing tasks such as searching and editing.
The approach starts from the observation that the algebraic operations of regular expressions (concatenation, alternation, and Kleene closure) map naturally to small automaton fragments. By assembling these fragments according to the structure of a regular expression, one obtains a compact, directly executable representation that can be simulated on input text without an expensive full determinization step.
Core construction
The central idea is a systematic construction of a nondeterministic finite automaton (NFA) from a regular expression using epsilon transitions. Each basic element of the expression gets converted into a tiny fragment with designated entry and exit points. Fragments are combined by connecting these points: concatenation wires one fragment into the next, alternation creates a fork with epsilon branches to alternatives, and the Kleene star inserts loops with epsilon links to allow repetition.
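The fragment assembly described above can be sketched in Python. This is a hypothetical reconstruction, not Thompson's code; the 1968 paper compiled the expression directly to IBM 7094 machine code rather than building a data structure. The sketch assumes the expression is already in postfix form, with `.` as an explicit concatenation operator (e.g. `ab.c|*` for `(ab|c)*`):

```python
# Sketch of Thompson's NFA construction from a postfix regular
# expression. Helper names (fragment, post2nfa) are illustrative.

EPS = None  # label used for epsilon transitions

class State:
    def __init__(self):
        self.edges = []  # list of (label, target) pairs

def fragment(label):
    """Fragment for a single literal: start --label--> accept."""
    start, accept = State(), State()
    start.edges.append((label, accept))
    return start, accept

def post2nfa(postfix):
    """Build an NFA fragment for each symbol and combine on a stack."""
    stack = []
    for ch in postfix:
        if ch == '.':                       # concatenation
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            a1.edges.append((EPS, s2))      # wire fragment 1 into fragment 2
            stack.append((s1, a2))
        elif ch == '|':                     # alternation
            s2, a2 = stack.pop()
            s1, a1 = stack.pop()
            start, accept = State(), State()
            start.edges += [(EPS, s1), (EPS, s2)]   # epsilon fork
            a1.edges.append((EPS, accept))
            a2.edges.append((EPS, accept))
            stack.append((start, accept))
        elif ch == '*':                     # Kleene closure
            s1, a1 = stack.pop()
            start, accept = State(), State()
            start.edges += [(EPS, s1), (EPS, accept)]  # enter loop or skip it
            a1.edges += [(EPS, s1), (EPS, accept)]     # loop back or leave
            stack.append((start, accept))
        else:                               # literal character
            stack.append(fragment(ch))
    return stack.pop()                      # (start, accept) of the whole NFA

def nfa_size(start):
    """Count states reachable from `start` (breadth-first walk)."""
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for _, t in s.edges:
            if t not in seen:
                seen.add(t)
                todo.append(t)
    return len(seen)
```

Each literal contributes two states and each operator at most two more, so the automaton's size stays proportional to the expression's length, matching the linear growth described above.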
This fragment-oriented construction is local and compositional, so the NFA grows in proportion to the size of the expression. The design ensures that control flow through the automaton mirrors the nested structure of the expression, and the use of epsilon transitions keeps the explicit number of labeled transitions small and predictable.
Algorithm and implementation
Once the NFA is built, matching proceeds by simulating it on the input string using sets of active states, rather than converting the NFA into a deterministic automaton. The simulation maintains the set of states reachable after reading each input symbol: for each new symbol it computes the states reachable by transitions labeled with that symbol, then closes the result under epsilon transitions. This set-based simulation avoids backtracking through alternatives and bounds the work per input character by the number of NFA states, so matching time is linear in the input length for a fixed expression.
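The step-then-close loop can be illustrated with a small hand-built NFA. This is a sketch, not the paper's implementation; the state numbering and the dictionary representation are arbitrary choices made for the example. The automaton below recognizes `(a|b)*c`:

```python
# Hand-built NFA for (a|b)*c: state -> list of (label, target) edges.
EPS = None  # epsilon label

nfa = {
    0: [(EPS, 1), (EPS, 5)],   # start: enter the loop or skip to 'c'
    1: [(EPS, 2), (EPS, 3)],   # fork: 'a' branch or 'b' branch
    2: [('a', 4)],
    3: [('b', 4)],
    4: [(EPS, 1), (EPS, 5)],   # loop back or leave
    5: [('c', 6)],
    6: [],                     # accepting state
}

def eps_closure(states):
    """All states reachable from `states` via epsilon edges alone."""
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for label, t in nfa[s]:
            if label is EPS and t not in closure:
                closure.add(t)
                todo.append(t)
    return closure

def match(text, start=0, accept=6):
    """Set-based NFA simulation: no backtracking, one pass over text."""
    current = eps_closure({start})
    for ch in text:
        # states reachable by reading `ch` from any currently active state
        step = {t for s in current for label, t in nfa[s] if label == ch}
        current = eps_closure(step)
        if not current:
            return False   # no active states left: reject early
    return accept in current
```

Because `current` is a set, all alternatives are explored simultaneously; an input that matches no branch simply empties the set and fails on the spot instead of backtracking through the alternatives one by one.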
Thompson outlines an implementation-friendly strategy in which fragment construction uses simple pointer manipulations and the runtime simulation uses array- or list-based state sets, both well suited to the limited memory and processing resources of the 1960s. The resulting matcher has predictable space requirements tied to the expression size, and a running time linear in the input length multiplied by a factor bounded by the expression size, which in practice yields fast and reliable matching.
Legacy and influence
The construction and simulation method introduced by Thompson became a foundational technique for practical regular-expression engines and for many Unix text-processing tools. Its emphasis on a compact NFA representation and set-based simulation influenced implementations of grep, editors, and later libraries that prioritized regularity of performance and avoidance of catastrophic backtracking. The phrase "Thompson NFA" now denotes this construction style, and several modern engines and libraries either implement it directly or use it as a conceptual basis for hybrid designs that combine NFA simulation with DFA optimizations.
Thompson's algorithm remains important because it clarifies the trade-offs between compilation cost, runtime predictability, and memory usage for pattern matching. It provided a bridge from the algebraic theory of regular languages to practical systems that perform reliably on real-world text, and its ideas continue to inform the design of fast, safe regular-expression engines today.
Regular Expression Search Algorithm
Paper by Ken Thompson describing an efficient algorithm for searching using regular expressions; the work influenced text processing tools and regular-expression engines used in Unix utilities.
- Published: 1968
- Type: Essay
- Genre: Computer Science, Algorithms
- Language: English
About the Author

Ken Thompson
Ken Thompson is a pioneering computer scientist known for co-creating Unix, developing B and UTF-8, advancing computer chess, and co-designing Go.
- Occupation: Scientist
- From: USA
Other Works
- ed (text editor) (1969)
- B (programming language) (1969)
- Unix Programmer's Manual (1971)
- grep (1973)
- The UNIX Time-Sharing System (1974)
- Reflections on Trusting Trust (1984)
- Plan 9 from Bell Labs (operating system) (1992)
- Inferno (operating system) (1997)
- Go (programming language) (2009)