Design of Emerge

In this document, we go over some of the design decisions and rationales behind Emerge.

Regular Expression

Language Design

The following features are NOT included in Emerge's Regular Expression language. They seem unnecessary for the purpose of designing and defining tokens of a language.

Backreference
Non-capturing group modifier ?:
Lookarounds ?= ?! ?<= ?<!
Anchors:
- Word boundary \b
- Non-word boundary \B
- Start of string only \A
- End of string only \Z
- End of string only (not newline) \z
- Previous match end \G

The following features are implemented slightly differently in Emerge's Regular Expression language.

The . (dot) matches any Unicode character in the range 0x00–0x10FFFF, including the newlines.
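
As a point of comparison, the sketch below contrasts this rule with Python's `re` module, where `.` skips newlines unless `re.DOTALL` is set. The helper `emerge_dot_matches` is a hypothetical name used only for illustration; it is not part of Emerge.

```python
import re


def emerge_dot_matches(ch: str) -> bool:
    """Hypothetical helper: '.' accepts any single code point in 0x00..0x10FFFF."""
    return len(ch) == 1 and 0x00 <= ord(ch) <= 0x10FFFF


# Python's own '.' rejects "\n" by default and accepts it only with re.DOTALL.
python_default = re.fullmatch(r".", "\n") is not None
python_dotall = re.fullmatch(r".", "\n", re.DOTALL) is not None
```

Under the rule described above, Emerge's dot behaves like the `re.DOTALL` variant by default.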

Parser Design

For building a Lexer from an EBNF input, we need to parse regular expression patterns in the input, so we can construct a DFA for each pattern. To this end, we need to first build a parser for regular expressions.

Building a regex parser is fairly straightforward. Implementing a separate lexer and parser for regular expressions would add inessential complexity (for example, there are no whitespace characters to strip out).

Instead, a single simple parser for Emerge's regular expressions handles the terminal symbols directly. This parser is implemented as a top-down parser using parser combinators.

A parser is a function that accepts a stream of input characters and returns a parsing result. A parser combinator is a higher-order function that accepts one or more parsers and returns a new parser. Using this functional programming style, we can implement a context-free grammar (Type-2 grammar) as a single function that receives an input stream and returns an abstract syntax tree.
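
The idea can be sketched in a few lines of Python. This is a minimal illustration, not Emerge's actual parser: the combinator names (`char`, `seq`, `alt`, `many`, `fmap`), the tuple-based AST shape, and the tiny grammar (literals, `*`, concatenation, `|`, no groups) are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# A parser takes (text, position) and returns (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[object, int]]]


def char(pred) -> Parser:
    """Match one character that satisfies pred."""
    def parse(s, i):
        return (s[i], i + 1) if i < len(s) and pred(s[i]) else None
    return parse


def seq(*parsers) -> Parser:
    """Run parsers one after another and collect their values in a list."""
    def parse(s, i):
        values = []
        for p in parsers:
            r = p(s, i)
            if r is None:
                return None
            v, i = r
            values.append(v)
        return values, i
    return parse


def alt(*parsers) -> Parser:
    """Return the result of the first parser that succeeds."""
    def parse(s, i):
        for p in parsers:
            r = p(s, i)
            if r is not None:
                return r
        return None
    return parse


def many(p) -> Parser:
    """Apply p zero or more times."""
    def parse(s, i):
        values = []
        while (r := p(s, i)) is not None:
            v, i = r
            values.append(v)
        return values, i
    return parse


def fmap(p, f) -> Parser:
    """Transform the value produced by p with f."""
    def parse(s, i):
        r = p(s, i)
        return None if r is None else (f(r[0]), r[1])
    return parse


# Grammar: expr = term ('|' term)* ; term = factor+ ; factor = literal '*'*
literal = fmap(char(lambda c: c not in "|*()"), lambda c: ("char", c))
star = char(lambda c: c == "*")
factor = fmap(seq(literal, many(star)), lambda v: ("star", v[0]) if v[1] else v[0])
term = fmap(seq(factor, many(factor)),
            lambda v: ("concat", [v[0], *v[1]]) if v[1] else v[0])
bar_term = fmap(seq(char(lambda c: c == "|"), term), lambda v: v[1])
expr = fmap(seq(term, many(bar_term)),
            lambda v: ("alt", [v[0], *v[1]]) if v[1] else v[0])


def parse_regex(text: str):
    """Parse the whole input into an AST or raise ValueError."""
    r = expr(text, 0)
    if r is None or r[1] != len(text):
        raise ValueError(f"invalid pattern: {text!r}")
    return r[0]
```

Each grammar rule is itself a parser built from smaller parsers, so the grammar and its implementation have the same shape. `alt` is included for completeness even though this small grammar does not need it.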

We will later use regular expression ASTs to construct DFAs needed for generating a lexer for an EBNF grammar.

Extended Backus-Naur Form

Language Design

The following terminal symbols are removed from Emerge's EBNF language for simplicity and brevity.

Concatenation (,)
Termination (;)
Single quotation (')

The Solidus (Slash) character (/) is added to Emerge's EBNF language for defining regex patterns.

Lexer Design

Lexer DFA

In the diagram above, Unicode covers all characters from 0x00 to 0x10FFFF.
The DFA checks whether it is in a final state only after it reads an input symbol with no valid transition, that is, when the next state would be an error.
- States 10 and 12 are checked only after such a symbol, which lets the DFA distinguish { from {{.
- Likewise, states 11 and 13 are checked only after such a symbol, which lets the DFA distinguish } from }}.
After emitting a lexeme, the DFA resets to state 0.
IDENT tokens must start with a lowercase letter (a–z).
TOKEN tokens must start with an uppercase letter (A–Z).
String constraints:
- The empty string "" is allowed.
- The lexer recognizes only these escape sequences:
  - \\ \' \" \t \n \r
  - \x[0-9A-Fa-f]{2}
  - \u[0-9A-Fa-f]{4}
  - \U[0-9A-Fa-f]{8}
Regular expression constraints:
- The empty regex is not allowed (// starts a single-line comment).
- After a backslash, any character may be escaped (the regex parser later validates which escapes are legal).
Comment constraints:
- Empty comments // and /**/ are allowed.
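
The rule that final states are checked only after a symbol with no valid transition (longest match), followed by a reset to state 0, can be sketched as follows. This is a minimal illustration limited to the brace states 10 to 13; the token names (`LBRACE`, `LLBRACE`, `RBRACE`, `RRBRACE`) and the `tokenize` function are assumptions, not Emerge's actual identifiers.

```python
# Transitions for the brace states only: 0 -{-> 10 -{-> 12 and 0 -}-> 11 -}-> 13.
TRANSITIONS = {(0, "{"): 10, (10, "{"): 12, (0, "}"): 11, (11, "}"): 13}

# Final states and the (hypothetical) token names they produce.
ACCEPTING = {10: "LBRACE", 12: "LLBRACE", 11: "RBRACE", 13: "RRBRACE"}


def tokenize(text: str):
    tokens = []
    state, start, i = 0, 0, 0
    while True:
        ch = text[i] if i < len(text) else None  # None marks the end of input
        nxt = TRANSITIONS.get((state, ch))
        if nxt is not None:
            # Valid transition: keep consuming without looking at final states.
            state = nxt
            i += 1
            continue
        # No valid transition: only now is the current state checked.
        if state == 0:
            if ch is None:
                return tokens
            raise ValueError(f"unexpected character {ch!r} at offset {i}")
        if state not in ACCEPTING:
            raise ValueError(f"incomplete token at offset {start}")
        tokens.append((ACCEPTING[state], text[start:i]))
        state, start = 0, i  # reset to state 0; the offending symbol is re-read
```

For the input `{{{`, this yields one `{{` lexeme followed by one `{` lexeme, because the DFA keeps moving while a transition exists and only then inspects the state it stopped in.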

Lexer DFA Code

digraph "DFA" {
  rankdir=LR;
  concentrate=false;
  node [shape=circle];
  edge [color=darkblue fontcolor=red];

  start [style=invis];

  0 [label="0" shape=circle style=filled color=teal];
  1 [label="1" shape=doublecircle style=filled color=khaki];
  2 [label="2" shape=doublecircle style=filled color=khaki];
  3 [label="3" shape=doublecircle style=filled color=skyblue];
  4 [label="4" shape=doublecircle style=filled color=skyblue];
  5 [label="5" shape=doublecircle style=filled color=skyblue];
  6 [label="6" shape=doublecircle style=filled color=skyblue];
  7 [label="7" shape=doublecircle style=filled color=skyblue];
  8 [label="8" shape=doublecircle style=filled color=skyblue];
  9 [label="9" shape=doublecircle style=filled color=skyblue];
  10 [label="10" shape=doublecircle style=filled color=skyblue];
  11 [label="11" shape=doublecircle style=filled color=skyblue];
  12 [label="12" shape=doublecircle style=filled color=skyblue];
  13 [label="13" shape=doublecircle style=filled color=skyblue];
  14 [label="14" shape=doublecircle style=filled color=skyblue];
  15 [label="15" shape=doublecircle style=filled color=skyblue];
  16 [label="16" shape=circle];
  17 [label="17" shape=doublecircle style=filled color=tan1];
  18 [label="18" shape=circle];
  19 [label="19" shape=circle];
  20 [label="20" shape=circle];
  21 [label="21" shape=circle];
  22 [label="22" shape=doublecircle style=filled color=orchid1];
  23 [label="23" shape=circle];
  24 [label="24" shape=circle];
  25 [label="25" shape=circle];
  26 [label="26" shape=circle];
  27 [label="27" shape=doublecircle style=filled color=orchid1];
  28 [label="28" shape=circle];
  29 [label="29" shape=circle];
  30 [label="30" shape=circle];
  31 [label="31" shape=doublecircle style=filled color=orchid1];
  32 [label="32" shape=doublecircle style=filled color=orangered];
  33 [label="33" shape=doublecircle style=filled color=orangered];
  34 [label="34" shape=doublecircle style=filled color=orangered];
  35 [label="35" shape=doublecircle style=filled color=orangered];
  36 [label="36" shape=doublecircle style=filled color=orangered];
  37 [label="37" shape=doublecircle style=filled color=orangered];
  38 [label="38" shape=doublecircle style=filled color=chocolate];
  39 [label="39" shape=doublecircle style=filled color=orangered];
  40 [label="40" shape=doublecircle style=filled color=dodgerblue];
  41 [label="41" shape=circle];
  42 [label="42" shape=circle];
  43 [label="43" shape=circle];
  44 [label="44" shape=circle];
  45 [label="45" shape=circle];
  46 [label="46" shape=circle];
  47 [label="47" shape=circle];
  48 [label="48" shape=circle];
  49 [label="49" shape=circle];
  50 [label="50" shape=circle];
  51 [label="51" shape=circle];
  52 [label="52" shape=circle];
  53 [label="53" shape=circle];
  54 [label="54" shape=circle];
  55 [label="55" shape=circle];
  56 [label="56" shape=circle];
  57 [label="57" shape=circle];
  58 [label="58" shape=circle];
  59 [label="59" shape=circle];
  60 [label="60" shape=circle];
  61 [label="61" shape=doublecircle style=filled color=gold];
  62 [label="62" shape=circle];
  63 [label="63" shape=circle];
  64 [label="64" shape=circle];
  65 [label="65" shape=doublecircle style=filled color=gold];
  66 [label="66" shape=doublecircle style=filled color=turquoise];
  67 [label="67" shape=circle];
  68 [label="68" shape=circle];
  69 [label="69" shape=doublecircle style=filled color=turquoise];

  start -> 0 [];

  0 -> 1 [label="\\t, SP"];
  0 -> 2 [label="\\n, \\r"];
  0 -> 3 [label="="];
  0 -> 4 [label=";"];
  0 -> 5 [label="|"];
  0 -> 6 [label="("];
  0 -> 7 [label=")"];
  0 -> 8 [label="["];
  0 -> 9 [label="]"];
  0 -> 10 [label="{"];
  0 -> 11 [label="}"];
  0 -> 14 [label="<"];
  0 -> 15 [label=">"];
  0 -> 16 [label="$"];
  0 -> 18 [label="@"];
  0 -> 32 [label="g"];
  0 -> 39 [label="[a..f], [h..z]" color=darkgreen];
  0 -> 40 [label="[A..Z]"];
  0 -> 41 [label="\""];
  0 -> 62 [label="/"];
  1 -> 1 [label="\\t, SP"];
  2 -> 2 [label="\\n, \\r"];
  10 -> 12 [label="{"];
  11 -> 13 [label="}"];
  16 -> 17 [label="[A..Z]"];
  17 -> 17 [label="[0..9], [A..Z], _"];
  18 -> 19 [label="l"];
  18 -> 23 [label="r"];
  18 -> 28 [label="n"];
  19 -> 20 [label="e"];
  20 -> 21 [label="f"];
  21 -> 22 [label="t"];
  23 -> 24 [label="i"];
  24 -> 25 [label="g"];
  25 -> 26 [label="h"];
  26 -> 27 [label="t"];
  28 -> 29 [label="o"];
  29 -> 30 [label="n"];
  30 -> 31 [label="e"];
  32 -> 33 [label="r"];
  32 -> 39 [label="[0..9], _, [a..q], [s..z]" color=darkgreen];
  33 -> 34 [label="a"];
  33 -> 39 [label="[0..9], _, [b..z]" color=darkgreen];
  34 -> 35 [label="m"];
  34 -> 39 [label="[0..9], _, [a..l], [n..z]" color=darkgreen];
  35 -> 36 [label="m"];
  35 -> 39 [label="[0..9], _, [a..l], [n..z]" color=darkgreen];
  36 -> 37 [label="a"];
  36 -> 39 [label="[0..9], _, [b..z]" color=darkgreen];
  37 -> 38 [label="r"];
  37 -> 39 [label="[0..9], _, [a..q], [s..z]" color=darkgreen];
  38 -> 39 [label="[0..9], _, [a..z]" color=darkgreen];
  39 -> 39 [label="[0..9], _, [a..z]" color=darkgreen];
  40 -> 40 [label="[0..9], [A..Z], _"];
  41 -> 41 [label="All Unicode except \\ \"" color=darkcyan];
  41 -> 42 [label="\\" color=darkorange];
  41 -> 61 [label="\"" color=darkgreen];
  42 -> 43 [label="\", ', \\, n, r, t"];
  42 -> 44 [label="x"];
  42 -> 47 [label="u"];
  42 -> 52 [label="U"];
  43 -> 41 [label="All Unicode except \\ \"" color=darkcyan];
  43 -> 42 [label="\\" color=darkorange];
  43 -> 61 [label="\"" color=darkgreen];
  44 -> 45 [label="[0..9], [A..F], [a..f]"];
  45 -> 46 [label="[0..9], [A..F], [a..f]"];
  46 -> 41 [label="All Unicode except \\ \"" color=darkcyan];
  46 -> 42 [label="\\" color=darkorange];
  46 -> 61 [label="\"" color=darkgreen];
  47 -> 48 [label="[0..9], [A..F], [a..f]"];
  48 -> 49 [label="[0..9], [A..F], [a..f]"];
  49 -> 50 [label="[0..9], [A..F], [a..f]"];
  50 -> 51 [label="[0..9], [A..F], [a..f]"];
  51 -> 41 [label="All Unicode except \\ \"" color=darkcyan];
  51 -> 42 [label="\\" color=darkorange];
  51 -> 61 [label="\"" color=darkgreen];
  52 -> 53 [label="[0..9], [A..F], [a..f]"];
  53 -> 54 [label="[0..9], [A..F], [a..f]"];
  54 -> 55 [label="[0..9], [A..F], [a..f]"];
  55 -> 56 [label="[0..9], [A..F], [a..f]"];
  56 -> 57 [label="[0..9], [A..F], [a..f]"];
  57 -> 58 [label="[0..9], [A..F], [a..f]"];
  58 -> 59 [label="[0..9], [A..F], [a..f]"];
  59 -> 60 [label="[0..9], [A..F], [a..f]"];
  60 -> 41 [label="All Unicode except \\ \"" color=darkcyan];
  60 -> 42 [label="\\" color=darkorange];
  60 -> 61 [label="\"" color=darkgreen];
  62 -> 63 [label="\\" color=darkorange];
  62 -> 64 [label="All Unicode except / \\ *" color=darkgreen];
  62 -> 66 [label="/"];
  62 -> 67 [label="*"];
  63 -> 64 [label="All Unicode" color=darkorange];
  64 -> 63 [label="\\" color=darkorange];
  64 -> 64 [label="All Unicode except / \\" color=darkgreen];
  64 -> 65 [label="/"];
  66 -> 66 [label="All Unicode except \\n \\v \\f \\r" color=darkgreen];
  67 -> 67 [label="All Unicode except *" color=darkgreen];
  67 -> 68 [label="*"];
  68 -> 67 [label="All Unicode except * /" color=darkgreen];
  68 -> 68 [label="*"];
  68 -> 69 [label="/"];
}

Input Buffer

A two-buffer scheme, explained here, is employed for implementing the EBNF lexer. The two buffers are implemented as one buffer divided into two halves.
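
A minimal sketch of one buffer divided into two halves is shown below. It is an illustration under stated assumptions, not Emerge's implementation: the class name `TwoHalfBuffer`, the `next_char` method, and the default half size are hypothetical, and sentinels and lexeme extraction are left out for brevity.

```python
import io


class TwoHalfBuffer:
    """One buffer of size 2*N used as two halves that are refilled alternately."""

    def __init__(self, source, half_size=4096):
        self.source = source              # any object with read(n), e.g. io.StringIO
        self.n = half_size
        self.buf = [""] * (2 * half_size)
        self.filled = [0, 0]              # number of valid characters in each half
        self.pos = 0                      # forward pointer into buf
        self._load(0)

    def _load(self, half):
        # Refill one half; a short or empty chunk means the input is running out.
        chunk = self.source.read(self.n)
        start = half * self.n
        self.buf[start:start + len(chunk)] = chunk
        self.filled[half] = len(chunk)

    def next_char(self):
        """Return the next character, or None at the end of input."""
        half, offset = divmod(self.pos, self.n)
        if offset >= self.filled[half]:
            return None
        ch = self.buf[self.pos]
        self.pos += 1
        if self.pos % self.n == 0:
            # Crossed a half boundary: wrap around and refill the half being entered.
            self.pos %= 2 * self.n
            self._load(self.pos // self.n)
        return ch
```

Only the half being entered is overwritten, so the half that was just read stays intact while the lexer finishes the current lexeme.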

Parser Design

The EBNF parser is implemented as a bottom-up LALR parser, ensuring efficient and deterministic parsing.

The parsing table for EBNF is generated using this algorithm based on the grammar and precedence rules defined here.

An LR parser requires the grammar to be in LR(1) form. Most grammars need only minimal transformation to reach that form, so the grammar stays close to the way the language is naturally described. Ambiguous grammars can also be handled using precedence rules.

The Emerge parser generator also produces LALR parsers for the same reasons, balancing efficiency and expressiveness.
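
How precedence rules resolve an ambiguity can be sketched with the convention used by yacc-style generators for shift/reduce conflicts. Whether Emerge follows exactly this convention is an assumption; the `Assoc` enum and `resolve_shift_reduce` function are hypothetical names for illustration.

```python
from enum import Enum


class Assoc(Enum):
    LEFT = "left"
    RIGHT = "right"
    NONE = "none"


def resolve_shift_reduce(rule_prec: int, token_prec: int, token_assoc: Assoc) -> str:
    """Pick an action for a shift/reduce conflict (higher number = binds tighter).

    rule_prec is the precedence of the production that could be reduced;
    token_prec and token_assoc belong to the lookahead token.
    """
    if rule_prec > token_prec:
        return "reduce"
    if rule_prec < token_prec:
        return "shift"
    # Equal precedence: associativity decides.
    if token_assoc is Assoc.LEFT:
        return "reduce"
    if token_assoc is Assoc.RIGHT:
        return "shift"
    return "error"  # non-associative operators may not be chained
```

For example, with `*` above `+`, the input `a + b * c` shifts `*` after `a + b`, while `a * b + c` reduces `a * b` first.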

For error handling, the panic-mode error recovery method is used due to its simplicity and adaptability to any arbitrary grammar.
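
The textbook form of panic-mode recovery for an LR parser can be sketched as follows. This is a simplified illustration, not Emerge's code: the function name, the dictionary-based GOTO table, and the `follow` map of synchronizing tokens are all assumptions.

```python
def panic_mode_recover(stack, tokens, pos, goto_table, follow):
    """Try to resynchronize an LR parser after a syntax error.

    stack:      list of parser states (mutated in place)
    tokens:     list of terminal names
    pos:        index of the token where the error was detected
    goto_table: dict mapping (state, nonterminal) -> state
    follow:     dict mapping nonterminal -> set of tokens that may follow it
    Returns the index at which parsing can resume, or None if recovery fails.
    """
    while stack:
        state = stack[-1]
        for nonterminal, follow_set in follow.items():
            target = goto_table.get((state, nonterminal))
            if target is None:
                continue
            # Discard input until a token that can legitimately follow the nonterminal.
            j = pos
            while j < len(tokens) and tokens[j] not in follow_set:
                j += 1
            if j < len(tokens):
                # Pretend the nonterminal was parsed and resume from token j.
                stack.append(target)
                return j
        # No usable GOTO entry in this state: pop it and try the one below.
        stack.pop()
    return None
```

Because the routine only needs the GOTO table and a set of synchronizing tokens, it works for any grammar without grammar-specific error productions.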
