Theory of Parsers

A parser receives a string of tokens from the lexer and verifies that the given string of tokens can be generated by the grammar of the source language. We want a parser to report any syntax errors in an actionable fashion, and to recover from common errors so that it can continue parsing the rest of the source program.

The construction of the abstract syntax tree need not be explicit, since translation actions can be interspersed with parsing.

In general, there are three classes of parsers: universal, top-down, and bottom-up.

Universal parsing techniques can parse any grammar, but they are too inefficient for production compilers. The most efficient top-down and bottom-up parsers work only for sub-classes of grammars. Several of these sub-classes, particularly the LL and LR grammars, are expressive enough to describe today's programming languages.

Hand-implemented parsers usually use LL grammars, while auto-generated parsers use the larger class of LR grammars.

                         ┌───────────────────┐                         ┌───────────────────┐
                         │     Top-Down      │                         │     Bottom-Up     │
                         └─────────┬─────────┘                         └─────────┬─────────┘
                                   │                                             │
                                   │                                             │
                         ┌─────────▼─────────┐                         ┌─────────▼─────────┐
                    ┌────┤ Recursive-Descent ├────┐                    │        LR         │
                    │    └───────────────────┘    │                    └───────────────────┘
                    │                             │
                    │                             │
          ┌─────────▼─────────┐         ┌─────────▼─────────┐
           │   Backtracking    │         │ Non-Backtracking  │
          └─────────┬─────────┘         └─────────┬─────────┘
                    │                             │
                    │                             │
                    │                   ┌─────────▼─────────┐
                    │                   │    Predictive     │
                    │                   └─────────┬─────────┘
                    │                             │
                    │                             │
                     │    ┌───────────────────┐    │
                    └────▶       LL(k)       ◀────┘
                         └───────────────────┘

Context-Free Grammars

A context-free grammar G is defined by four sets:

  1. Σ is a set of terminal symbols from which strings are formed. Terminal symbols are also referred to as tokens.
  2. V is a set of non-terminal symbols that denote sets of strings. Non-terminal symbols are sometimes called syntactic variables. Non-terminals impose a hierarchical structure on the language.
  3. R ⊆ V × (V ∪ Σ)* is a set of productions, where each production consists of a non-terminal (head), an arrow, and a sequence of terminals and/or non-terminals (body).
  4. S ∈ V is one of the non-terminal symbols, designated as the start symbol. The set of strings denoted by the start symbol is the language generated by the grammar.

In this document, we follow these conventions:

  1. Terminal symbols:
    1. Lowercase letters early in the alphabet (a, b, c).
    2. The digits 0, 1, ..., 9.
    3. Operator and punctuation symbols such as +, *, ,, (, ).
  2. Non-terminal symbols:
    1. Uppercase letters early in the alphabet (A, B, C).
    2. The letter S, which is the start symbol.
  3. Uppercase letters late in the alphabet (X, Y, Z) represent terminal or non-terminal grammar symbols.
  4. Lowercase letters late in the alphabet (u, v, ..., z) represent strings of terminals.
  5. Lowercase Greek letters (α, β, γ) represent strings of grammar symbols (terminals and non-terminals).

Context-free languages are a superset of regular languages, and they are more expressive: every regular language can also be described by a context-free grammar, but not vice versa.

Grammars are capable of describing most, but not all, of the syntax of programming languages. A few syntactic constructs cannot be specified using grammars alone.

We can say that the strings of terminals (tokens) accepted by a parser form a superset of the programming language. Hence, subsequent phases of the compiler must analyze the result of parsing to ensure compliance with the rules that cannot be enforced by the parser.

Derivations

The construction of a parse tree can be made precise by taking a derivational approach, in which productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step replaces a non-terminal by the body of one of its productions. This derivational view corresponds to the top-down construction of a parse tree. Bottom-up parsing is related to a class of derivations known as rightmost derivations, in which the rightmost non-terminal is rewritten at each step.

Let's consider a non-terminal A in the middle of a sequence of grammar symbols (αAβ), where α and β are arbitrary strings of grammar symbols. Suppose A → γ is a production. Then, we write αAβ ⇒ αγβ. The symbol ⇒ means "derives in one step", ⇒* means "derives in zero or more steps", and ⇒+ means "derives in one or more steps".

  1. α ⇒* α, for any string α.
  2. If α ⇒* β and β ⇒* γ, then α ⇒* γ.

α is a sentential form of G if S ⇒* α where S is the start symbol of grammar G. A sentential form may contain both terminals and non-terminals, and may be empty. A sentence of G is a sentential form with no non-terminals. A string of terminals ω is in L(G), the language generated by G, if and only if ω is a sentence of G (S ⇒* ω). If two grammars generate the same language, the grammars are said to be equivalent.

At each step in a derivation, there are two choices to be made: which non-terminal to rewrite, and which of its productions to apply. In a leftmost derivation, the leftmost non-terminal in each sentential form is always rewritten; in a rightmost derivation, it is always the rightmost non-terminal.

Every leftmost step can be written as ωAγ ⇒lm ωδγ, where ω consists of terminals only, A → δ is the production applied, and γ is a string of grammar symbols (terminals and non-terminals). When α derives β by a leftmost derivation, we denote it by α ⇒*lm β. If S ⇒*lm α, then we say that α is a left-sentential form of the grammar G. Analogous definitions hold for rightmost derivations.

Rightmost derivations are called canonical derivations.

Abstract Syntax Trees (Parse Trees)

In an abstract syntax tree or a parse tree, each interior node represents a production. An interior node is labeled with a non-terminal in the head of the production; the children of the node are labeled, from left to right, by the symbols in the body of the production. The leaves of a parse tree are labeled by terminals or non-terminals and, read from left to right, constitute a sentential form, called the yield or frontier of the tree.

An abstract syntax tree does not show the order in which productions are applied to replace non-terminals. Hence, there is a many-to-one relationship between derivations and parse trees.

Every AST (parse tree) is associated with one unique leftmost derivation and one unique rightmost derivation.

Ambiguous Grammars

A grammar that produces more than one syntax tree for the same string of terminals is ambiguous. In other words, an ambiguous grammar is one that produces more than one leftmost derivation or more than one rightmost derivation for the same sentence.

Consider the following ambiguous grammar.

E → E + E | E * E | - E | ( E ) | id

Now let's say we want to parse the sentence id + id * id. We can derive this sentence using a leftmost derivation in two different ways.

E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

And the corresponding syntax trees are shown below.

        ┌────E────┐              ┌────E────┐
        │    │    │              │    │    │
        E    + ┌──E──┐        ┌──E──┐ *    E
        │      │  │  │        │  │  │      │
        id     E  *  E        E  +  E     id
               │     │        │     │
               id   id        id   id

Ambiguity Elimination

Sometimes, an ambiguous grammar can be rewritten to eliminate the ambiguity.

Sometimes, it is more convenient to use carefully chosen ambiguous grammars, together with disambiguating rules that eliminate undesirable parse trees, leaving only one tree for each sentence.

Left-Recursive Grammars

A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for some string α.

In an immediate or direct left recursion, there is a production of the form A → Aα in which the leftmost symbol of the body is the same as the non-terminal at the head of the production. In an indirect left recursion, the definition of left recursion is satisfied via several derivations.

Left-Recursion Elimination

Left recursion poses problems for parsers. For a recursive-descent top-down parser, left recursion causes the parser to loop forever. Many bottom-up parsers expect the productions to be in a normal form and will not accept left-recursive productions.

Consequently, context-free grammars are often preprocessed to eliminate the left recursion.

A left recursive production can be eliminated by rewriting the offending production. Consider the following production where α and β are sequences of terminals and non-terminals that do not start with A.

A → Aα | β

We will rewrite the productions for A in the following manner using a new non-terminal R.

A → βR
R → αR | ε
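
For instance, with β = T and α = + T, this rewrite turns the left-recursive expression grammar E → E + T | T into:

E → T E′
E′ → + T E′ | ε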

In its general form, immediate left recursion can be eliminated by the following technique, which works for any arbitrary number of A-productions. Group the productions as

A → Aα₁ | Aα₂ | ... | Aαₙ | β₁ | β₂ | ... | βₘ

Where no βᵢ begins with an A. Then, replace the A-productions by

A → β₁A′ | β₂A′ | ... | βₘA′
A′ → α₁A′ | α₂A′ | ... | αₙA′ | ε

The non-terminal A generates the same strings as before but is no longer left recursive. This procedure eliminates all left recursion from the A and A′ productions, but it does not eliminate left recursion involving derivations of two or more steps, like the example below.

S → Aa | b
A → Ac | Sd | ε

This algorithm systematically eliminates left recursion from a grammar. It is guaranteed to work if the grammar has no cycles (derivations of the form A ⇒+ A) or ε-productions (productions of the form A → ε). Cycles and ε-productions can be systematically eliminated from a grammar.
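
Below is a minimal Python sketch of immediate left-recursion elimination for a single non-terminal, following the A → βᵢA′, A′ → αᵢA′ | ε scheme above. The representation is illustrative: production bodies are tuples of symbols, and the empty tuple stands for ε.

def eliminate_immediate_left_recursion(head, bodies):
    recursive = [body[1:] for body in bodies if body[:1] == (head,)]  # the αᵢ
    others = [body for body in bodies if body[:1] != (head,)]         # the βᵢ
    if not recursive:
        return {head: bodies}  # nothing to do
    new_head = head + "'"
    return {
        head: [body + (new_head,) for body in others],                 # A → βᵢA′
        new_head: [alpha + (new_head,) for alpha in recursive] + [()], # A′ → αᵢA′ | ε
    }

# E → E + T | T becomes E → T E′ and E′ → + T E′ | ε
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))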

Left Factoring

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive or top-down parsing.

When the choice between two alternative A-productions is not clear, we may be able to rewrite the productions to defer the decision until enough of the input has been seen that we can make the right choice.

In general, if A → αβ₁ | αβ₂ are two A-productions, and the input begins with a non-empty string derived from α, we do not know whether to expand A to αβ₁ or to αβ₂. However, we can defer the decision by expanding A to αA′. Then, after seeing the input derived from α, we expand A′ to β₁ or to β₂. The left-factored original productions become the following.

A → αA′
A′ → β₁ | β₂

This algorithm implements this approach for the general case.
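
Here is a Python sketch of one left-factoring step under the same illustrative representation as before (bodies as tuples of symbols, the empty tuple standing for ε); repeating it until nothing changes implements the transformation above.

def common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def left_factor_once(head, bodies):
    # find the longest prefix α shared by two or more alternatives
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            p = common_prefix(bodies[i], bodies[j])
            if len(p) > len(best):
                best = p
    if not best:
        return {head: bodies}  # no two alternatives share a prefix
    new_head = head + "'"
    factored = [b[len(best):] for b in bodies if b[:len(best)] == best]
    rest = [b for b in bodies if b[:len(best)] != best]
    return {
        head: [best + (new_head,)] + rest,  # A → αA′ | γ
        new_head: factored,                 # A′ → β₁ | β₂ | ...
    }

# The dangling-else productions S → iEtS | iEtSeS | a
# become S → iEtSS′ | a and S′ → eS | ε.
print(left_factor_once("S", [("i", "E", "t", "S"), ("i", "E", "t", "S", "e", "S"), ("a",)]))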

Top-Down Parsing

Top-down parsing can be understood as constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder. Equivalently, top-down parsing can be viewed as finding a leftmost derivation for an input string.

At each step, the key problem is determining the production to be applied for a non-terminal A. Once an A-production is chosen, the rest of the parsing process consists of matching the terminal symbols in the production body with the input string.

The class of grammars for which we can construct predictive parsers looking k symbols ahead in the input is sometimes called the LL(k) class.

Recursive-Descent Parsing

A recursive-descent parser consists of a set of procedures, one for each non-terminal symbol. Execution begins with the procedure for the start symbol, which stops successfully if its procedure body scans the entire input string.

1:  void A() {
2:    Choose an A-production, A → X₁ X₂ ... Xₖ
3:    for (i = 1 to k) {
4:      if (Xᵢ is a non-terminal) {
5:        call procedure Xᵢ();
6:      } else if (Xᵢ equals the current input symbol a) {
7:        advance the input to the next symbol;
8:      } else {
9:        /* an error has occurred */
10:     }
11:   }
12: }

General recursive-descent parsing may require backtracking, with repeated scans over the input. Backtracking parsers are not common, because backtracking is rarely needed to parse programming language constructs.

To account for backtracking in the above pseudocode, we first need to try each of several A-productions in some order at line 2. Then, failure at line 9 is not the ultimate error, and we need to return to line 2 and try another A-production. If there are no more A-productions to try, we declare that an input error has occurred. In order to try another A-production, we need to be able to reset the input pointer to where it was when we first reached line 2.

A left-recursive grammar can cause a recursive-descent parser, even one with backtracking, to go into an infinite loop.
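
The following Python sketch is a complete recursive-descent parser, without backtracking, for the illustrative grammar E → T E′, E′ → + T E′ | ε, T → id. Each procedure selects its production by peeking at a single lookahead token, so the input is never rescanned.

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def E(self):            # E → T E′
        self.T()
        self.E_prime()

    def E_prime(self):      # E′ → + T E′ | ε
        if self.peek() == "+":
            self.match("+")
            self.T()
            self.E_prime()
        # otherwise choose ε and consume nothing

    def T(self):            # T → id
        self.match("id")

    def parse(self):
        self.E()
        self.match("$")     # succeed only if the entire input was scanned

Parser(["id", "+", "id"]).parse()  # accepts; Parser(["id", "+"]).parse() raises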

FIRST and FOLLOW

FIRST and FOLLOW are associated with a grammar G. During top-down parsing, FIRST and FOLLOW help with selecting which production to use, based on the next input symbol. During panic-mode error recovery, sets of tokens produced by FOLLOW can be used as synchronizing tokens.

FIRST

Define FIRST(α), where α is any string of grammar symbols, to be the set of terminals that begin strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).

To compute FIRST(X) for all grammar symbols X, we apply the following rules until no more terminals or ε can be added to any FIRST set.

  1. If X is a terminal, then FIRST(X) = {X}.
  2. If X is a non-terminal and X → Y₁ Y₂ ... Yₖ is a production for some k ≥ 1, then place a in FIRST(X) if for some i, a is in FIRST(Yᵢ), and ε is in all of FIRST(Y₁), ..., FIRST(Yᵢ₋₁); that is, Y₁ ... Yᵢ₋₁ ⇒* ε. If ε is in FIRST(Yⱼ) for all j = 1, 2, ..., k, then add ε to FIRST(X). Note that everything in FIRST(Y₁) is surely in FIRST(X). If Y₁ does not derive ε, then we add nothing more to FIRST(X), but if Y₁ ⇒* ε, then we add FIRST(Y₂), and so on.
  3. If X → ε is a production, then add ε to FIRST(X).

We can now compute FIRST for any string X₁ X₂ ... Xₙ as follows. Add to FIRST(X₁ X₂ ... Xₙ) all non-ε symbols of FIRST(X₁). Also add the non-ε symbols of FIRST(X₂) if ε is in FIRST(X₁), the non-ε symbols of FIRST(X₃) if ε is in both FIRST(X₁) and FIRST(X₂), and so on. Finally, add ε to FIRST(X₁ X₂ ... Xₙ) if ε is in FIRST(Xᵢ) for all i.
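
The rules above translate directly into a fixed-point computation. Here is a minimal Python sketch, assuming an illustrative grammar representation: a dict mapping each non-terminal to a list of production bodies (tuples of symbols), where any symbol that is not a key is treated as a terminal and the empty tuple stands for an ε-body.

EPSILON = "ε"

def compute_first(grammar):
    first = {X: set() for X in grammar}
    changed = True
    while changed:  # apply the rules until no FIRST set grows
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                nullable_prefix = True
                for symbol in body:
                    if symbol not in grammar:          # terminal: it begins the string
                        first[head].add(symbol)
                        nullable_prefix = False
                        break
                    first[head] |= first[symbol] - {EPSILON}
                    if EPSILON not in first[symbol]:   # Yᵢ does not derive ε; stop here
                        nullable_prefix = False
                        break
                if nullable_prefix:                    # every Yᵢ derives ε (or the body is ε)
                    first[head].add(EPSILON)
                changed |= len(first[head]) > before
    return first

# Example: E → T E′, E′ → + T E′ | ε, T → id
grammar = {"E": [("T", "E'")], "E'": [("+", "T", "E'"), ()], "T": [("id",)]}
print(compute_first(grammar))  # E: {id}, E′: {+, ε}, T: {id}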

FOLLOW

Define FOLLOW(A), for non-terminal A, to be the set of terminals a that can appear immediately to the right of A in some sentential form. Strictly speaking, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ, for some α and β.

There may have been symbols between A and a at some point during the derivation, but if so, they derived ε and disappeared. Moreover, if A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A). $ is a special endmarker symbol that is assumed not to be a symbol of any grammar.

To compute FOLLOW(A) for all non-terminals A, we apply the following rules until nothing can be added to any FOLLOW set.

  1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input right endmarker.
  2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
  3. If there is a production A → αB, or a production A → αBβ, where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
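
A companion Python sketch for FOLLOW, reusing compute_first and the same illustrative grammar representation from the FIRST sketch; "$" is the endmarker, and the start symbol is passed in explicitly.

END = "$"

def first_of_string(symbols, first, grammar):
    # FIRST of a sequence of grammar symbols, per the rule for X₁ X₂ ... Xₙ
    result = set()
    for symbol in symbols:
        symbol_first = {symbol} if symbol not in grammar else first[symbol]
        result |= symbol_first - {EPSILON}
        if EPSILON not in symbol_first:
            return result
    result.add(EPSILON)  # every symbol was nullable (or the sequence is empty)
    return result

def compute_follow(grammar, start, first):
    follow = {A: set() for A in grammar}
    follow[start].add(END)  # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in grammar:  # only non-terminals have FOLLOW sets
                        continue
                    beta_first = first_of_string(body[i + 1:], first, grammar)
                    before = len(follow[B])
                    follow[B] |= beta_first - {EPSILON}  # rule 2
                    if EPSILON in beta_first:            # rule 3: β is nullable or absent
                        follow[B] |= follow[head]
                    changed |= len(follow[B]) > before
    return follow

print(compute_follow(grammar, "E", compute_first(grammar)))
# E: {$}, E′: {$}, T: {+, $}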

LL(1) Grammars and Predictive Parsers

A predictive parser is a recursive-descent parser without backtracking. It can be constructed for a class of grammars called LL(1). The first "L" in LL(1) stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step.

The class of LL(1) grammars is expressive enough to cover most programming constructs. There are a few considerations though. No left-recursive or ambiguous grammar can be LL(1).

A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:

  1. For no terminal a do both α and β derive strings beginning with a.
  2. At most one of α and β can derive the empty string ε.
  3. If β ⇒* ε then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).

The first two conditions are equivalent to the statement that FIRST(α) and FIRST(β) are disjoint sets. The third condition is saying that if ε is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is in FIRST(α).

Predictive parsers can be constructed for LL(1) grammars since the proper production to apply for a non-terminal can be selected by looking only at the current input symbol.

This algorithm collects the information from the FIRST and FOLLOW sets into a predictive parsing table M[A,a], where A is a non-terminal, and a is a terminal or the symbol $, the input endmarker. If G is left-recursive or ambiguous, then M will have at least one multiply defined entry. There are some grammars for which no amount of alteration (left-recursion elimination and left factoring) will produce an LL(1) grammar.
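
A sketch of the table construction, reusing the compute_first, compute_follow, and first_of_string sketches above. A second production landing in the same cell is exactly a multiply defined entry, i.e., evidence that the grammar is not LL(1).

def build_parsing_table(grammar, start):
    first = compute_first(grammar)
    follow = compute_follow(grammar, start, first)
    table = {}

    def enter(head, a, body):
        if (head, a) in table and table[head, a] != body:
            raise ValueError(f"not LL(1): conflict at M[{head},{a}]")
        table[head, a] = body

    for head, bodies in grammar.items():
        for body in bodies:
            body_first = first_of_string(body, first, grammar)
            for a in body_first - {EPSILON}:
                enter(head, a, body)       # rule 1
            if EPSILON in body_first:
                for b in follow[head]:     # rule 2 ($ is included via FOLLOW)
                    enter(head, b, body)
    return table

print(build_parsing_table(grammar, "E"))
# M[E,id] = T E′;  M[E′,+] = + T E′;  M[E′,$] = ε;  M[T,id] = id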

Non-Recursive Predictive Parsing

A non-recursive predictive parser can be built by maintaining a stack explicitly. The parser mimics a leftmost derivation. If w is the input that has been matched so far, then the stack holds a sequence of grammar symbols α such that S ⇒*lm wα.

A diagram for a table-driven parser is shown below. It has an input buffer, a stack containing a sequence of grammar symbols, a parsing table constructed using this algorithm, and an output stream.

                  ┌─┬─┬─┬─┬─┬─┬─┬─┐
                  │ │ │ │ │a│+│b│$│
                  └─┴─┴─┴─┘▲└─┴─┴─┘
                           │
                    ┌──────┴─────┐
             ┌─┐    │ Predictive │
             │X│◄───┤   Parser   ├───► Output
             ├─┤    └─────┬──────┘
             │Y│          │
             ├─┤    ┌─────▼──────┐
             │Z│    │  Parsing   │
             ├─┤    │  Table M   │
             │$│    └────────────┘
             └─┘

The input buffer contains the string to be parsed, followed by the endmarker $. We reuse the symbol $ to mark the bottom of the stack, which initially contains the start symbol of the grammar on top of $.

The parser takes X, the symbol on top of the stack, and a, the current input symbol. If X is a non-terminal, the parser chooses an X-production by consulting entry M[X,a]. Otherwise, it checks for a match between the terminal X and current input symbol a. This behavior is implemented using this algorithm.
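
The following compact driver, a sketch reusing the build_parsing_table example above, mirrors this behavior: it expands non-terminals by consulting M and matches terminals against the input.

def predictive_parse(tokens, table, grammar, start):
    stream = tokens + ["$"]
    stack = ["$", start]           # start symbol on top of the endmarker
    pos = 0
    while stack[-1] != "$":
        X, a = stack[-1], stream[pos]
        if X == a:                 # terminal on top of the stack matches the input
            stack.pop()
            pos += 1
        elif X not in grammar:     # terminal mismatch
            raise SyntaxError(f"expected {X!r}, found {a!r}")
        elif (X, a) not in table:  # empty table entry
            raise SyntaxError(f"unexpected {a!r} while expanding {X}")
        else:
            body = table[X, a]
            print(X, "→", " ".join(body) or "ε")  # emit the leftmost derivation
            stack.pop()
            stack.extend(reversed(body))          # push the body with Y₁ on top
    return pos == len(stream) - 1  # all input consumed

predictive_parse(["id", "+", "id"], build_parsing_table(grammar, "E"), grammar, "E")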

Bottom-Up Parsing

A bottom-up parse can be viewed as the construction of a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).

Reductions

Bottom-up parsing can be thought of as reducing a string w to the start symbol of the grammar. At each reduction step, a specific substring matching the body of a production is replaced by the non-terminal at the head of that production. The key decisions during bottom-up parsing are about when to reduce and about what production to apply.

By definition, a reduction is the reverse of a step in a derivation. Therefore, the goal of bottom-up parsing is to construct a derivation in reverse.

Handle Pruning

Bottom-up parsing during a left-to-right scan of the input constructs a rightmost derivation in reverse.

Informally, a handle is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation.

Formally, if S ⇒*rm αAw ⇒rm αβw, then production A → β in the position following α is a handle of αβw. Alternatively, a handle of a right-sentential form γ is a production A → β and a position of γ where the string β may be found, such that replacing β at that position by A produces the previous right-sentential form in a rightmost derivation of γ.

The string w to the right of the handle must contain only terminal symbols. We refer to the body β rather than A → β as a handle for convenience.

The grammar could be ambiguous with more than one rightmost derivation of αβw. If a grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle.

A rightmost derivation in reverse can be obtained by handle pruning. We start with a string of terminals w to be parsed. If w is a sentence of the grammar at hand, then let w = γₙ, where γₙ is the nth right-sentential form of some as yet unknown rightmost derivation

S = γ₀ ⇒ᵣₘ γ₁ ⇒ᵣₘ γ₂ ⇒ᵣₘ ... ⇒ᵣₘ γₙ₋₁ ⇒ᵣₘ γₙ = w

To reconstruct this derivation in reverse order, we locate the handle βₙ in γₙ and replace βₙ by the head of the relevant production Aₙ → βₙ to obtain the previous right-sentential form γₙ₋₁. If by continuing this process we produce a right-sentential form with only the start symbol S, then we halt and announce successful completion of parsing.

Shift-Reduce Parsing

Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar symbols and an input buffer holds the rest of the string to be parsed. The handle always appears at the top of the stack just before it is identified as the handle. We use $ to mark the bottom of the stack and also the right end of the input. Initially, the stack is empty, and the string w is on the input.

Stack        Input
$            w$

During a left-to-right scan of the input string, the parser shifts zero or more input symbols onto the stack, until it is ready to reduce a string β of grammar symbols on top of the stack. It then reduces β to the head of the appropriate production. The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty.

Stack        Input
$S           $

Upon entering this configuration, the parser halts and announces successful completion of parsing. The use of a stack in shift-reduce parsing is justified by the fact that the handle will always eventually appear on top of the stack, never inside.

While the primary operations are shift and reduce, there are actually four possible actions a shift-reduce parser can make:

  1. Shift: Shift the next input symbol onto the top of the stack.
  2. Reduce: The right end of the string to be reduced must be at the top of the stack. Locate the left end of the string within the stack and decide with what non-terminal to replace the string.
  3. Accept: Announce successful completion of parsing.
  4. Error: Discover a syntax error and call an error recovery routine.
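
For example, with the grammar E → E + T | T and T → id (an illustrative fragment of the usual expression grammar), a shift-reduce parser processes id + id as follows:

Stack        Input        Action
$            id + id $    shift
$ id         + id $       reduce by T → id
$ T          + id $       reduce by E → T
$ E          + id $       shift
$ E +        id $         shift
$ E + id     $            reduce by T → id
$ E + T      $            reduce by E → E + T
$ E          $            accept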

Conflicts

There are context-free grammars for which shift-reduce parsing cannot be used. Every shift-reduce parser for such a grammar can reach a configuration in which the parser, knowing the entire stack and the next k input symbols, cannot decide whether to shift or to reduce (a shift/reduce conflict), or cannot decide which of several reductions to make (a reduce/reduce conflict).

Technically speaking, these grammars are not in the LR(k) class of grammars, where the k in LR(k) refers to the number of symbols of lookahead on the input. Grammars used in compiling usually fall in the LR(1) class, with one symbol of lookahead at most.

An ambiguous grammar can never be an LR grammar. Another common setting for conflicts occurs when we know we have a handle, but the stack contents and the next input symbol are insufficient to determine which production should be used in a reduction.

LR Parsing

The most common type of bottom-up parser is the LR(k) parser. The L is for left-to-right scanning of the input, the R for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead that are used in making parsing decisions. The cases k = 0 and k = 1 are of practical interest, and we shall only consider LR parsers with k ≤ 1. When (k) is omitted, k is assumed to be 1.

LR parsers are table-driven, much like the non-recursive LL parsers. For a grammar to be LR it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack.

LR parsing is attractive for a number of reasons:

  1. LR parsers can be constructed to recognize virtually all programming language constructs for which context-free grammars can be written.
  2. The LR-parsing method is the most general non-backtracking shift-reduce parsing method known, yet it can be implemented as efficiently as other shift-reduce methods.
  3. An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
  4. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods.

For a practical grammar, it is too much work to construct an LR parser by hand and an LR parser generator is needed. Such a generator takes a context-free grammar and generates a parser for that grammar. If the grammar contains ambiguities or other constructs that cannot be parsed in a left-to-right scan of the input, then the parser generator locates these constructs and provides detailed diagnostic messages.

LR(0) and SLR Parsers

Items and The LR(0) Automaton

An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse. States represent sets of items. An LR(0) item (item for short) of a grammar G is a production of G with a dot at some position of the body. Intuitively, an item indicates how much of a production we have seen at a given point in the parsing process. The production A → ε generates only a single item, A → . .

One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the basis for constructing a deterministic finite automaton that is used to make parsing decisions. Such an automaton is called an LR(0) automaton. Each state of the LR(0) automaton represents a set of items in the canonical LR(0) collection.

To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions, CLOSURE and GOTO. If G is a grammar with start symbol S, then G′, the augmented grammar for G, is G with a new start symbol S′ and production S′ → S. The purpose of this new starting production is to indicate to the parser when it should stop parsing and announce acceptance of the input. Acceptance occurs when and only when the parser is about to reduce by S′ → S.

Closure of Item Sets

If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the two rules:

  1. Initially, add every item in I to CLOSURE(I).
  2. If A → α.Bβ is in CLOSURE(I) and B → γ is a production, then add the item B → .γ to CLOSURE(I), if it is not already there. Apply this rule until no more new items can be added to CLOSURE(I).

Intuitively, A → α.Bβ in CLOSURE(I) indicates that, at some point in the parsing process, we think we might next see a substring derivable from Bβ as input. The substring derivable from Bβ will have a prefix derivable from B by applying one of the B-productions. We therefore add items for all the B-productions; that is, if B → γ is a production, we also include B → .γ in CLOSURE(I).

A convenient way to implement the function closure is to keep a boolean array added, indexed by the non-terminals of G, such that added[B] is set to true if and when we add the item B → .γ for each B-production B → γ.

Here is the algorithm for computing CLOSURE(I).

If one B-production is added to the closure of I with the dot at the left end, then all B-productions will be similarly added to the closure. Hence, it is not necessary in some circumstances actually to list the items B → .γ added to I by CLOSURE. A list of the non-terminals B whose productions were so added will suffice.

We divide all the sets of items of interest into two classes:

  1. Kernel items: the initial item, S′ → .S, and all items whose dots are not at the left end.
  2. Non-kernel items: all items with their dots at the left end, except for S′ → .S.

Moreover, each set of items of interest is formed by taking the closure of a set of kernel items. The items added in the closure can never be kernel items. Thus, we can represent the sets of items we are really interested in with very little storage if we throw away all non-kernel items, knowing that they could be regenerated by the closure process.

The Function GOTO

The second useful function is GOTO(I,X) where I is a set of items and X is a grammar symbol. GOTO(I,X) is defined to be the closure of the set of all items [A → αX.β] such that [A → α.Xβ] is in I.

Intuitively, the GOTO function is used to define the transitions in the LR(0) automaton for a grammar. The states of the automaton correspond to sets of items, and GOTO(I,X) specifies the transition from the state for I under input X.

Now, this algorithm constructs C, the canonical collection of sets of LR(0) items for an augmented grammar G′.
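
Here is a Python sketch of CLOSURE, GOTO, and the canonical collection, representing an item as a (head, body, dot) triple with the body a tuple of symbols. The grammar format matches the earlier sketches, and all names are illustrative.

def closure(items, grammar):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in grammar:  # the dot sits before a non-terminal B
                for production in grammar[body[dot]]:
                    item = (body[dot], production, 0)     # add B → .γ
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(items, X, grammar):
    moved = {(head, body, dot + 1)
             for head, body, dot in items
             if dot < len(body) and body[dot] == X}       # advance the dot over X
    return closure(moved, grammar)

def canonical_collection(grammar, start):
    augmented = {**grammar, start + "'": [(start,)]}      # add S′ → S
    symbols = {s for bodies in augmented.values() for body in bodies for s in body}
    C = [closure({(start + "'", (start,), 0)}, augmented)]
    while True:
        added = False
        for I in list(C):
            for X in symbols:
                J = goto(I, X, augmented)
                if J and J not in C:
                    C.append(J)
                    added = True
        if not added:
            return C

states = canonical_collection({"S": [("(", "S", ")"), ("x",)]}, "S")
print(len(states))  # 6 states for the grammar S → ( S ) | x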

Use of The LR(0) Automaton

The idea behind Simple LR or SLR parsing is the construction from the grammar of the LR(0) automaton. The states of this automaton are the sets of items from the canonical LR(0) collection, and the transitions are given by the GOTO function.

The start state of the LR(0) automaton is CLOSURE({[S′ → .S]}), where S′ is the start symbol of the augmented grammar. All states are accepting states. We say "state j" to refer to the state corresponding to the set of items Iⱼ.

Assume the string γ of grammar symbols takes the LR(0) automaton from the start state 0 to some state j. Then, shift on next input symbol a if state j has a transition on a. Otherwise, we choose to reduce; the items in state j will tell us which production to use.

The LR-Parsing Algorithm

A schematic of an LR parser is shown below. It consists of an input, an output, a stack, a driver program, and a parsing table that has two parts (ACTION and GOTO). The driver program is the same for all LR parsers. Only the parsing table changes from one parser to another. The parsing program reads characters from an input buffer one at a time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each state summarizes the information contained in the stack below it.

                  ┌────┬─────┬─────┬────┬────┬───┐
                  │    │     │     │    │    │   │
           Input  │ a1 │ ... │ ai  │... │ an │ $ │
                  │    │     │     │    │    │   │
                  └────┴─────┴──▲──┴────┴────┴───┘
                                │
           Stack                │
          ┌──────┐       ┌──────┴──────┐
          │      │       │     LR      │
          │  Sm  ◄───────┤   Parsing   ├────► Output
          │      │       │   Program   │
          ├──────┤       └──┬────────┬─┘
          │      │          │        │
          │ Sm-1 │          │        │
          │      │     ┌────▼────┬───▼───┐
          ├──────┤     │         │       │
          │      │     │ ACTION  │ GOTO  │
          │ ...  │     │         │       │
          │      │     └─────────┴───────┘
          ├──────┤
          │      │
          │  $   │
          │      │
          └──────┘

The stack holds a sequence of states, s0s1...sm, where sm is on top. In the SLR method, the stack holds states from the LR(0) automaton. The canonical LR and LALR methods are similar. By construction, each state has a corresponding grammar symbol. Recall that states correspond to sets of items, and that there is a transition from state i to state j if GOTO(Ii,X) = Ij. All transitions to state j must be for the same grammar symbol X. Thus, each state, except the start state 0, has a unique grammar symbol associated with it.

Structure of The LR Parsing Table

The parsing table consists of two parts: a parsing-action function ACTION and a goto function GOTO.

  1. The ACTION function takes as arguments a state i and a terminal a (or $, the input end-marker). The value of ACTION[i,a] can have one of four forms:
    • Shift j, where j is a state. The action taken by the parser effectively shifts input a to the stack, but uses state j to represent a.
    • Reduce A → β. The action of the parser effectively reduces β on the top of the stack to head A.
    • Accept. The parser accepts the input and finishes parsing.
    • Error. The parser discovers an error in its input and takes some corrective action.
  2. We extend the GOTO function, defined on sets of items, to states. If GOTO[Ii,A] = Ij , then GOTO also maps a state i and a non-terminal A to state j.

LR-Parser Configurations

To describe the behavior of an LR parser, it helps to have a notation representing the complete state of the parser: its stack and the remaining input. A configuration of an LR parser is a pair:

(s0s1...sm, aiai+1...an$)

where the first component is the stack contents (top on the right), and the second component is the remaining input. This configuration represents the right-sentential form

X1X2...Xm aiai+1...an

in essentially the same way as a shift-reduce parser would. The only difference is that instead of grammar symbols, the stack holds states from which grammar symbols can be recovered. That is, Xi is the grammar symbol represented by state si.

The start state of the parser, s0, does not represent a grammar symbol, and serves as a bottom-of-stack marker, as well as playing an important role in the parse.

Behavior of the LR Parser

The next move of the parser from the configuration above is determined by reading ai, the current input symbol, and sm, the state on top of the stack, and then consulting the entry ACTION[sm,ai] in the parsing action table.

If ACTION[sm,ai] = shift s, the parser executes a shift move. It shifts the next state s onto the stack, entering the configuration

(s0s1...sms, ai+1...an$)

The symbol ai need not be held on the stack, since it can be recovered from s, if needed (which in practice it never is). The current input symbol is now ai+1.

If ACTION[sm,ai] = reduce A → β, then the parser executes a reduce move, entering the configuration

(s0s1...sm-rs, aiai+1...an$)

where r is the length of β, and s = GOTO[sm-r,A]. For the LR parsers we shall construct, Xm-r+1...Xm, the sequence of grammar symbols corresponding to the states popped off the stack, will always match β. The output of an LR parser is generated after a reduce move by executing the semantic action associated with the reducing production.

If ACTION[sm,ai] = accept, parsing is completed.

If ACTION[sm,ai] = error, the parser has discovered an error and calls an error recovery routine.

The above LR-parsing algorithm is summarized here.

Constructing SLR-Parsing Tables

The SLR method begins with LR(0) items and LR(0) automata. Given a grammar G, we augment G to produce G′, with a new start symbol S′. From G′, we construct C, the canonical collection of sets of items for G′ together with the GOTO function.

The ACTION and GOTO in the parsing table are constructed using this algorithm. It requires us to know FOLLOW(A) for each non-terminal A of a grammar.

The parsing table determined by this algorithm is called the SLR(1) table for G. An LR parser using the SLR(1) table for G is called the SLR(1) parser for G, and a grammar having an SLR(1) parsing table is said to be SLR(1).

Viable Prefixes

The LR(0) automaton for a grammar characterizes the input strings that can appear on the stack of a shift-reduce parser. The stack contents must be a prefix of a right-sentential form. If the stack holds α and the rest of the input is x, then a sequence of reductions will take αx to S.

S ⇒*rm αx

Since the parser must not shift past the handle, not all prefixes of right-sentential forms can appear on the stack. The prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes.

A viable prefix is a prefix of a right-sentential form that does not continue past the right end of the rightmost handle of that sentential form. It is always possible to add terminal symbols to the end of a viable prefix to obtain a right-sentential form. SLR parsing is based on the fact that LR(0) automata recognize viable prefixes.

It is a central theorem of LR-parsing theory that the set of valid items for a viable prefix γ is exactly the set of items reached from the initial state along the path labeled γ in the LR(0) automaton for the grammar.

An NFA N for recognizing viable prefixes can be constructed by treating the items themselves as states. There is a transition from A → α.Xβ to A → αX.β labeled X, and there is a transition from A → α.Bβ to B → .γ labeled ε. Then CLOSURE(I) for set of items (states of N) I is exactly the ε-closure of a set of NFA states. Thus, GOTO(I,X) gives the transition from I on symbol X in the DFA constructed from N by the subset construction.

LR(1) and LALR Parsers

We now extend the previous LR parsing techniques to use one symbol of lookahead on the input. There are two such methods:

  1. The canonical-LR or just LR method, which makes full use of the lookahead symbols. This method uses a large set of items, called the LR(1) items.
  2. The lookahead-LR or LALR method, which is based on the LR(0) sets of items, and has many fewer states than typical parsers based on the LR(1) items. By carefully introducing lookaheads into the LR(0) items, we can handle many more grammars with the LALR method than with the SLR method, and build parsing tables that are no bigger than the SLR tables. LALR is the method of choice in most situations.

Canonical LR(1) Items

TBD

Constructing LR(1) Sets of Items

Canonical LR(1) Parsing Tables

Constructing LALR Parsing Tables

Ambiguous Grammars

Error-Recovery Strategies

It is very important to plan for error handling right from the beginning, both at the language design level and the compiler design level. Common programming errors can be described using the following categories:

  1. Lexical errors, such as misspelling an identifier or a keyword.
  2. Syntactic errors, such as a missing semicolon or unbalanced parentheses.
  3. Semantic errors, such as a type mismatch between an operator and an operand.
  4. Logical errors, such as using the assignment operator = instead of the comparison operator ==.

We want a parser to report errors clearly and accurately. We also want it to recover from each error quickly enough to detect subsequent errors. If the number of errors exceeds a limit, it is better for the parser to stop rather than produce an avalanche of errors. And all of that should not significantly hurt the performance of the parser.

A few semantic errors, such as type mismatches, can be detected during parsing too. In general, accurate detection of semantic and logical errors at compile time is not easy.

Panic-Mode Recovery

In this method, when an error is encountered, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. The synchronizing tokens are well-defined symbols with a clear and unambiguous role.

Using this method, the compiler designer needs to select the synchronizing tokens. Panic-mode recovery often skips a considerable amount of input without checking it for additional errors; however, it is simple and guaranteed not to go into an infinite loop.

Phrase-Level Recovery

In the second method, the parser performs a local correction on the remaining input when it discovers an error. For example, it may replace a prefix of the remaining input by a symbol that allows the parser to continue (replacing a comma with a semicolon). The compiler designer must be cautious about the replacement, since it can lead to an infinite loop.

The major drawback of this method is the complexity when dealing with a situation in which the actual error has occurred before the point of detection.

Error Productions

Another approach is to anticipate common errors that might be encountered. We can then augment the grammar with a set of auxiliary production rules that generate the erroneous constructs. The parser can then produce appropriate error diagnostics and corrections for the erroneous constructs.

Global Correction

Ideally, we want our parser to make as few changes as possible when dealing with an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. These optimization methods are too expensive in terms of time and memory complexities.

In fact, the notion of least-cost correction provides a benchmark for evaluating other practical error-recovery techniques.

Core Algorithms

Eliminating Left Recursion

# INPUT:  Grammar G with no cycles or ε-productions.
# OUTPUT: An equivalent grammar with no left recursion.
# METHOD: Apply the algorithm below to G.
#         The resulting non-left-recursive grammar may have ε-productions.

arrange the non-terminals in some order A₁, A₂, ..., Aₙ
for (each i from 1 to n)
  for (each j from 1 to i - 1) {
    replace each production of the form Aᵢ → Aⱼγ
    by the productions Aᵢ → δ₁γ | δ₂γ | ... | δₖγ,
    where Aⱼ → δ₁ | δ₂ | ... | δₖ are all current Aⱼ-productions
  }
  eliminate the immediate left recursion among the Aᵢ-productions

Left Factoring a Grammar

# INPUT:  Grammar G.
# OUTPUT: An equivalent left-factored grammar.
# METHOD: For each non-terminal A, find the longest prefix α common to two or more of its alternatives.
#         If α ≠ ε, i.e., there is a non-trivial common prefix, replace all of the A-productions
#         A → αβ₁ | αβ₂ | ... | αβₙ | γ where γ represents all alternatives that do not begin with α by
#
#             A → αA′
#             A′ → β₁ | β₂ | ... | βₙ
#
#         Here A′ is a new non-terminal.
#         Repeatedly apply this transformation until no two alternatives for a non-terminal have a common prefix.
#

Construction of a Predictive Parsing Table

# INPUT:  Grammar G.
# OUTPUT: Parsing table M.
# METHOD: For each production A → α of the grammar, do the following:
#
#   1. For each terminal a in FIRST(α), add A → α to M[A,a].
#   2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A,b].
#      If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$] as well.
#
# If, after performing the above, there is no production at all in M[A,a], then set M[A,a] to error.
#

Table-Driven Predictive Parsing

# INPUT:  A string ω and a parsing table M for grammar G.
# OUTPUT: If ω is in L(G), a leftmost derivation of ω; otherwise, an error indication.
# METHOD: Initially, the parser is in a configuration with ω$ in the input buffer
#         and the start symbol S of G on top of the stack, above $.

let a be the first symbol of ω;
let X be the top stack symbol;
while (X != $) {  # stack is not empty
  if (X = a) {
    pop the stack and let a be the next symbol of ω;
  } else if (X is a terminal) {
    error();
  } else if (M[X,a] is an error entry) {
    error();
  } else if (M[X,a] = X → Y₁ Y₂ ... Yₖ) {
    output the production X → Y₁ Y₂ ... Yₖ;
    pop the stack;
    push Yₖ, Yₖ₋₁, ..., Y₁ onto the stack, with Y₁ on top;
  }
  let X be the top stack symbol;
}

Computation of CLOSURE

SetOfItems CLOSURE(I) {
  J = I;
  repeat
    for (each item A → α.Bβ in J )
      for (each production B → γ of G )
        if ( B → .γ is not in J )
          add B → .γ to J;
  until no more items are added to J on one round;
  return J;
}

Computation of The Canonical Collection of Sets of LR(0) Items

void items(G′) {
  C =  CLOSURE({[S′ → .S]}) ;
  repeat
    for ( each set of items I in C )
      for ( each grammar symbol X )
        if ( GOTO(I,X) is not empty and not in C )
          add GOTO(I,X) to C;
  until no new sets of items are added to C on a round;
}

LR-Parsing Algorithm

# INPUT:  An input string w and an LR-parsing table with functions ACTION and GOTO for a grammar G.
# OUTPUT: If w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error indication.
# METHOD: Initially, the parser has s₀ on its stack, where s₀ is the initial state, and w$ in the input buffer.

let a be the first symbol of w$;
while (true) {
  let s be the state on top of the stack;
  if (ACTION[s,a] = shift t) {
    push t onto the stack;
    let a be the next input symbol;
  } else if (ACTION[s,a] = reduce A → β) {
    pop |β| symbols off the stack;
    let state t now be on top of the stack;
    push GOTO[t,A] onto the stack;
    output the production A → β;
  } else if (ACTION[s,a] = accept) {
    break;
  } else {
    call error-recovery routine;
  }
}
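
To make the algorithm concrete, here is a runnable Python sketch of the driver for the toy grammar S → ( S ) | x, using a hand-built SLR(1) table (the table, the entry encoding, and the token format are all illustrative). ACTION entries are ("s", j) for shift, ("r", head, |β|) for reduce, and "acc" for accept; missing entries are errors.

ACTION = {
    (0, "("): ("s", 2), (0, "x"): ("s", 3),
    (1, "$"): "acc",
    (2, "("): ("s", 2), (2, "x"): ("s", 3),
    (3, ")"): ("r", "S", 1), (3, "$"): ("r", "S", 1),  # S → x
    (4, ")"): ("s", 5),
    (5, ")"): ("r", "S", 3), (5, "$"): ("r", "S", 3),  # S → ( S )
}
GOTO = {(0, "S"): 1, (2, "S"): 4}

def lr_parse(tokens):
    stack = [0]  # the stack holds states, not grammar symbols
    stream = tokens + ["$"]
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], stream[pos]))
        if entry is None:
            raise SyntaxError(f"unexpected {stream[pos]!r}")
        if entry == "acc":
            return True
        if entry[0] == "s":      # shift: push state j and advance the input
            stack.append(entry[1])
            pos += 1
        else:                    # reduce A → β: pop |β| states, then push GOTO[t,A]
            _, head, length = entry
            del stack[len(stack) - length:]
            stack.append(GOTO[stack[-1], head])
            print("reduce to", head)

print(lr_parse(["(", "(", "x", ")", ")"]))  # prints three reductions, then True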

Constructing an SLR-Parsing Table

# INPUT:  An augmented grammar G′.
# OUTPUT: The SLR-parsing table functions ACTION and GOTO for G′.
# METHOD:
#
#   1. Construct C = {I₀, I₁, ..., Iₙ}, the collection of sets of LR(0) items for G′.
#
#   2. State i is constructed from Iᵢ. The parsing actions for state i are determined as follows:
#
#        (a) If [A → α.aβ] is in Iᵢ and GOTO(Iᵢ,a) = Iⱼ, then set ACTION[i,a] to "shift j". Here a must be a terminal.
#        (b) If [A → α.] is in Iᵢ, then set ACTION[i,a] to "reduce A → α" for all a in FOLLOW(A). Here A may not be S′.
#        (c) If [S′ → S.] is in Iᵢ, then set ACTION[i,$] to "accept".
#
#      If any conflicting actions result from the above rules, we say the grammar is not SLR(1).
#      The algorithm fails to produce a parser in this case.
#
#   3. The goto transitions for state i are constructed for all non-terminals A using the rule:
#      If GOTO(Iᵢ,A) = Iⱼ, then GOTO[i,A] = j.
#
#   4. All entries not defined by rules (2) and (3) are made "error".
#
#   5. The initial state of the parser is the one constructed from the set of items containing [S′ → .S].
#

Implementation

More Resources