Session 10
Building the Parse Table for a Table-Driven Parser

My First Idea for a New Klein Program This Semester

After grading Module 1 yesterday, I added test programs for your scanner to the Klein collection of programs. I also updated the zip file, in case you'd like to grab the collection again.

While preparing these notes, I ran across an old blog entry by Mark Dominus, a blogger I follow. He describes the idea of "Egyptian fractions". At one point in their history, Dominus writes, the Egyptians did not have a general way of writing fractions as m/n. Instead, they wrote such fractions as the sums of unit fractions 1/k, for different values of k.

For example, they would write 3/5 as 1/2 + 1/10, which we can abbreviate as [2, 10]. 4/9 might be written as 1/3 + 1/9, or [3, 9].

My first thought upon reading this was:

I can write a Klein program to convert any fraction into an Egyptian fraction!

What better language for the task than Klein, which is very simple and deals only with integers and booleans? Dominus describes a simple greedy algorithm that is guaranteed to produce a sequence of unit fractions equal to any given fraction. I think I can implement it in Klein, though at this point all I have are ideas. (It's all just talk until the code runs!)
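For the curious, here is the greedy idea in a quick Python sketch. This is my sketch, not the eventual Klein program, but it sticks to integer arithmetic, which is all a Klein version would have to work with:

def egyptian(m, n):
    """Greedy expansion of m/n (0 < m < n) into unit-fraction
    denominators: repeatedly peel off the largest 1/k <= m/n."""
    ks = []
    while m != 0:
        k = (n + m - 1) // m        # ceiling of n/m, so 1/k <= m/n
        ks.append(k)
        m, n = m * k - n, n * k     # m/n - 1/k = (m*k - n) / (n*k)
    return ks

print(egyptian(3, 5))   # [2, 10]
print(egyptian(4, 9))   # [3, 9]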

Don't be surprised if this happens occasionally during the semester. Many years ago, I ran across a tweet from a distinguished mathematician, adapted his puzzle to the integers, and then wrote a series of programs [ 1 | 2 | 3 ] that culminated in a one-line program that uses only boolean expressions and a single recursive call. With one more tweak, that program will be tail recursive and thus amenable to a great optimization that we will learn how to implement this semester.

The course is officially underway. Please write a Klein compiler so that I can run my programs!

A Recap of FIRST and FOLLOW

Last time, we defined the FIRST and FOLLOW sets for a grammar, which I promised would be useful in building the parsing table for a table-driven parser. We need to know FIRST and FOLLOW for every non-terminal in the grammar. For terminals, we need only FIRST.

The rules might look imposing when written in formal notation, but they embody common-sense ideas:

  • A terminal a is in FIRST(A) if some string derivable from A begins with a.
  • A terminal b is in FOLLOW(A) if b can appear immediately after A in some derivation.

When ε is in a FIRST set, it adds a wrinkle to computing both FIRST and FOLLOW.

Let's get some practice with these rules before proceeding to today's material, which is the payoff: a reliable method for building a complete and sound parsing table.

Opening Exercise

Consider this context-free grammar:
statement   := declare identifier option-list
option-list := option option-list
             | ε
option      := scale
             | precision
scale       := fixed
             | float
precision   := single
             | double

Build FIRST and FOLLOW sets for the grammar.

A Solution

Using the intuitive rules, or the more formal set of rules for FIRST and FOLLOW we saw last time:

FIRST(declare)      FIRST(fixed)   FIRST(single)
FIRST(identifier)   FIRST(float)   FIRST(double)

  ... each contains just the terminal itself.  (R1)

FIRST(scale)       = { fixed, float }     (R3, P6-7)
FIRST(precision)   = { single, double }   (R3, P8-9)

FIRST(option)      = FIRST(scale)         (R3, P4)
                     ∪ FIRST(precision)   (R3, P5)
FIRST(option-list) = FIRST(option)        (R3, P2)
                     ∪ { ε }              (R2, P3)
FIRST(statement)   = { declare }          (R3, P1)

Then we use the FIRST sets to help create the FOLLOW sets:

FOLLOW(statement)   = { $ }                         (R1)
FOLLOW(option-list) = FOLLOW(statement)             (R3, P1)
FOLLOW(option)      = (FIRST(option-list) - { ε })  (R2, P2)
                      ∪ FOLLOW(option-list)         (R3, P2, P3)
FOLLOW(scale)       = FOLLOW(option)                (R3, P4)
FOLLOW(precision)   = FOLLOW(option)                (R3, P5)

How do these sets change if identifier follows the option-list in a statement?
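By the way, if you ever want to check work like this mechanically, here is a minimal Python sketch that computes FIRST and FOLLOW by iterating to a fixed point. The grammar encoding, the "eps" marker, and all the names here are my own assumptions, not part of our course code:

def first_of(seq, first, nonterminals):
    """FIRST of a sequence of symbols; 'eps' marks that it can vanish."""
    result = set()
    for sym in seq:
        f = first[sym] if sym in nonterminals else {sym}
        result |= f - {"eps"}
        if "eps" not in f:          # sym cannot vanish, so stop here
            return result
    result.add("eps")               # every symbol vanished (or seq is empty)
    return result

def compute_first_follow(grammar, start):
    nts = {lhs for lhs, _ in grammar}
    first = {n: set() for n in nts}
    follow = {n: set() for n in nts}
    follow[start].add("$")
    changed = True
    while changed:                  # iterate until neither set grows
        changed = False
        for lhs, rhs in grammar:
            f = first_of(rhs, first, nts)
            if not f <= first[lhs]:
                first[lhs] |= f
                changed = True
            for i, sym in enumerate(rhs):
                if sym not in nts:
                    continue
                trailer = first_of(rhs[i + 1:], first, nts)
                new = trailer - {"eps"}
                if "eps" in trailer:        # rest can vanish: add FOLLOW(lhs)
                    new |= follow[lhs]
                if not new <= follow[sym]:
                    follow[sym] |= new
                    changed = True
    return first, follow

GRAMMAR = [
    ("statement",   ["declare", "identifier", "option-list"]),   # P1
    ("option-list", ["option", "option-list"]),                  # P2
    ("option-list", []),                                         # P3 (epsilon)
    ("option",      ["scale"]),                                  # P4
    ("option",      ["precision"]),                              # P5
    ("scale",       ["fixed"]),                                  # P6
    ("scale",       ["float"]),                                  # P7
    ("precision",   ["single"]),                                 # P8
    ("precision",   ["double"]),                                 # P9
]
first, follow = compute_first_follow(GRAMMAR, "statement")
print(first["option-list"])   # {'fixed', 'float', 'single', 'double', 'eps'}, in some order

The iteration stops when a full pass adds nothing to any set, which is exactly the fixed point that the formal rules describe.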

Sub-Goal Accomplished: Back to the Main Goal

Table-driven parsing shifts the knowledge needed to parse a grammar out of procedures that recognize each non-terminal and into data: entries in a table that a common algorithm processes. Not surprisingly, then, the key to building a table-driven parser lies in constructing a suitable parsing table.

Last time, we began to formalize this process in terms of two sets, FIRST and FOLLOW, that can be computed for the symbols in a grammar.

As a running example, we have been working with this simple grammar for arithmetic expressions:

expression  := term expression'
expression' := + expression
             | ε
term        := factor term'
term'       := * term
             | ε
factor      := ( expression )
             | identifier

Last time, we built FIRST and FOLLOW sets for this grammar:

FIRST(factor)       = { identifier, ( }
FIRST(term')        = { *, ε }
FIRST(term)         = { identifier, ( }
FIRST(expression')  = { +, ε }
FIRST(expression)   = { identifier, ( }

FOLLOW(expression)  = { ), $ }
FOLLOW(expression') = { ), $ }
FOLLOW(term)        = { +, ), $ }
FOLLOW(term')       = { +, ), $ }
FOLLOW(factor)      = { *, +, ), $ }

Now, let's learn an algorithm that uses these sets to generate a parsing table for use by a table-driven parser.

How to Build a Parsing Table with FIRST and FOLLOW Sets

Like the definitions of FIRST and FOLLOW themselves, the idea for using them to build a parse table is straightforward enough, with one special case.

Suppose that A := α is a rule in our grammar and that terminal a is in FIRST(α). Then the parser should expand A with α whenever the next token is a. The one special case arises when ε is derivable from α, for instance when α is itself ε. In that case, the parser should also expand A with α whenever the next token b is in FOLLOW(A). This holds even when the next token is $, the marker for the end of the token stream.

So, given a grammar, we can produce a parsing table M as follows:

  1. For each production A := α, do the following.
    • For each terminal a in FIRST(α), add A := α to table entry M[A, a].
    • If ε is in FIRST(α):
      For each terminal b in FOLLOW(A), add A := α to table entry M[A, b].
  2. Mark any table entry M[A, a] that contains no expansion rule as an error.

That's it! The algorithm for building the table is much simpler than the algorithms for building the FIRST and FOLLOW sets. Our effort to build the sets has paid off.
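To see just how simple, here is a minimal Python sketch of the table-building step. It continues the encoding assumed in the earlier FIRST/FOLLOW sketch and reuses its first_of helper; again, these conventions are mine, not course code:

def build_table(grammar, first, follow):
    """M[(A, a)] = production index; keys left out are the error entries."""
    nts = {lhs for lhs, _ in grammar}
    table = {}
    for idx, (lhs, rhs) in enumerate(grammar, start=1):
        f = first_of(rhs, first, nts)
        for a in f - {"eps"}:           # next token is in FIRST(alpha)
            table[(lhs, a)] = idx
        if "eps" in f:                  # alpha can vanish: use FOLLOW(A)
            for b in follow[lhs]:
                table[(lhs, b)] = idx
    return table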

Using the Algorithm to Build a Parsing Table

Let's use this algorithm to build a parsing table for our running example. We can process the production rules in order:

  1. (Rule 1) expression := term expression'
    • M[ expression, identifier ] = Rule 1
    • M[ expression, ( ] = Rule 1
  2. (Rule 2) expression' := + expression
    • M[ expression', + ] = Rule 2
  3. (Rule 3) expression' := ε
    • M[ expression', ) ] = Rule 3
    • M[ expression', $ ] = Rule 3
  4. (Rule 4) term := factor term'
    • M[ term, identifier] = Rule 4
    • M[ term, ( ] = Rule 4
  5. (Rule 5) term' := * term
    • M[ term', * ] = Rule 5
  6. (Rule 6) term' := ε
    • M[ term', + ] = Rule 6
    • M[ term', ) ] = Rule 6
    • M[ term', $ ] = Rule 6
  7. (Rule 7) factor := ( expression )
    • M[ factor, ( ] = Rule 7
  8. (Rule 8) factor := identifier
    • M[ factor, identifier ] = Rule 8
  9. All other entries in M[A, a] are errors.

Compare our new table with the parsing table I gave last time. They are identical!

This process takes some effort, but it has a huge upside. With care, it builds a complete and sound parsing table for any suitable context-free grammar. And the algorithm that uses the table to parse an expression is simple, fast, and thrifty in space.
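As a sanity check, the two sketches from earlier reproduce this table, under the same assumed encoding:

EXPR_GRAMMAR = [
    ("expression",  ["term", "expression'"]),      # Rule 1
    ("expression'", ["+", "expression"]),          # Rule 2
    ("expression'", []),                           # Rule 3 (epsilon)
    ("term",        ["factor", "term'"]),          # Rule 4
    ("term'",       ["*", "term"]),                # Rule 5
    ("term'",       []),                           # Rule 6 (epsilon)
    ("factor",      ["(", "expression", ")"]),     # Rule 7
    ("factor",      ["identifier"]),               # Rule 8
]

first, follow = compute_first_follow(EXPR_GRAMMAR, "expression")
table = build_table(EXPR_GRAMMAR, first, follow)

assert table[("expression", "identifier")] == 1
assert table[("expression'", ")")] == 3
assert table[("term'", "$")] == 6
assert ("factor", "+") not in table    # an error entry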

Follow-Up Exercise

Build a parsing table for the grammar from our opening exercise, using the FIRST and FOLLOW sets we built then and the algorithm for building a parsing table we just learned.

The process is straightforward, but it requires attention to detail.

After you've built the table yourself, check it against the solution. How does your table compare?

Our Algorithm and LL(1) Parsers

The parsing table for our little expression language has a particular feature that makes the table-driven parsing algorithm work: each cell in the table contains at most one entry. (Remember, the blank cells are error entries.) This means that the algorithm can make a deterministic choice at each step by looking ahead just one token.

LL(1) Grammars

Donald Knuth categorized grammars and parsing techniques using three characteristics:

  • the order in which the parser reads the input stream,
  • the kind of derivation that the parser follows, and
  • the number of tokens that the parser must look ahead to decide what to do next.

These three pieces of information tell us a lot about how a parser or parsing algorithm works.

By this standard, the table-driven parser for our expression grammar is LL(1):

  • the input is scanned from Left to right,
  • the algorithm produces a Leftmost derivation, and
  • the parser needs only 1 token of lookahead.

We call any grammar for which we can build an LL(1) parser an LL(1) grammar.

The algorithm for building parsing tables that we have just learned guarantees a complete and sound table for any LL(1) grammar. This is one of the theorems we can prove about LL(1) grammars and the algorithm.

Recall that complete means the algorithm generates every entry that belongs in the table, and sound means every entry generated by the algorithm belongs in the table.

For a grammar to be LL(1), it must of course be unambiguous. It also cannot be left-recursive or in need of left-factoring. In addition, it must satisfy these conditions:

If A := α | β are productions, then
  • FIRST(α) and FIRST(β) are disjoint. There is no terminal a that is in both, nor is ε in both.
  • If ε is derivable from α, then FIRST(β) and FOLLOW(A) are disjoint.

If a grammar is not LL(1), then our table-building algorithm will create cells that contain multiple entries. In that case, the parser needs more than one token of lookahead to know which expansion to use. If k tokens of lookahead suffice to decide, we call the grammar LL(k) for some k > 1. There are algorithms capable of parsing LL(k) grammars, but they are less efficient than LL(1) algorithms; they must either look ahead farther or backtrack when they guess wrong.

The Case of the Dangling Else

For some non-LL(1) grammars, we can make arbitrary choices that result in a deterministic parser. Often, these choices match what programmers expect anyway and can be codified in the non-BNF part of a language specification.

Consider the case of the dangling else in this grammar fragment:

statement  := if expression then statement statement'
            | a
statement' := else statement
            | ε

This sort of grammar results from left-factoring a grammar for a language in which the else clause is optional on an if statement. Abbreviating statement as S, statement' as S', and expression as E, the algorithm generates these table entries:

M[ S, a ]      = S := a
M[ S, if ]     = S := if E then S S'
M[ S', else ]  = S' := else S   and   S' := ε
M[ S', $ ]     = S' := ε

All other entries are errors.
M[ S', else ] contains both S' := else S (because FIRST(S') contains else) and S' := ε (because FOLLOW(S') contains else). With two rules in a single cell of the table, the grammar is ambiguous. The parser has two choices when trying to match an S' to a token stream beginning with else.

There is no way to rewrite this grammar so that it is LL(1). What can we do?

We can resolve this ambiguity by deciding always to choose S' := else S when in this state. This decision means that we will associate an else clause with the closest open then clause that precedes it. In grammars like this one, choosing S' := ε is almost certainly wrong, as there is no other way to put an else on the stack or consume it from the input stream.
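In a table builder like the earlier sketch, this policy is a small special case. Here is one hypothetical way to encode it, preferring the production that consumes input:

def add_entry(table, grammar, key, idx):
    """Add production idx at table[key]; on a dangling-else style
    conflict, prefer the rule that consumes input over the epsilon rule."""
    if key not in table:
        table[key] = idx
    elif not grammar[idx - 1][1]:          # new rule is epsilon: keep the old one
        pass
    elif not grammar[table[key] - 1][1]:   # old rule was epsilon: replace it
        table[key] = idx
    else:
        raise ValueError(f"unresolvable conflict at {key}")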

That's why every programming language you know matches an else clause with the nearest open then. It makes the programmer's life easier, but it also simplifies the parser's job!

The Limits of LL(1) Grammars

This is an example of a grammar that cannot be written in LL(1) form. Sometimes, we can resolve an ambiguity by making an arbitrary choice at each point of ambiguity in the parsing table. For a small number of conflicts of a certain kind, this strategy suffices. But it requires human intervention and thus isn't amenable to automation.

Fortunately, though, many grammars with this sort of ambiguity can be written in a way that can be processed efficiently, just not in an LL(1) way. Which feature of the technique should we change?

It turns out that every practical parsing technique scans its input stream from left to right, and that every practical parsing technique requires a lookahead of just 1. But some practical techniques create rightmost derivations (in reverse). In Knuth's notation, we characterize these parsers as LR(1). These are the techniques that parse the token stream bottom-up.

The set of LL(1) grammars is a proper subset of the set of LR(1) grammars. This means that many grammars which cannot be written in LL(1) form can be written in LR(1) form.

Implementing a Table-Driven Parser

For the last few sessions, we have been learning how to construct a table-driven parser. This style of parser consists of a grammar-neutral algorithm operating over a parsing table that encodes the grammar of a particular language. During execution, the parser uses a stack of grammar symbols yet to be matched. How are we to implement these ideas in code?

The algorithm itself is straightforward enough. The stack is a standard data structure, and tables can be implemented using arrays or maps, which are also standard data structures. So perhaps the data structures should be straightforward, too.
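To make that concrete, here is a minimal sketch of the driver loop, assuming the grammar and table encodings from the earlier sketches; your team's representations will surely differ:

def parse(tokens, table, grammar, start):
    """Grammar-neutral LL(1) driver: a stack of pending symbols
    plus one token of lookahead."""
    nonterminals = {lhs for lhs, _ in grammar}
    stream = iter(list(tokens) + ["$"])    # append end-of-input marker
    lookahead = next(stream)
    stack = ["$", start]
    while stack:
        top = stack.pop()
        if top in nonterminals:
            key = (top, lookahead)
            if key not in table:           # blank cell: an error entry
                raise SyntaxError(f"no rule for {top} on {lookahead!r}")
            _, rhs = grammar[table[key] - 1]
            stack.extend(reversed(rhs))    # expand; epsilon pushes nothing
        elif top == lookahead:             # match a terminal (or the final $)
            lookahead = next(stream, None)
        else:
            raise SyntaxError(f"expected {top!r}, saw {lookahead!r}")

# e.g., parse(["identifier", "+", "identifier"], table, EXPR_GRAMMAR, "expression")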

Bundled up in building those data structures, though, is the question of how to represent grammar symbols. Then there is the practical issue of initializing a large table, either from a file or in code.

If we hope to create clear code that is easy to work with, implementing the parsing table requires some thought.

Quick Discussion:
  • How are you representing grammar symbols in your parser?
  • How are you representing a grammar rule in your parser?
  • What alternatives have you considered?

Beyond representing symbols and rules, you also have to answer questions about how to represent and initialize your parsing table. For instance, how might you represent table entries in a data file, so that you could load them into your program?
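One hypothetical encoding, purely for illustration: one entry per line, giving a non-terminal, a terminal, and a rule number. A loader is then only a few lines:

# parse-table.txt (a hypothetical data file):
#   expression   identifier   1
#   expression   (            1
#   expression'  +            2

def load_table(path):
    """Read (nonterminal, terminal) -> rule-number entries from a text
    file with one whitespace-separated triple per line."""
    table = {}
    with open(path) as f:
        for line in f:
            if line.strip() and not line.lstrip().startswith("#"):
                nt, t, idx = line.split()
                table[(nt, t)] = int(idx)
    return table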

I strongly encourage you and your team to think through the issues and to design a parser you all understand well and are comfortable with. You will be spending a lot of time inside the parser code for the next few weeks, and you'll need to be able to maintain it for the rest of the project.

Notes on Your Project

Module 1 Feedback

Good work by all. The basic machinery of each team's scanner works pretty well, and the rest of the submissions were in pretty good shape. A couple of the Klein programs you submitted contained syntax errors, but that is natural as we learn the language. Soon, your own parsers will find these bugs for us!

Module 2 Status Check

Don't forget that the status check for Module 2 is due sometime tomorrow.

Refactoring the Klein Grammar

One of your tasks for Module 2 is to refactor the Klein grammar so that we can build a parsing table for it.

Most of us have eliminated left recursion to create rules such as:

EXPRESSION ::= SIMPLE-EXPRESSION
EXPRESSION ::= SIMPLE-EXPRESSION "==" EXPRESSION
EXPRESSION ::= SIMPLE-EXPRESSION "<" EXPRESSION

... and then left-factored items with several possible endings to:

EXPRESSION ::= SIMPLE-EXPRESSION EXP-REST
EXP-REST   ::= ε
             | "==" EXPRESSION
             | "<" EXPRESSION

These rules work fine for this module. If you would like to save yourself a bit of work on the next module, though, I suggest this for EXP-REST and other x-REST symbols:

EXPRESSION ::= SIMPLE-EXPRESSION EXP-REST
EXP-REST   ::= ε
             | "==" SIMPLE-EXPRESSION EXP-REST
             | "<" SIMPLE-EXPRESSION EXP-REST

All I did was replace EXPRESSION with SIMPLE-EXPRESSION EXP-REST, which is what it expands to. The parser would make this move at run time anyway; we are just accelerating the process by inlining the derivation.

This refactoring doesn't change the FIRST sets or the FOLLOW sets. However, it will make it easier for us to do what we need to do on the next part of the project.
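If you want to convince yourself, the fixed-point sketch from earlier can check the claim on this fragment, treating SIMPLE-EXPRESSION as an opaque terminal (again, my own encoding):

BEFORE = [
    ("EXPRESSION", ["SIMPLE-EXPRESSION", "EXP-REST"]),
    ("EXP-REST",   []),
    ("EXP-REST",   ["==", "EXPRESSION"]),
    ("EXP-REST",   ["<", "EXPRESSION"]),
]
AFTER = [
    ("EXPRESSION", ["SIMPLE-EXPRESSION", "EXP-REST"]),
    ("EXP-REST",   []),
    ("EXP-REST",   ["==", "SIMPLE-EXPRESSION", "EXP-REST"]),
    ("EXP-REST",   ["<", "SIMPLE-EXPRESSION", "EXP-REST"]),
]
assert compute_first_follow(BEFORE, "EXPRESSION") == \
       compute_first_follow(AFTER, "EXPRESSION")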