Session 9
Table-Driven Parsing
Opening Exercise
Last time, we learned that a recursive-descent parser includes one function for each non-terminal in the grammar.
```
statement  := repeat statement until expression
            | print identifier
            | identifier ← expression

expression := identifier = expression
            | zero? expression
            | number
```
statement and expression are the
only non-terminals. The other symbols are all terminal
tokens.
Write either function for this grammar.
The idea of the exercise is to get the structure of the parser
right, so make any reasonable assumptions about tokens and the
scanner that you want. In particular, you may assume that we
have already defined a procedure named
match(token) that takes a token as an
argument. It reads the next token from the scanner, returning
true when that token matches its argument, and false otherwise:
```java
public boolean match( Token expected )
{
    Token next = scanner.next();
    return ( expected == next );
}
```
Building a Solution
We can write a predictive parser for this grammar because all of its choices are deterministic and keyed by a token. Here is a quick outline in Java-like pseudocode:
```java
public boolean statement()
{
    switch ( scanner.peek() )
    {
        Token.repeat:     ...
        Token.print:      ...
        Token.identifier: ...
        default:          return false;
    }
}

public boolean expression()
{
    switch ( scanner.peek() )
    {
        Token.identifier:    ...
        Token.zeroPredicate: ...
        Token.number:        ...
        default:             return false;
    }
}
```
To implement each case, we call match() and these
two functions according to the corresponding grammar rule.
For example, in expression(), to recognize a
number all we need to do is match the next token:
```java
Token.number :
    return match(Token.number);
```
The zero predicate case has to match a token and a sub-expression:
```java
Token.zeroPredicate :
    return match(Token.zeroPredicate) && expression();
```
The repeat case in statement() has the most parts
to test:
```java
Token.repeat :
    return match(Token.repeat) && statement() &&
           match(Token.until) && expression();
```
Bonus question: What would we have to do if we added a stand-alone identifier as a type of expression? (Dad-joke hint: Something gets "left" behind...) What effect would this have on our parser?
Turning this into
a legal Java parser
is not too difficult. Had I defined my tokens using a Java
enum instead of private instances, I could even
have used a Java switch expression.
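For example, with tokens defined as an enum, statement() might reduce to a sketch like this. The enum constants (REPEAT, UNTIL, PRINT, IDENTIFIER, ASSIGN for ←) are assumptions for illustration, not the names used in the archive:

```java
// A sketch of statement() using an enum Token and a Java switch expression.
// The Token constants here are assumed, not the archive's actual names.
public boolean statement() {
    return switch ( scanner.peek() ) {
        case REPEAT     -> match(Token.REPEAT) && statement()
                               && match(Token.UNTIL) && expression();
        case PRINT      -> match(Token.PRINT) && match(Token.IDENTIFIER);
        case IDENTIFIER -> match(Token.IDENTIFIER) && match(Token.ASSIGN)
                               && expression();
        default         -> false;     // no rule begins with this token
    };
}
```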
Today's code archive contains the parser, along with a mock scanner and token classes for testing it, and a driver.
Recap: Strategies for Parsing
We are discussing the syntax analysis phase of a compiler. Last time, we discussed how parsers can work either top-down from an initial goal of the start symbol of a grammar or bottom-up from sequences of terminal tokens.
If the grammar allows two ways to match a non-terminal that cannot be distinguished by their prefixes, then a top-down parser will have to backtrack. Often, though, we can build a predictive parser that never has to backtrack, by refactoring the grammar so that every choice can be decided by looking at the next token.
One class of top-down parsers can be implemented using a technique called recursive descent, in which the parser calls a procedure for matching each non-terminal it needs to recognize. These procedures will call other such procedures, descending through a tree of goals down to terminal matches.
We can implement predictive parsers in a straightforward way using a collection of finite-state machines for recognizing non-terminals. Recursive descent does this by using the implementing language's procedure call as a way to save the state of the calling machine. This makes the code easy to write but less efficient at run-time than we often desire.
We can also build a predictive parser that does not need to use recursion by implementing and maintaining our own stack of states. As is often the case, careful construction of a pushdown automaton by hand can result in a more efficient parser than results from an implicit automaton generated by recursion.
We ended last session by introducing the idea of a table-driven parser, which uses a table to record which production to apply in any given state, and a parsing algorithm. This algorithm requires no recursion or even any function calls to recognize an input stream. It grows and shrinks its stack on its own and loops through the table, which encodes the grammatical knowledge that is distributed across the separate procedures of a recursive-descent parser.
An Example of Table-Driven Parsing
Consider this simple grammar for arithmetic expressions:
```
expression := term + expression
            | term

term       := factor * term
            | factor

factor     := ( expression )
            | identifier
```
First, notice how this grammar encodes the common order of
precedence for the operators (),
*, and +.
It nests a higher-precedence operation deeper in the grammar
with a new non-terminal. A factor must be recognized before
the term that contains it, which ensures that the parentheses
bind before the multiplication. Likewise, multiplication will
bind before addition.
The Klein grammar does something similar, with more levels of precedence and a larger set of expression types.
This grammar matches how we like to write expressions, but its common prefixes create problems for a predictive parser: both expression rules begin with a term, and both term rules begin with a factor. So we left-factor it to create this equivalent grammar:
```
expression  := term expression'
expression' := + expression
             | ε

term        := factor term'
term'       := * term
             | ε

factor      := ( expression )
             | identifier
```
A parsing table for this grammar must identify which rule to
apply for each (non-terminal, token) combination. When
processing a program, the parser will eventually reach the end
of the token stream, which we indicate with the pseudo-token
$.
Soon, we will learn how to construct such a parsing table for any grammar, but for now here is a parsing table for the expression grammar:
|    | identifier      | +         | *         | (          | )        | $        |
|----|-----------------|-----------|-----------|------------|----------|----------|
| E  | E := TE'        |           |           | E := TE'   |          |          |
| E' |                 | E' := +E  |           |            | E' := ε  | E' := ε  |
| T  | T := FT'        |           |           | T := FT'   |          |          |
| T' |                 | T' := ε   | T' := *T  |            | T' := ε  | T' := ε  |
| F  | F := identifier |           |           | F := ( E ) |          |          |
Each cell has at most one entry, so the table is predictive. For each combination of non-terminal and token, the parser has exactly one grammar rule to apply. Any cell without an entry indicates a parsing error, because there is no grammar rule that corresponds to its combination of non-terminal and token.
Now let's trace our table-driven algorithm as it recognizes this token stream:
```
identifier + identifier * identifier $
```
| Stack           | Input                                 | Rule matched    |
|-----------------|---------------------------------------|-----------------|
| $E              | identifier + identifier * identifier $ | E := TE'        |
| $E'T            | identifier + identifier * identifier $ | T := FT'        |
| $E'T'F          | identifier + identifier * identifier $ | F := identifier |
| $E'T'identifier | identifier + identifier * identifier $ |                 |
| $E'T'           | + identifier * identifier $           | T' := ε         |
| $E'             | + identifier * identifier $           | E' := +E        |
| $E+             | + identifier * identifier $           |                 |
| $E              | identifier * identifier $             | E := TE'        |
| $E'T            | identifier * identifier $             | T := FT'        |
| $E'T'F          | identifier * identifier $             | F := identifier |
| $E'T'identifier | identifier * identifier $             |                 |
| $E'T'           | * identifier $                        | T' := *T        |
| $E'T*           | * identifier $                        |                 |
| $E'T            | identifier $                          | T := FT'        |
| $E'T'F          | identifier $                          | F := identifier |
| $E'T'identifier | identifier $                          |                 |
| $E'T'           | $                                     | T' := ε         |
| $E'             | $                                     | E' := ε         |
| $               | $                                     |                 |
Notice that the algorithm follows a left derivation of the input. Top-down parsers generally follow left derivations, because they try to match a non-terminal to the leftmost token in the stream.
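The loop that drives this trace is short. Here is a minimal, self-contained sketch in Java. It is not the code from today's archive: the string-based symbols, the table layout, and the helper names are illustrative assumptions. An ε rule is encoded as an empty right-hand side, so applying it pushes nothing.

```java
import java.util.*;

// A sketch of the table-driven parsing loop for the expression grammar.
// Uppercase strings are non-terminals; the input token list must end with "$".
public class TableDrivenParser {
    static final Map<String, Map<String, String[]>> TABLE = new HashMap<>();

    static void entry(String nonTerminal, String token, String... rhs) {
        TABLE.computeIfAbsent(nonTerminal, k -> new HashMap<>()).put(token, rhs);
    }

    static {   // the parsing table from above; an empty rhs encodes an ε rule
        entry("E",  "id", "T", "E'");  entry("E",  "(", "T", "E'");
        entry("E'", "+",  "+", "E");   entry("E'", ")"); entry("E'", "$");
        entry("T",  "id", "F", "T'");  entry("T",  "(", "F", "T'");
        entry("T'", "*",  "*", "T");   entry("T'", "+"); entry("T'", ")"); entry("T'", "$");
        entry("F",  "id", "id");       entry("F",  "(", "(", "E", ")");
    }

    static boolean parse(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");                              // end marker
        stack.push("E");                              // start symbol
        int pos = 0;
        while (!stack.isEmpty()) {
            String top  = stack.pop();
            String next = tokens.get(pos);
            if (!TABLE.containsKey(top)) {            // a terminal (or $): match it
                if (!top.equals(next)) return false;
                pos++;
            } else {                                  // a non-terminal: consult the table
                String[] rhs = TABLE.get(top).get(next);
                if (rhs == null) return false;        // empty cell: syntax error
                for (int i = rhs.length - 1; i >= 0; i--)
                    stack.push(rhs[i]);               // push the RHS in reverse order
            }
        }
        return pos == tokens.size();                  // consumed all input, including $
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("id", "+", "id", "*", "id", "$")));  // true
        System.out.println(parse(List.of("id", "+", "*", "$")));              // false
    }
}
```

Notice that there is no recursion and no grammar-specific control flow: all of the grammatical knowledge lives in TABLE.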
Parsing with a Table
Table-driven parsing shifts the knowledge for parsing a grammar out of the procedures that recognize each non-terminal and into data entries in a table, which are then processed by a single algorithm. Not surprisingly, then, the key to building a table-driven parser lies in constructing a suitable parsing table.
This is a common theme that runs through the undergraduate CS curriculum, from the trade-off between procedure-based and data-based implementations in Programming Languages to techniques such as the strategy design pattern in object-oriented programming. It also plays an important role in knowledge representation, a core area of artificial intelligence, and is fundamental to the idea of an inference engine. This is one of the Big Ideas of computer science.
We can build the parsing table for simple grammars like
the one above
simply by working through the grammar carefully. But how do we
know that the table is right? Looking at my table, you may not
see immediately why we have all the table entries we do. For
example, how did I know to have entries in row E
for the identifier and
( tokens and
none for the others?
In this case, the answer is that...

- all expressions begin with terms,
- all terms begin with factors, and
- all factors either are identifiers or begin with (.
Or, perhaps more perplexing, how did I know to put entries for
E' := ε in the columns for
)
and $
and not the others?
Reasoning in this way about a language as large as Klein would be error-prone indeed! What can we do instead?
If we want to build a parsing table for a more complex grammar, a formal set of rules for the process would help us to avoid errors and to ensure completeness. So, before we go any further, let's define such rules to guide our reasoning.
The FIRST and FOLLOW Functions
We can formalize the process of building a parsing table in terms of two functions, FIRST and FOLLOW. First, we compute these functions for a grammar. Then, we use the functions to generate a complete and correct parsing table — mechanically.
Let's try to understand the basic idea behind FIRST and FOLLOW sets before moving on to algorithms for building them.
FIRST
Let s be any string of grammar symbols. FIRST(s) is the set of terminals that begin strings derived from s. If we can derive ε from s, then ε is in FIRST(s), too.
Consider the grammar from our table-driven example above. FIRST(() = { ( }, because terminals cannot be expanded into anything but themselves.
Then consider the non-terminal factor. There are two production rules for factor, one starting with identifier and the other with (. This means that FIRST(factor) = { identifier, ( }.
What about non-terminals defined by rules starting with other non-terminals, such as term and expression? As we reasoned above, all expressions begin with terms, and terms begin with factors. That means FIRST(expression) = FIRST(term) = FIRST(factor) = { identifier, ( }.
Quick Exercise: What is FIRST(expression')?
The presence of ε rules creates a problem for our parser.
If it is looking for an expression' and sees a
+, then it can use the
expression' := +E rule. But when can it
use the ε rule? Not every grammar symbol is legal when
trying to find an expression'! This is why we need
FOLLOW sets, too.
FOLLOW
Let A be any non-terminal. FOLLOW(A) is the set of terminals that can appear immediately to the right of A in a derivation. If we can derive ε from the symbols that separate A and some terminal a, then a is in FOLLOW(A), too.
Again, consider the same grammar. What can follow a term? Well...
- Productions 1 and 2 show that a term can be followed by a +.
- Productions 1, 3, and 7 show that a term can be followed by a ), when it ends a parenthesized expression.
- Productions 1 and 3 show that a term can be followed by a $, when it ends a top-level expression.
So FOLLOW(term) = { +, ), $ }.
Quick Exercise: What is FOLLOW(factor)?
Hint: A factor can end a term, so its FOLLOW set must at least contain all the symbols in FOLLOW(term)!
~~~~~
As we will see soon, these two sets can help us build a parsing table in a reliable, repeatable way. But we don't want to have to build the sets in such an ad hoc fashion, either. So, let's go one more step and formalize the construction of the sets.
Building the FIRST Sets
To compute FIRST sets, use these rules. For each grammar symbol X:

1. If X is a terminal, then FIRST(X) = { X }.
2. If there is a rule X := ε, then add ε to FIRST(X).
3. If there is a rule X := Y1 Y2 ... Yn, then add all the non-ε symbols from FIRST(Y1) to FIRST(X).
   - If FIRST(Y1) contains ε, then also add all the non-ε symbols from FIRST(Y2) to FIRST(X).
   - Likewise, if FIRST(Y2) also contains ε, add all the non-ε symbols from FIRST(Y3) to FIRST(X).
   - Continue in this way until you find a Yi whose FIRST set does not contain ε. If no such Yi exists, add ε to FIRST(X).
We can use these rules to generate a complete listing of FIRST sets by proceeding through all of the production rules, recursing whenever we need the FIRST set of another symbol. Rule 3 tells us that we should build the sets bottom-up, starting with the lowest-level non-terminals in the grammar.
For our example grammar, we might build the FIRST sets in this order:

- FIRST(factor) = { identifier, ( }
- FIRST(term') = { *, ε }
- FIRST(term) = FIRST(factor) = { identifier, ( }
- FIRST(expression') = { +, ε }
- FIRST(expression) = FIRST(term) = { identifier, ( }
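This computation is also easy to express in code. Here is a sketch in Java that uses an iterate-until-nothing-changes (fixpoint) loop instead of explicit recursion; the string encoding of the grammar is an illustrative assumption, with "eps" standing in for ε:

```java
import java.util.*;

// A sketch of the FIRST-set computation for the left-factored grammar.
public class FirstSets {
    // each production: { left-hand side, symbol, symbol, ... }
    static final String[][] RULES = {
        {"E",  "T", "E'"}, {"E'", "+", "E"}, {"E'", "eps"},
        {"T",  "F", "T'"}, {"T'", "*", "T"}, {"T'", "eps"},
        {"F",  "(", "E", ")"}, {"F", "id"}
    };
    static final Set<String> NONTERMINALS = Set.of("E", "E'", "T", "T'", "F");

    static Map<String, Set<String>> firstSets() {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : NONTERMINALS) first.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {                             // repeat until nothing new is added
            changed = false;
            for (String[] rule : RULES) {
                Set<String> lhsFirst = first.get(rule[0]);
                boolean allVanish = true;             // can Y1 ... Yi all derive ε?
                for (int i = 1; i < rule.length && allVanish; i++) {
                    String sym = rule[i];
                    Set<String> symFirst = NONTERMINALS.contains(sym)
                        ? first.get(sym)
                        : Set.of(sym);                // Rule 1: FIRST(terminal) = { terminal }
                    for (String t : symFirst)         // Rule 3: add the non-ε symbols
                        if (!t.equals("eps")) changed |= lhsFirst.add(t);
                    allVanish = symFirst.contains("eps");
                }
                if (allVanish) changed |= lhsFirst.add("eps");   // Rules 2 and 3
            }
        }
        return first;
    }

    public static void main(String[] args) {
        firstSets().forEach((nt, set) -> System.out.println(nt + ": " + set));
    }
}
```

Running main() reproduces the five sets listed above.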
Building the FOLLOW Sets
Now we can define rules for computing FOLLOW sets, using FIRST sets to help us. Let α and β be arbitrary (possibly empty) strings of grammar symbols. For each grammar symbol X:

1. If X is the start symbol, then put $ in FOLLOW(X).
2. If there is a rule Y := α X β, then put everything from FIRST(β), except for ε, into FOLLOW(X).
3. If there is a rule Y := α X, or a rule Y := α X β where FIRST(β) contains ε, then put everything from FOLLOW(Y) into FOLLOW(X).
We can use these rules to generate a complete listing of FOLLOW sets by proceeding through the production rules one at a time. This time, Rule 3 tells us that we should build the sets top-down, starting with the highest-level non-terminals.
For our example grammar, we might build the FOLLOW sets in this order:

- expression
  - Rule 1 says that $ is in FOLLOW(expression).
  - Rule 2 + Production 7 say that FOLLOW(expression) contains FIRST( ) ), which is { ) }.
- expression'
  - Rule 3 + Production 1 say that FOLLOW(expression') = FOLLOW(expression).
- term
  - Rule 2 + Production 1 say that FOLLOW(term) contains FIRST(expression') - { ε }, which is { + }.
  - Rule 3 + Productions 1-3 say that FOLLOW(term) contains FOLLOW(expression) and FOLLOW(expression'), which are the same, { ), $ }.
- term'
  - Rule 3 + Production 4 say that FOLLOW(term') = FOLLOW(term).
- factor
  - Rule 2 + Production 4 say that FOLLOW(factor) contains FIRST(term') - { ε }, which is { * }.
  - Rule 3 + Productions 4-6 say that FOLLOW(factor) contains FOLLOW(term) and FOLLOW(term'), which are the same, { +, ), $ }.
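Just as with FIRST, this construction is easy to express as a fixpoint loop. Here is a sketch of a followSets() method that could be added to the FirstSets class above, reusing its RULES, NONTERMINALS, and firstSets(); the encoding and names remain illustrative assumptions, with "E" as the start symbol and "$" as the end-of-input pseudo-token:

```java
// A companion sketch of the FOLLOW-set computation.
static Map<String, Set<String>> followSets(Map<String, Set<String>> first) {
    Map<String, Set<String>> follow = new HashMap<>();
    for (String nt : NONTERMINALS) follow.put(nt, new HashSet<>());
    follow.get("E").add("$");                         // Rule 1: $ follows the start symbol
    boolean changed = true;
    while (changed) {                                 // repeat until nothing new is added
        changed = false;
        for (String[] rule : RULES) {
            for (int i = 1; i < rule.length; i++) {
                String x = rule[i];
                if (!NONTERMINALS.contains(x)) continue;
                boolean betaVanishes = true;          // can everything after X derive ε?
                for (int j = i + 1; j < rule.length && betaVanishes; j++) {
                    String sym = rule[j];
                    Set<String> symFirst = NONTERMINALS.contains(sym)
                        ? first.get(sym) : Set.of(sym);
                    for (String t : symFirst)         // Rule 2: FIRST(β) minus ε
                        if (!t.equals("eps")) changed |= follow.get(x).add(t);
                    betaVanishes = symFirst.contains("eps");
                }
                if (betaVanishes)                     // Rule 3: FOLLOW(Y) ⊆ FOLLOW(X)
                    changed |= follow.get(x).addAll(follow.get(rule[0]));
            }
        }
    }
    return follow;
}
```

Calling followSets(firstSets()) reproduces the sets we just derived by hand.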
The Payoff
This process can be tedious, but it is mechanical, repeatable, and guaranteed to produce complete and accurate FIRST and FOLLOW sets. That combination makes it a perfect candidate for automation in a computer program.
The sets themselves are not our goal; they are tools for achieving it: building a parsing table for a grammar. They enable us to build a table that is both complete and sound.
By complete we mean:
If a sentence is in the language, the parser accepts it.
By sound we mean:
If the parser accepts a sentence, then it is in the language.
Our next step is to reap the fruits of our labor building FIRST and FOLLOW sets by using them to construct a parsing table. We will do that in Session 10.
Study today's notes carefully. If you would like to see another discussion of these topics, check out the Thain text. It covers FIRST and FOLLOW sets in Section 4.3.3, and demonstrates them using an example grammar.
You can also practice on other grammars: ones we used in previous sessions, or even today's opening exercise. After you have refactored Klein's grammar, you can begin applying the ideas to your project!