Session 12
Building Abstract Syntax in the Parser
Coincidence
Many years ago, a curious tweet appeared in my Twitter stream on the morning of this session:
Back when Twitter was a more wholesome place...
if you write a parser for some application,
you probably have too much spare time on your hands
I'm guessing that none of you have any spare time on your hands right now, precisely because I asked you to write a parser. The tweet has since been deleted, perhaps because the author remembered how often people roll special-purpose parsers with great outcomes. Or maybe they simply felt guilty for discouraging compiler students!
Exercise: Defining Abstract Syntax
In the opening exercise for Session 9, I gave you
a context-free grammar
for a small imperative language and asked you to write a
recursive-descent parser for the language. We assumed the
existence of a match(Token) function that consumes the next
token and returns true when it matches, or returns
false otherwise. The result was code something like this.
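For reference, here is a sketch of what such a recognizer might look like for the grammar below. The token spellings and the exact shape of the match() helper are assumptions made for this sketch, not the Session 9 code itself:

```java
import java.util.List;

class Recognizer {
    private final List<String> tokens;
    private int pos = 0;

    Recognizer(List<String> tokens) { this.tokens = tokens; }

    // Consume the next token and report success iff it is the expected one.
    private boolean match(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) {
            pos++;
            return true;
        }
        return false;
    }

    boolean statement() {
        if (match("repeat"))
            return statement() && match("until") && expression();
        if (match("print"))
            return match("id");
        return match("id") && match("<-") && expression();
    }

    boolean expression() {
        if (match("id"))
            return match("=") && expression();
        if (match("zero?"))
            return expression();
        return match("number");
    }

    public static void main(String[] args) {
        // "repeat print a until b = zero? 1", with identifiers and numbers
        // abstracted to the generic tokens "id" and "number"
        Recognizer r = new Recognizer(List.of(
            "repeat", "print", "id", "until", "id", "=", "zero?", "number"));
        System.out.println(r.statement());   // prints true
    }
}
```

Notice that the recognizer can tell us only whether the input is a legal statement; every token's content is thrown away as soon as it is matched.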
Last session, we began our discussion of parsers that can generate abstract syntax trees and talked about defining the abstract syntax for a grammar.
statement := repeat statement until expression
| print identifier
| identifier <- expression
expression := identifier = expression
| zero? expression
| number
Name each type of node and list its components.
Solution: Defining Abstract Syntax
There are two non-terminals in the language,
statement and expression. These
non-terminals are abstractions: each is a name for
a kind of thing, but the actual things are instances
of more specific kinds of thing. For example, a statement has
to be a repeat statement, a print statement, or an assignment
statement. There is nothing in common among these three concrete
types. Likewise, there is no such thing as a concrete
expression. There are equality tests, zero tests, and numbers,
all of which are kinds of expression.
So, our abstract syntax will consist of six kinds of record:
- a repeat statement
- a print statement
- an assignment statement
- an equals test expression
- a zero test expression
- a number
When we define these records in code, we can create superclasses
or interfaces for the non-terminals statement and
expression, so that we can write code that deals
generically with statements and expressions.
Here are example implementations of the statement and expression classes in Java.
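For example, assuming Java 16+ records, the classes might look like this sketch (the names are illustrative, not the demo code itself):

```java
interface Statement { }
interface Expression { }

// Identifiers are not expressions in this grammar, but they carry
// semantic content, so they get a small record of their own.
record Identifier(String name) { }

record RepeatStatement(Statement body, Expression test) implements Statement { }
record PrintStatement(Identifier target) implements Statement { }
record AssignStatement(Identifier target, Expression value) implements Statement { }

record EqualsExpression(Identifier left, Expression right) implements Expression { }
record ZeroExpression(Expression operand) implements Expression { }
record NumberLiteral(int value) implements Expression { }

class AstDemo {
    public static void main(String[] args) {
        // the AST for: repeat print a until b = zero? 1
        Statement s = new RepeatStatement(
            new PrintStatement(new Identifier("a")),
            new EqualsExpression(new Identifier("b"),
                                 new ZeroExpression(new NumberLiteral(1))));
        System.out.println(s);   // records come with a readable toString for free
    }
}
```

Because every concrete type implements Statement or Expression, code elsewhere in the compiler can traffic in the two interfaces without caring which concrete node it has in hand.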
Exercise: Extending the Parser
Now that we have abstract syntax for the grammar, let's use it to build trees.
You may again assume that we have already written all of the
infrastructure you need, in particular six simple classes or
structs or records that correspond to the six
kinds of things in our grammar.
A Parser That Returns a Tree
Introduction
Our abstract syntax consists of two interfaces, Statement and Expression, with three concrete types of each.
The current parser returns true or false. Our new parser needs to return an abstract syntax tree — or not. So it doesn't just recognize statements and expressions; it also builds trees.
An AST Node for Identifiers
When the current parser matches a token, it moves on to the next token. Identifiers and numbers have semantic content that must be included in the abstract syntax tree: the number's numeric value and the identifier's string value. So our new parser can't simply consume these tokens any more; it also needs to record their values as nodes in the tree.
We already have a node for numbers, because they appear as a kind of expression. But identifiers show up only as parts of other nodes. Even so, we need a node to hold their content. That means there are actually seven kinds of things in our grammar: the seventh is the identifier node!
Modifying the Parser's Approach
We might modify the parser in this way. We make two main kinds of change to the algorithm:
- Rather than use a conjunction of token and non-terminal matches, we use a sequence of matches.
- When we match a non-terminal, or a token with semantic content (a number or an identifier), we store its value for use in constructing a new node in the tree.
The recursive descent technique still provides most of the machinery we need. With only a few changes, we have a working design.
Our old parser returned "false" when it failed to recognize a program. In the new parser, successful parses will return nodes for the abstract syntax tree. What should the default case return?
Perhaps null, which becomes an indicator for no
match. This raises other questions, though, related to
invalid programs as input: What should happen if, say, a
token match or expression match fails? What should the parser
do? How might we improve it?
Implementing the Parser in Code
As with the Session 9 version of this problem, turning this
design into
a legal Java parser
is not too difficult. The biggest job this time is to handle
the error cases, either by checking for null or
throwing exceptions. I opted to throw exceptions. My demo
catches the exceptions and prints an error message.
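To give the flavor of that design, here is a compressed, self-contained sketch. Unlike the demo, it builds nested strings rather than the AST classes, and it marks token values with a hypothetical "id:"/"num:" prefix; like the demo, it signals errors by throwing exceptions:

```java
import java.util.List;

// A sketch of a tree-building recursive-descent parser. Trees are nested
// strings here only to keep the sketch short; the demo uses real node classes.
class TreeParser {
    private final List<String> tokens;   // e.g. "repeat", "id:a", "num:1"
    private int pos = 0;

    TreeParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "$";
    }

    // Consume the next token and return its value, failing loudly on a mismatch.
    private String expect(String kind) {
        String t = peek();
        if (!t.equals(kind) && !t.startsWith(kind + ":"))
            throw new IllegalStateException("expected " + kind + ", saw " + t);
        pos++;
        return t.contains(":") ? t.substring(t.indexOf(':') + 1) : t;
    }

    String statement() {
        switch (peek()) {
            case "repeat": {
                expect("repeat");
                String body = statement();
                expect("until");
                return "(repeat " + body + " until " + expression() + ")";
            }
            case "print":
                expect("print");
                return "(print " + expect("id") + ")";
            default: {                    // must be an assignment
                String target = expect("id");
                expect("<-");
                return "(" + target + " <- " + expression() + ")";
            }
        }
    }

    String expression() {
        if (peek().equals("zero?")) {
            expect("zero?");
            return "(zero? " + expression() + ")";
        }
        if (peek().startsWith("num:"))
            return expect("num");
        String left = expect("id");       // must be an equality test
        expect("=");
        return "(" + left + " = " + expression() + ")";
    }

    public static void main(String[] args) {
        TreeParser p = new TreeParser(List.of(
            "repeat", "print", "id:a", "until", "id:b", "=", "zero?", "num:1"));
        System.out.println(p.statement());
        // prints: (repeat (print a) until (b = (zero? 1)))
    }
}
```

Each parsing method now returns the subtree it recognized, and a failed expect() throws rather than returning false — exactly the two changes discussed above.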
As noted when we discussed
the opening exercise,
in order to turn this into real code, we have to implement
the abstract syntax of the language. For this demo, I kept
it quite simple: a set of statement nodes that implement
a Statement interface
and a set of expression nodes that implement
an Expression interface.
A complete implementation would specify appropriate access
methods in the interfaces, and perhaps a method for turning
the node into a string or other presentable object,
à la:
repeat print a until b = zero? 1
Today's zip file contains the parser, the AST classes, and the driver, along with mock token and scanner classes for testing it. (The scanner really *is* a mock: it reads from a list of tokens!) The parser runs — give it a try!
Recap: Abstract Syntax and Semantic Actions
A parser for things
Is a function from strings
To lists of pairs
Of things and strings.
— a poem by Graham Hutton
We are discussing the syntax analysis phase of a compiler. We have begun to learn how a parser can return as its result the syntactic structure of the program given to it. In Session 11, we saw two possible kinds of output: the parse tree and the abstract syntax tree (AST). Parse trees are verbose and contain a lot of syntactic noise, so we prefer the more compact AST. The AST is our first intermediate representation of a source program. It serves as the primary input to all later phases in the compiler.
If you would like to read another discussion of the difference between concrete and abstract syntax, and why we usually prefer abstract syntax, see Eli Bendersky's Abstract versus Concrete Syntax Trees.
At the end of last session, we considered some of the important issues involved in augmenting our table-driven parsing technique so that it creates an AST as a side effect of recognizing valid programs. We identified two key issues:
- First, we need the ability to associate an instruction to create a node for the abstract syntax tree with some of the production rules in the grammar. We call such an instruction a semantic action.
- Second, a table-driven parser will need a new data structure in which to store nodes that have been constructed but not yet integrated into the tree. We call this structure the semantic stack.
Let's get down to the details. First, we will outline the changes to be made. Then we will make them.
Adding Semantic Actions to a Table-Driven Parser
In order to build an abstract syntax tree, we need to make four modifications to our table-driven approach.
Two of these changes are specific to the grammar being processed:
- Add a semantic action to each rule that corresponds to an element in the abstract syntax.

  Often, a semantic action comes at the end of the grammar rule. Because the algorithm pushes a rule's right-hand side onto the parse stack in reverse order, an action at the end of the rule is pushed first, and so it is processed after the rest of the rule's right-hand side.

- Write a factory method for each type of AST node.

  A semantic action will invoke one of these methods when the algorithm executes it, popping zero or more items off of the semantic stack, creating a new node for the AST, and placing it back on the semantic stack.
The semantic actions that we add to the grammar do not affect any other uses of the grammar. In particular, they have no effect on how we create FIRST and FOLLOW sets or on how we build the parse table. They are only annotations to be used by the extended parsing algorithm.
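A minimal sketch of such factory methods, assuming string-based nodes and Java's ArrayDeque as the semantic stack; the names mirror the semantic actions we add to the grammar later in this session:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class AstFactory {
    // Nodes are plain nested strings here, only to keep the sketch short.
    static void makeIdentifier(Deque<String> stack, String matchedTerminal) {
        stack.push(matchedTerminal);      // pops nothing; uses the token's value
    }

    static void makePlusExp(Deque<String> stack) {
        String right = stack.pop();       // operands were pushed left-to-right,
        String left = stack.pop();        // so they pop off in reverse order
        stack.push("[" + left + " + " + right + "]");
    }

    static void makeTimesExp(Deque<String> stack) {
        String right = stack.pop();
        String left = stack.pop();
        stack.push("[" + left + " * " + right + "]");
    }

    public static void main(String[] args) {
        Deque<String> stack = new ArrayDeque<>();
        makeIdentifier(stack, "x");
        makeIdentifier(stack, "y");
        makeIdentifier(stack, "z");
        makeTimesExp(stack);              // top of stack: [y * z]
        makePlusExp(stack);               // top of stack: [x + [y * z]]
        System.out.println(stack.peek()); // prints: [x + [y * z]]
    }
}
```

Note the invariant: every factory method pops zero or more nodes and pushes exactly one new node, so a successful parse leaves a single AST on the semantic stack.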
The other two changes are general, to the parsing algorithm itself:
- Create a semantic stack and initialize it to empty.

  This stack holds the results of executing semantic actions and provides the values they need to do their work.

- Add an arm to handle semantic actions.

  When the parser encounters a semantic action on top of its parsing stack, it will "execute" the action, calling its factory method to create a part of the AST.
Now that we have two stacks, let's call the original stack the parse stack, because it keeps track of the data used by the parsing process itself. In this approach, semantic actions are pushed onto the parse stack, along with grammar symbols. Each semantic action operates on the semantic stack, popping zero or more nodes from it and pushing a new AST node onto it. The semantic stack holds the products of the semantic actions.
The new algorithm needs a bit more memory than just the new stack. It also needs a temporary variable to remember the content of the most recent terminal it matched. We might call this variable matched terminal. When a terminal matches against an input token, the algorithm must retain the value of the token it matched until it invokes the semantic action to make a node (say, an identifier or a number literal).
So, a semantic action will operate not only on the semantic stack but also on a matched terminal.
To illustrate this process, we will again use our familiar little arithmetic grammar as a running example:
expression := term expression'
expression' := + expression
| ε
term := factor term'
term' := * term
| ε
factor := ( expression )
| identifier
What is the abstract syntax of this language? It contains addition expressions, multiplication expressions, and identifiers. We will need nodes for each of these in our ASTs.
Note that we do not have to represent parenthesized expressions as a kind of thing in the AST of a program. Parentheses give the programmer a way to tell the parser how to build the AST, but they do not add any other meaning to the program. When we represent a parenthesized expression in the tree, the shape of the tree will encode that information. Our semantic actions need to generate AST nodes that reflect the order of operations specified by the language and the program.
Now let's make the four changes outlined above. First, we will modify the parsing algorithm itself. These changes make it possible to build abstract syntax. Then we will make the changes specific to the grammar of our little language.
Changes to the Table-Driven Parsing Algorithm
The first step is straightforward: before entering the parsing loop, we create an empty semantic stack and a matched-terminal variable, along with the empty parse stack.
The second is to add an arm to the algorithm to handle semantic actions. The parser operates as it did before, based on the symbol A on top of the parse stack and the next token t in the input stream. But now the symbol on top of the parse stack may be a semantic action, which signals the need to create a new node for the AST.
Here is the augmented algorithm. After we create an empty parse stack, an empty semantic stack, and an empty matched-terminal variable, we...

- Push the end-of-stream symbol, $, onto the parse stack.
- Push the start symbol onto the parse stack.
- Repeat, where A is the symbol on top of the parse stack and t is the next input token:
  - A is a terminal. If A == t, then pop A from the parse stack and consume t; if the terminal has a value, store it in the matched-terminal variable. Else we have a token mismatch, so fail.
  - A is a non-terminal. If table entry [A, t] contains a rule A := Y1 Y2 ... Yn, then pop A from the parse stack and push Yn Yn-1 ... Y1 onto the parse stack, in that order. Else there is no transition for this pair, so fail.
  - A is a semantic action. Pop A from the parse stack and apply it to the semantic stack.
- When the parse stack is empty, return the value on top of the semantic stack.
Notice that we now have to distinguish between the two stacks at each step. The only step that refers to the semantic stack is the new step, which applies the semantic action encountered on the top of the parse stack. The action will build a new AST node, perhaps consuming nodes already on the semantic stack, and push the result onto the semantic stack.
Note: In a practical sense, the algorithm is still incomplete. It recognizes failures in the cases of a token mismatch and a missing entry in the parse table, but what other kinds of errors can occur? When and how should the algorithm recognize and signal them?
Once the parsing algorithm is modified, it will work for any parsing table created from a suitable grammar.
Changes to the Language Grammar
Next, we have to make two changes to whatever language grammar we wish to build a parser for.
First, we augment the grammar with semantic actions for creating AST nodes. Given the three kinds of abstract syntax in our expression language, we need to add semantic actions to the three rules that recognize them.
We will place the semantic action at the point in a grammar rule where we have recognized all of the parts needed to construct an AST node. This will often be at the end of a grammar rule, but not always. In this grammar, we have recognized...
- all of the parts of a multiplication expression as soon as we recognize the factor that follows the * operator, and
- all of the parts of an addition expression as soon as we recognize the term that follows the + operator.
We want to build the nodes then, rather than at the end of the rule. This requires us to expand some non-terminals once by hand on the right hand sides of the binary operator rules. (We will consider other possible wrinkles with parsing these expressions next time.)
First, we expand the expression and term symbols on the right hand side of the rules once by hand, to expose the right operands associated with the operators:
expression := term expression'
expression' := + term expression'
| ε
term := factor term'
term' := * factor term'
| ε
factor := ( expression )
| identifier
Then we add the semantic actions:
expression := term expression'
expression' := + term make-plus-exp expression'
| ε
term := factor term'
term' := * factor make-times-exp term'
| ε
factor := ( expression )
| identifier make-identifier
Expanding the non-terminals and adding the semantic actions do not affect how we build the parsing table, so it remains the same, with the augmented rules in place of the originals:
| | identifier | + | * | ( | ) | $ |
|---|---|---|---|---|---|---|
| E | TE' | | | TE' | | |
| E' | | +T make-+ E' | | | ε | ε |
| T | FT' | | | FT' | | |
| T' | | ε | *F make-* T' | | ε | ε |
| F | identifier make-id | | | ( E ) | | |
Finally, we write factory methods for each kind of abstract syntax. This is straightforward enough; these will be constructors for the records or objects that make up our representation. We will talk more about writing code to implement the AST next session.
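Putting all four changes together, here is one self-contained sketch of the augmented loop for this grammar. The table encoding, the token format ("id:x" for identifiers), the action names (m-id, m-+, m-*, matching the shorthand used in the trace below), and the string-based AST nodes are all assumptions made for this sketch, not the session's demo code:

```java
import java.util.*;

class TableParser {
    // TABLE.get(A).get(t) = right-hand side to push; a missing entry means fail
    static final Map<String, Map<String, List<String>>> TABLE = Map.of(
        "E",  Map.of("id", List.of("T", "E'"), "(", List.of("T", "E'")),
        "E'", Map.of("+", List.of("+", "T", "m-+", "E'"),
                     ")", List.of(), "$", List.of()),
        "T",  Map.of("id", List.of("F", "T'"), "(", List.of("F", "T'")),
        "T'", Map.of("*", List.of("*", "F", "m-*", "T'"),
                     "+", List.of(), ")", List.of(), "$", List.of()),
        "F",  Map.of("id", List.of("id", "m-id"), "(", List.of("(", "E", ")"))
    );
    static final Set<String> TERMINALS = Set.of("id", "+", "*", "(", ")", "$");

    static String parse(List<String> input) {   // tokens like "id:x", "+", "$"
        Deque<String> parseStack = new ArrayDeque<>();
        Deque<String> semanticStack = new ArrayDeque<>();
        String matchedTerminal = null;
        int pos = 0;
        parseStack.push("$");
        parseStack.push("E");
        while (!parseStack.isEmpty()) {
            String a = parseStack.pop();
            String t = input.get(pos);
            String kind = t.startsWith("id:") ? "id" : t;
            switch (a) {
                case "m-id" -> semanticStack.push(matchedTerminal);
                case "m-+", "m-*" -> {
                    String right = semanticStack.pop();
                    String left = semanticStack.pop();
                    String op = a.equals("m-+") ? "+" : "*";
                    semanticStack.push("[" + left + " " + op + " " + right + "]");
                }
                default -> {
                    if (TERMINALS.contains(a)) {      // match a terminal
                        if (!a.equals(kind))
                            throw new IllegalStateException("mismatch at " + t);
                        matchedTerminal = kind.equals("id") ? t.substring(3) : t;
                        pos++;
                    } else {                          // expand a non-terminal
                        List<String> rhs = TABLE.get(a).get(kind);
                        if (rhs == null)
                            throw new IllegalStateException("no rule for [" + a + ", " + t + "]");
                        for (int i = rhs.size() - 1; i >= 0; i--)
                            parseStack.push(rhs.get(i));
                    }
                }
            }
        }
        return semanticStack.pop();   // the AST for the whole input
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("id:x", "+", "id:y", "*", "id:z", "$")));
        // prints: [x + [y * z]]
    }
}
```

The main loop is still the familiar table-driven parser; only the two new arms (semantic actions) and the matched-terminal variable have been added.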
Watching the New Table-Driven Parser at Work
Let's trace this new version of the algorithm as it recognizes the token stream:
x + y * z
This is the same expression we recognized back in Session 9, with the generic identifier tokens replaced by real identifiers. Let's re-label the "Stack" column in our table as the Parse Stack, to reflect the fact that we now have two stacks to consider, and replace the Rule Matched column with a new column called Semantic Stack.
| Parse Stack | Input | Semantic Stack |
|---|---|---|
| $ | x + y * z $ | |
| $E | x + y * z $ | |
| $E'T | x + y * z $ | |
| $E'T'F | x + y * z $ | |
| $E'T' m-id id | x + y * z $ | |
| $E'T' m-id | + y * z $ | |
| $E'T' | + y * z $ | [x] |
| $E' | + y * z $ | [x] |
| $E' m-+ T + | + y * z $ | [x] |
| $E' m-+ T | y * z $ | [x] |
| $E' m-+ T'F | y * z $ | [x] |
| $E' m-+ T' m-id id | y * z $ | [x] |
| $E' m-+ T' m-id | * z $ | [x] |
| $E' m-+ T' | * z $ | [x] [y] |
| $E' m-+ T' m-* F * | * z $ | [x] [y] |
| $E' m-+ T' m-* F | z $ | [x] [y] |
| $E' m-+ T' m-* m-id id | z $ | [x] [y] |
| $E' m-+ T' m-* m-id | $ | [x] [y] |
| $E' m-+ T' m-* | $ | [x] [y] [z] |
| $E' m-+ T' | $ | [x] [ [y] * [z] ] |
| $E' m-+ | $ | [x] [ [y] * [z] ] |
| $E' | $ | [ [x] + [ [y] * [z] ] ] |
| $ | $ | [ [x] + [ [y] * [z] ] ] |

The parser returns: [ [x] + [ [y] * [z] ] ]
The main loop of the algorithm terminates when the parse stack goes empty. The algorithm then returns the AST on the semantic stack, which represents the input sequence. The brackets in the semantic stack show the grouping of parts in the AST. The result matches the precedence rules implicit in the grammar.
Notice how the algorithm uses the new matched-terminal variable. In this example, matching an identifier against an x, y, or z requires that the parser remember the actual identifier that was matched until it sees a make-identifier semantic action. This will also be true for other terminals with content, such as numeric literals.
We can think of the semantic action operating on a pair:
(semantic stack, matched terminal). Some
semantic actions, such as make-identifier, ignore the
value of the semantic stack because they get the value they need
from the matched terminal variable. Other semantic actions,
such as make-plus-exp and make-times-exp, ignore
the value of the matched terminal variable because they get all
the values they need off of the semantic stack.
This tells us something about how to implement semantic actions in our program... Next time!
Closing Exercise
We want the * operator to bind more tightly than the
+. In the case of x + y * z, it did.
But was that an accident? Maybe it worked properly because the
leftmost operator became the root of the tree. Will the parser
construct the correct AST for a program in which the operators
are swapped?
We know from how the algorithm works that the make-* semantic action will be pushed onto the parse stack first. Will it get buried beneath a make-+ action that gets executed first?
There is one good way to find out: trace the algorithm! So...
x * y + z
After working through this on your own... check out this trace. More next time.