Session 12
Building Abstract Syntax in the Parser

Coincidence

Many years ago, a curious tweet appeared in my Twitter stream on the morning of this session:

if you write a parser for some application,
you probably have too much spare time on your hands

I'm guessing that none of you have any spare time on your hands right now, precisely because I asked you to write a parser. The tweet has since been deleted, perhaps because the author remembered how often people roll special-purpose parsers with great outcomes. Or maybe they simply felt guilty for discouraging compiler students!

Exercise: Defining Abstract Syntax

Refer to this handout for the grammar and parser used in the exercises.

In the opening exercise for Session 9, I gave you a context-free grammar for a small imperative language and asked you to write a recursive-descent parser for the language. We assumed the existence of a match(Token) function that consumes the next token and returns true when it matches the expected token, or returns false otherwise. The result was code something like this.

Last session, we began our discussion of parsers that can generate abstract syntax trees and talked about defining the abstract syntax for a grammar.

Define the abstract syntax for the grammar of this language.
statement  := repeat statement until expression
            | print identifier
            | identifier <- expression
expression := identifier = expression
            | zero? expression
            | number

Name each type of node and list its components.

Solution: Defining Abstract Syntax

There are two non-terminals in the language, statement and expression. These non-terminals are abstractions: each is a name for a kind of thing, but the actual things are instances of more specific kinds of thing. For example, a statement has to be a repeat statement, a print statement, or an assignment statement. These three concrete types have nothing in common beyond being statements. Likewise, there is no such thing as a concrete expression. There are equality tests, zero tests, and numbers, all of which are kinds of expression.

So, our abstract syntax will consist of six kinds of record:

  • a repeat statement, with a body statement and a test expression
  • a print statement, with an identifier
  • an assignment statement, with an identifier and an expression
  • an equality expression, with an identifier and an expression
  • a zero-test expression, with an expression
  • a number, with a numeric value

When we define these records in code, we can create superclasses or interfaces for the non-terminals statement and expression, so that we can write code that deals generically with statements and expressions.

Here are example implementations of the statement and expression classes in Java.
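A hedged sketch of what such classes might look like, using Java records for the six node types and an interface for each non-terminal. The type and component names here are my own, not necessarily those in the demo code.

```java
// Illustrative AST types: one interface per non-terminal,
// one record per concrete kind of thing in the grammar.
public class AbstractSyntax {
    interface Statement {}
    interface Expression {}

    // The three kinds of statement.
    record RepeatStatement(Statement body, Expression test) implements Statement {}
    record PrintStatement(String identifier) implements Statement {}
    record AssignStatement(String identifier, Expression value) implements Statement {}

    // The three kinds of expression.
    record EqualityTest(String identifier, Expression right) implements Expression {}
    record ZeroTest(Expression argument) implements Expression {}
    record NumberLiteral(int value) implements Expression {}

    // Build the tree for:  repeat print a until b = zero? 1
    public static Statement example() {
        return new RepeatStatement(
            new PrintStatement("a"),
            new EqualityTest("b", new ZeroTest(new NumberLiteral(1))));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Records give us constructors, accessors, and a readable toString for free, which is handy when printing trees for debugging.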

Exercise: Extending the Parser

Now that we have abstract syntax for the grammar, let's use it to build trees.

What changes do we have to make to the parser in order to return the abstract syntax tree of a program that it recognizes?

You may again assume that we have already written all of the infrastructure you need, in particular six simple classes or structs or records that correspond to the six kinds of things in our grammar.

A Parser That Returns a Tree

Our abstract syntax consists of two interfaces, Statement and Expression, with three concrete types of each.

The current parser returns true or false. Our new parser needs to return an abstract syntax tree — or not. So it doesn't just recognize statements and expressions; it also builds trees.

An AST Node for Identifiers

When the current parser matches a token, it moves on to the next token. Identifiers and numbers have semantic content that must be included in the abstract syntax tree: the number's numeric value and the identifier's string value. So our new parser can't simply consume these tokens any more. It also needs to record their values as nodes in the tree.

We already have a node for numbers, because they appear as a kind of expression. But identifiers only show up as parts of other nodes. Even so, we need a node to hold their content. That means there are actually seven kinds of things in our grammar: the seventh is the identifier node!

Modifying the Parser's Approach

We might modify the parser in this way. We make two main kinds of change to the algorithm:

  • Rather than use a conjunction of token and non-terminal matches, we use a sequence of matches.
  • When we match a non-terminal, or a token with semantic content (a number or an identifier), we store its value for use in constructing a new node in the tree.

The recursive descent technique still provides most of the machinery we need. With only a few changes, we have a working design.
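To make the change concrete, here is a hedged sketch of how one production, statement := print identifier, might change from a boolean recognizer into a tree builder. The names are illustrative, not the demo's, and the token stream is a mock list of strings.

```java
import java.util.List;

public class MiniParser {
    interface Statement {}
    record PrintStatement(String identifier) implements Statement {}

    private final List<String> tokens;  // a mock token stream, like the demo's scanner
    private int pos = 0;

    MiniParser(List<String> tokens) { this.tokens = tokens; }

    // Old style, a recognizer:
    //     boolean statement() { return match("print") && matchIdentifier(); }
    // New style, a parser: consume the tokens *and* record the identifier's
    // value in a new node. (Error handling is omitted here.)
    Statement statement() {
        if (!match("print")) return null;   // null as the "no match" indicator
        String name = tokens.get(pos++);    // consume the identifier, keeping its value
        return new PrintStatement(name);
    }

    private boolean match(String expected) {
        if (pos < tokens.size() && tokens.get(pos).equals(expected)) { pos++; return true; }
        return false;
    }

    public static Statement parse(List<String> tokens) {
        return new MiniParser(tokens).statement();
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("print", "a")));
    }
}
```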

Our old parser returned "false" when it failed to recognize a program. In the new parser, successful parses will return nodes for the abstract syntax tree. What should the default case return?

Perhaps null, which becomes an indicator for no match. This raises other questions, though, related to invalid programs as input: What should happen if, say, a token match or expression match fails? What should the parser do? How might we improve it?

Implementing the Parser in Code

As with the Session 9 version of this problem, turning this design into a legal Java parser is not too difficult. The biggest job this time is to handle the error cases, either by checking for null or throwing exceptions. I opted to throw exceptions. My demo catches the exceptions and prints an error message.

As noted when we discussed the opening exercise, in order to turn this into real code, we have to implement the abstract syntax of the language. For this demo, I kept it quite simple: a set of statement nodes that implement a Statement interface and a set of expression nodes that implement an Expression interface. A complete implementation would specify appropriate access methods in the interfaces, and perhaps a method for turning the node into a string or other presentable object, à la:

repeat
  print a
  until b = zero? 1

Today's zip file contains the parser, the AST classes, and the driver, along with mock token and scanner classes for testing it. (The scanner really *is* a mock: it reads from a list of tokens!) The parser runs — give it a try!

Recap: Abstract Syntax and Semantic Actions

Thing 1 and Thing 2, from The Cat in the Hat
A parser for things
Is a function from strings
To lists of pairs
Of things and strings.
a poem by Graham Hutton

We are discussing the syntax analysis phase of a compiler. We have begun to learn how a parser can return as its result the syntactic structure of the program given to it. In Session 11, we saw two possible kinds of output: the parse tree and the abstract syntax tree (AST). Parse trees are verbose and contain a lot of syntactic noise, so we prefer the more compact AST. The AST is our first intermediate representation of a source program. It serves as the primary input to all later phases in the compiler.

If you would like to read another discussion of the difference between concrete and abstract syntax, and why we usually prefer abstract syntax, see Eli Bendersky's Abstract versus Concrete Syntax Trees.

At the end of last session, we considered some of the important issues involved in augmenting our table-driven parsing technique so that it creates an AST as a side effect of recognizing valid programs. We identified two key issues: when the parser should create each AST node, and where it should store partial results until they are needed.

Let's get down to the details. First, we will outline the changes to be made. Then we will make them.

Adding Semantic Actions to a Table-Driven Parser

In order to build an abstract syntax tree, we need to make four modifications to our table-driven approach.

Two of these changes are specific to the grammar being processed:

  • augment the grammar's rules with semantic actions at the points where AST nodes should be built, and
  • write the factory methods that those actions use to create nodes.

The semantic actions that we add to the grammar do not affect any other uses of the grammar. In particular, they have no effect on how we create FIRST and FOLLOW sets or on how we build the parse table. They are only annotations to be used by the extended parsing algorithm.

The other two changes are general, to the parsing algorithm itself:

  • add a second stack, the semantic stack, along with a matched-terminal variable, and
  • add a case to the parsing loop that executes a semantic action when one appears on top of the parse stack.

Now that we have two stacks, let's call the original stack the parse stack, because it keeps track of the data used by the parsing process itself. In this approach, semantic actions are pushed onto the parse stack, along with grammar symbols. Each semantic action operates on the semantic stack, popping zero or more nodes from it and pushing a new AST node on to it. The semantic stack holds the products of the semantic actions.

The new algorithm needs a bit more memory than just the new stack. It also needs a temporary variable to remember the content of the most recent terminal it matched. We might call this variable matched terminal. When a terminal matches against an input token, the algorithm must retain the value of the token it matched until it invokes the semantic action to make a node (say, an identifier or a number literal).

So, a semantic action will operate not only on the semantic stack but also on a matched terminal.
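In code, a semantic action can be modeled as a function over that pair. Here is a hedged sketch in Java, with illustrative names: one action that reads only the matched terminal, and one that reads only the semantic stack.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SemanticActions {
    interface Node {}
    record Identifier(String name) implements Node {}
    record PlusExp(Node left, Node right) implements Node {}

    // A semantic action sees the pair (semantic stack, matched terminal).
    interface SemanticAction {
        void apply(Deque<Node> semanticStack, String matchedTerminal);
    }

    // make-identifier uses only the matched terminal.
    static final SemanticAction MAKE_IDENTIFIER =
        (stack, matched) -> stack.push(new Identifier(matched));

    // make-plus-exp uses only the semantic stack: pop the right operand,
    // then the left, and push the new node.
    static final SemanticAction MAKE_PLUS_EXP =
        (stack, matched) -> {
            Node right = stack.pop();
            Node left = stack.pop();
            stack.push(new PlusExp(left, right));
        };

    public static Node demo() {
        Deque<Node> stack = new ArrayDeque<>();
        MAKE_IDENTIFIER.apply(stack, "x");   // matched terminal holds "x"
        MAKE_IDENTIFIER.apply(stack, "y");   // matched terminal holds "y"
        MAKE_PLUS_EXP.apply(stack, null);    // no terminal value needed
        return stack.peek();                 // a PlusExp of x and y
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```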

To illustrate this process, we will again use our familiar little arithmetic grammar as a running example:

expression  := term expression'
expression' := + expression
             | ε
term        := factor term'
term'       := * term
             | ε
factor      := ( expression )
             | identifier

What is the abstract syntax of this language? It contains addition expressions, multiplication expressions, and identifiers. We will need nodes for each of these in our ASTs.

Note that we do not have to represent parenthesized expressions as a kind of thing in the AST of a program. Parentheses give the programmer a way to tell the parser how to build the AST, but they do not add any other meaning to the program. When we represent a parenthesized expression in the tree, the shape of the tree will encode that information. Our semantic actions need to generate AST nodes that reflect the order of operations specified by the language and the program.
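So the abstract syntax for this little language needs only three node types. A minimal sketch, with names of my own choosing:

```java
public class ArithmeticAst {
    interface Exp {}
    record Identifier(String name) implements Exp {}
    record PlusExp(Exp left, Exp right) implements Exp {}
    record TimesExp(Exp left, Exp right) implements Exp {}

    // The AST for  x + y * z : the * binds tighter, so the
    // multiplication is the right child of the addition.
    public static Exp example() {
        return new PlusExp(
            new Identifier("x"),
            new TimesExp(new Identifier("y"), new Identifier("z")));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Notice that there is no node for parenthesized expressions; the shape of the tree carries that information.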

Now let's make the four changes outlined above. First, we will modify the parsing algorithm itself. These changes make it possible to build abstract syntax. Then we will make the changes specific to the grammar of our little language.

Changes to the Table-Driven Parsing Algorithm

The first step is straightforward: before entering the parsing loop, we create an empty semantic stack and a matched-terminal variable, along with the empty parse stack.

The second is to add an arm to the algorithm to handle semantic actions. The parser operates as it did before, based on the symbol A on top of the parse stack and the next token t in the input stream. But now the symbol on top of the parse stack may be a semantic action, which signals the need to create a new node for the AST.

Here is the augmented algorithm. After we create an empty parse stack, an empty semantic stack and an empty matched-terminal variable, we...

  1. Push the end-of-stream symbol, $, onto the parse stack.
  2. Push the start symbol onto the parse stack.
  3. Repeat, where A is the symbol on top of the parse stack and t is the next token in the input stream:
    • A is a terminal.
      If A == t, pop A from the parse stack and consume t. If the terminal has a value, store it in the matched-terminal variable.
      Otherwise, we have a token mismatch, so fail.
    • A is a non-terminal.
      If table entry [A, t] contains a rule A := Y1 Y2 ... Yn, pop A from the parse stack and push Yn Yn-1 ... Y1 onto the parse stack, in that order.
      Otherwise, there is no transition for this pair, so fail.
    • A is a semantic action.
      Pop A from the parse stack and apply it to the semantic stack.
    ... until the parse stack is empty.
  4. Return the value on top of the semantic stack.

Notice that we now have to distinguish between the two stacks at each step. The only step that refers to the semantic stack is the new step, which applies the semantic action encountered on the top of the parse stack. The action will build a new AST node, perhaps consuming nodes already on the semantic stack, and push the result onto the semantic stack.

Note: In a practical sense, the algorithm is still incomplete. It recognizes failures in the cases of a token mismatch and a missing entry in the parse table, but what other kinds of errors can occur? When and how should the algorithm recognize and signal them?

Once the parsing algorithm is modified, it will work for any parsing table created from a suitable grammar.

Changes to the Language Grammar

Next, we have to make two changes to whatever language grammar we wish to build a parser for.

First, we augment the grammar with semantic actions for creating AST nodes. Given the three kinds of abstract syntax in our expression language, we need to add semantic actions to the three rules that recognize them.

We will place the semantic action at the point in a grammar rule at which we have recognized all of the parts needed to construct an AST node. This will often be at the end of a grammar rule, but not always. In this grammar, we have recognized all of the parts of an addition or a multiplication as soon as we have matched the operator and its right operand.

We want to build the nodes then, rather than at the end of the rule. This requires us to expand some non-terminals once by hand on the right hand sides of the binary operator rules. (We will consider other possible wrinkles with parsing these expressions next time.)

First, we expand the expression and term symbols on the right hand side of the rules once by hand, to expose the right operands associated with the operators:

expression  := term expression'
expression' := + term expression'
             | ε
term        := factor term'
term'       := * factor term'
             | ε
factor      := ( expression )
             | identifier

Then we add the semantic actions:

expression  := term expression'
expression' := + term make-plus-exp expression'
             | ε
term        := factor term'
term'       := * factor make-times-exp term'
             | ε
factor      := ( expression )
             | identifier make-identifier

Expanding the non-terminals and adding the semantic actions do not affect how we build the parsing table, so it remains the same, with the augmented rules in place of the originals:

      identifier           +              *              (      )   $
E     TE'                  .              .              TE'    .   .
E'    .                    +T make-+ E'   .              .      ε   ε
T     FT'                  .              .              FT'    .   .
T'    .                    ε              *F make-* T'   .      ε   ε
F     identifier make-id   .              .              ( E )  .   .

Finally, we write factory methods for each kind of abstract syntax. This is straightforward enough; these will be constructors for the records or objects that make up our representation. We will talk more about writing code to implement the AST next session.
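Putting all four changes together, here is one way the augmented loop might look in Java. This is my own compact sketch, not the demo code: the parse table is hard-coded, AST nodes are represented as parenthesized strings rather than real records, and the action names make-plus, make-times, and make-id mirror the annotations above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class TableParser {
    static final List<String> EPSILON = List.of();

    // The parse table for the augmented grammar. Missing entries mean "fail".
    static final Map<String, Map<String, List<String>>> TABLE = Map.of(
        "E",  Map.of("id", List.of("T", "E'"), "(", List.of("T", "E'")),
        "E'", Map.of("+", List.of("+", "T", "make-plus", "E'"),
                     ")", EPSILON, "$", EPSILON),
        "T",  Map.of("id", List.of("F", "T'"), "(", List.of("F", "T'")),
        "T'", Map.of("*", List.of("*", "F", "make-times", "T'"),
                     "+", EPSILON, ")", EPSILON, "$", EPSILON),
        "F",  Map.of("id", List.of("id", "make-id"), "(", List.of("(", "E", ")"))
    );

    static String typeOf(String token) {
        return "+*()$".contains(token) ? token : "id";
    }

    public static String parse(List<String> input) {
        Deque<String> parseStack = new ArrayDeque<>();
        Deque<String> semanticStack = new ArrayDeque<>();
        String matchedTerminal = null;
        int pos = 0;
        parseStack.push("$");                              // step 1
        parseStack.push("E");                              // step 2
        while (!parseStack.isEmpty()) {                    // step 3
            String a = parseStack.pop();
            String t = typeOf(input.get(pos));
            if (a.startsWith("make-")) {                   // A is a semantic action
                switch (a) {
                    case "make-id" -> semanticStack.push(matchedTerminal);
                    case "make-plus" -> {
                        String right = semanticStack.pop(), left = semanticStack.pop();
                        semanticStack.push("(" + left + " + " + right + ")");
                    }
                    case "make-times" -> {
                        String right = semanticStack.pop(), left = semanticStack.pop();
                        semanticStack.push("(" + left + " * " + right + ")");
                    }
                }
            } else if (TABLE.containsKey(a)) {             // A is a non-terminal
                List<String> rule = TABLE.get(a).get(t);
                if (rule == null) throw new IllegalStateException("no rule for " + a + ", " + t);
                for (int i = rule.size() - 1; i >= 0; i--) parseStack.push(rule.get(i));
            } else {                                       // A is a terminal
                if (!a.equals(t)) throw new IllegalStateException("expected " + a + ", saw " + t);
                matchedTerminal = input.get(pos++);        // remember the token's value
            }
        }
        return semanticStack.peek();                       // step 4
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("x", "+", "y", "*", "z", "$")));
    }
}
```

On the token stream x + y * z $, this sketch produces the string (x + (y * z)), the same grouping as the bracketed tree in the trace that follows.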

Watching the New Table-Driven Parser at Work

Let's trace this new version of the algorithm as it recognizes the token stream:

x + y * z

This is the same expression we recognized back in Session 9, with generic identifier tokens replaced with real identifiers. Let's re-label the "Stack" column in our table as the Parse Stack, to reflect the fact that we now have two stacks to consider, and add a new column called Semantic Stack in place of the Rule Matched column.

Parse Stack             Input         Semantic Stack
$                       x + y * z $   .
$E                      x + y * z $   .
$E'T                    x + y * z $   .
$E'T'F                  x + y * z $   .
$E'T' m-id id           x + y * z $   .
$E'T' m-id              + y * z $     .
$E'T'                   + y * z $     [x]
$E'                     + y * z $     [x]
$E' m-+ T +             + y * z $     [x]
$E' m-+ T               y * z $       [x]
$E' m-+ T'F             y * z $       [x]
$E' m-+ T' m-id id      y * z $       [x]
$E' m-+ T' m-id         * z $         [x]
$E' m-+ T'              * z $         [x] [y]
$E' m-+ T' m-* F *      * z $         [x] [y]
$E' m-+ T' m-* F        z $           [x] [y]
$E' m-+ T' m-* m-id id  z $           [x] [y]
$E' m-+ T' m-* m-id     $             [x] [y]
$E' m-+ T' m-*          $             [x] [y] [z]
$E' m-+ T'              $             [x] [ [y] * [z] ]
$E' m-+                 $             [x] [ [y] * [z] ]
$E'                     $             [ [x] + [ [y] * [z] ] ]
$                       $             [ [x] + [ [y] * [z] ] ]
.                       .             [ [x] + [ [y] * [z] ] ]

The main loop of the algorithm terminates when the parse stack goes empty. The algorithm then returns the AST on the semantic stack, which represents the input sequence. The brackets in the semantic stack show the grouping of parts in the AST. The result matches the precedence rules implicit in the grammar.

Notice how the algorithm uses the new matched-terminal variable. In this example, matching an identifier against an x, y, or z requires that the parser remember the actual identifier that was matched until it sees a make-identifier semantic action. This will also be true for other terminals with content, such as numeric literals.

We can think of a semantic action as operating on a pair: (semantic stack, matched terminal). Some semantic actions, such as make-identifier, ignore the semantic stack because they get the value they need from the matched-terminal variable. Other semantic actions, such as make-plus-exp and make-times-exp, ignore the matched-terminal variable because they get all the values they need off of the semantic stack.

This tells us something about how to implement semantic actions in our program... Next time!

Closing Exercise

We want the * operator to bind more tightly than the +. In the case of x + y * z, it did. But was that an accident? Maybe it worked properly because the leftmost operator became the root of the tree. Will the parser construct the correct AST for a program in which the operators are swapped?

We know from how the algorithm works that the make-* semantic action will be pushed onto the parse stack first. Will it get buried beneath a make-+ action that gets executed first?

There is one good way to find out: trace the algorithm! So...

Trace the new algorithm as it recognizes the token stream:

x * y + z

After working through this on your own... check out this trace. More next time.