Session 7
Introduction to Syntactic Analysis

Opening Exercise: Finding Tokens

We have been talking about scanning, and today we talk about parsing. So perhaps you won't mind scanning and parsing a little Java for me.

What is the output of this snippet of code?

How many tokens does the Java scanner produce for Line 6? Line 8?

01    int price = 75;
02    int discount = -25;
03    System.out.println( "Price is " + price + discount + " dollars" );
04    System.out.println( "Price is " + price - discount + " dollars" );
05    System.out.println( "Price is " + price + - + discount + " dollars" );
06    System.out.println( "Price is " + price + - - discount + " dollars" );
07    System.out.println( "Price is " + price+-+discount + " dollars" );
08    System.out.println( "Price is " + price+--discount + " dollars" );

If an expression causes an exception, write ERROR for that line and assume that the rest of the code executes normally.

Let's compile the code and run it:

$ javac StringCatTest.java
StringCatTest.java:8: error: bad operand types for binary operator '-'
    System.out.println( "Price is " + price - discount + " dollars" );
                                            ^
  first type:  String
  second type: int
1 error

Ah, an error on Line 04 of the code. Comment it out, compile, and run:

03 | Price is 75-25 dollars
04 | ERROR
05 | Price is 7525 dollars
06 | Price is 75-25 dollars
07 | Price is 7525 dollars
08 | Price is 75-26 dollars

The first thing to recall is that Java's binary + is overloaded: it performs both numeric addition and string concatenation. When the left operand to + is a string, Java's parser chooses the string concatenation operator. (Well, that's close enough for now.)

The second thing to recall is that both + and - are also unary operators in Java. In Lines 5-7, these operators apply to the value of discount, and the resulting values are printed.

From our discussion of scanning, you should understand why the + that follows price in Line 7 is treated as an operator. Like Klein operators, Java operators are self-delimiting, so the scanner does not require that they be surrounded by whitespace.

The scanner knows only that the + is an operator. How does the parser know to treat it as a binary operator and not a unary one? This unit will help us understand how.

The third thing to recall, from as recently as our last session, should help you figure out why Line 8 generates a different result than Line 6... The scanner matches the longest token possible, and -- is also a unary prefix operator in Java!
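
To make the idea concrete, here is a small sketch of longest-match scanning in Python. It is not Java's scanner, just an illustration of the maximal-munch rule for identifiers and the +, -, and -- operators.

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            tokens.append(('IDENTIFIER', text[i:j]))
            i = j
        elif text.startswith('--', i):           # longest match: try -- before -
            tokens.append(('DECREMENT', '--'))
            i += 2
        elif ch in '+-':
            tokens.append(('OPERATOR', ch))
            i += 1
        else:
            raise ValueError(f'unexpected character {ch!r}')
    return tokens

print(tokenize('price+--discount'))      # ... OPERATOR +, DECREMENT --, IDENTIFIER discount
print(tokenize('price + - - discount'))  # ... OPERATOR +, OPERATOR -, OPERATOR -, IDENTIFIER discount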

What happens if we parenthesize the price op(s) discount expressions? Arithmetic ensues.

Do we face any issue like this in Klein? We do. While Klein does not have + or -- as unary prefix operators, it does have a unary prefix - to go with the usual binary operators + and -. That means that these are legal Klein expressions, with or without whitespace:

    x + - y
    x - - y
  - x + - y
  - x - - y
- - x - - y

but not x - + y. I just gave you a few test cases for Project 2! Some of these cases are ripe for simple optimizations, too...
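
For instance, a later stage of the compiler could rewrite - - x as x. Here is a minimal sketch of that idea in Python, assuming a hypothetical Negate AST node; your Klein compiler will use its own class names.

class Negate:
    def __init__(self, operand):
        self.operand = operand

def simplify(node):
    # Rewrite - - e as e, recursively, and leave every other node alone.
    if isinstance(node, Negate) and isinstance(node.operand, Negate):
        return simplify(node.operand.operand)
    return node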

Feel free to play around with the original Java program and its parenthesized twin. They are in today's code zip file.

Ironically, these programs grew out of a discussion Prof. Schafer and I had many years ago, after one of his Intro to Computing students asked a question in class — the day before we began our discussion of parsing in this course. The timing was perfect then and perfect now, given that you are writing a scanner and asking questions about negative numbers and expressions without whitespace in Klein. Of course, the CS1 students probably did not appreciate the subtlety of the lexical issues as well as you all do.

The Four Levels of Success

As you write your scanners, please keep in mind that there are many levels of success in programming. Here are four levels of increasing success I will consider as I review your code:

  1. does not compile (or loads with errors)
  2. compiles (or loads silently), but tests break
  3. tests run but fail
  4. tests pass

In an ideal world, all of the tests pass. This can be a challenge when we are writing a large, complex program. We might not even have thought of all the cases we need to test. So we work to make our program a little better each day.

However, we should under all circumstances strive to produce code that compiles, or loads silently, and runs the tests to completion. If our code does not compile, then it has lexical, syntactic, or semantic errors in it. If the tests break, then our code does not satisfy the minimal specification of the task.

When we first learn to program, Level 2 feels like success. However, reaching Level 3 is, as a matter of professionalism, the minimum bar. It shows that we care about our code. Reaching Level 4 is how we succeed.

Module 1: Your Scanner for Klein

How are things going? It is possible! Your state machines will show you the way. Testing your code will help you find bugs that sneak in when a state machine is missing something important. When I wrote my scanner, some tests reminded me that EOF delimits a token, too — but it is unlike other delimiters, which may themselves need to become part of the next token requested.

To run your scanner, I will cd into your project directory, follow your instructions to build the compiler, and then type kleins full_path_to_a_klein_program. Be sure that this process works before you submit your final version of Module 1.

$ kleins euclid.kln

That output is different from the output your scanner will produce; more on that soon.

A couple of notes on your scanner:

What happens if we run the scanner on a program that contains a lexical error?

$ kleins error_case.kln

The client program catches the error and fails gracefully. Your kleins should, too.
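
Here is one way a driver might arrange that, sketched in Python. The names Scanner and LexicalError are hypothetical stand-ins for whatever your compiler defines.

import sys

def main(filename):
    try:
        for token in Scanner(filename):       # assumes the scanner is iterable
            print(token)
    except LexicalError as error:
        print(f'Lexical error: {error}', file=sys.stderr)
        sys.exit(1)                           # fail gracefully, with a useful message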

Token Frequency Analysis in Klein

So, I created this thing...

While implementing my scanner, I decided to write a more interesting client program to exercise it. I was curious about which tokens occur most often in Klein programs, and what the relative counts were. If I were inclined, I could use this information to improve my scanner.

So I wrote token_frequencies.py. Its tally_from(scanner, count_of) function takes two arguments, a scanner on a source file and a dictionary of token-type counts, and updates the dictionary based on the tokens in the source file. The rest of the program is machinery to accept one or more Klein source filenames on the command line and to count the tokens in all of those files.
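
Here is a sketch of how such a function might be written, assuming a scanner with a has_more_tokens() query and a next_token() method, and tokens that carry a token_type attribute; those names are my assumptions, not the actual code.

def tally_from(scanner, count_of):
    # Count every token in one source file, updating the shared dictionary.
    while scanner.has_more_tokens():
        token = scanner.next_token()
        count_of[token.token_type] = count_of.get(token.token_type, 0) + 1
    return count_of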

When I run my token analysis program on the current set of files in the Klein collection, I get:

$ python3 token_frequencies.py klein-programs/* | sort -r
1614 TokenType.IDENTIFIER
 545 TokenType.TYPENAME
 545 TokenType.COLON
 514 TokenType.COMMA
 512 TokenType.RIGHT_PAREN
 512 TokenType.LEFT_PAREN
 286 TokenType.INTEGER
 183 TokenType.FUNCTION
  96 TokenType.THEN
  96 TokenType.IF
  96 TokenType.ELSE
  63 TokenType.PLUS
  60 TokenType.DIVIDE
  57 TokenType.TIMES
  54 TokenType.MINUS
  53 TokenType.LESS_THAN
  52 TokenType.EQUALS
  18 TokenType.BOOLEAN
  14 TokenType.OR
  12 TokenType.AND
   8 TokenType.NOT

We notice some good things here: the left and right parens balance, and every if has a matching else. So I haven't made those obvious lexical errors in my Klein programs. Some of the data — say, the low number of boolean operators or the roughly equal numbers of arithmetic operators — may be a result of my style, though some of the programs were produced largely by former students.

How might we use this information to make a smarter or more efficient scanner?

The Context

We now turn our attention to syntactic analysis. Recall the context:

[Figure: a block diagram of a compiler, with three stages of analysis (scanner, parser, semantic analyzer) pointing to the right, an arrow pointing down, and three stages (optimizer, code generation prep, and code generator) pointing back to the right. All stages but syntactic analysis are grayed out.]

The scanner provides a public interface that allows clients to consume the next token and to peek at the next token without consuming it, which is just what a parser needs as it recognizes an abstract syntax expression. Keep in mind: The parser never sees the text of the program, only the sequence of tokens produced by the scanner.
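
In Python, such an interface might look something like this minimal sketch; the method names peek_token() and next_token() are illustrative, not a prescribed API.

class TokenStream:
    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._pos = 0

    def peek_token(self):
        # Look at the next token without consuming it.
        return self._tokens[self._pos]

    def next_token(self):
        # Consume the next token and return it.
        token = self._tokens[self._pos]
        self._pos += 1
        return token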

Regular Languages Are Not Enough to Model Syntax

In our study of lexical analysis, we used regular languages as a theoretical model for the "concrete syntax" of a programming language, its tokens. Regular languages are the subset of context-free languages that do not allow recursion. A regular language can be defined using only repetition of a fixed structure. That suffices for concrete syntax, which has a relatively simple structure.

But the abstract syntax of programming languages requires the full power of a context-free language. Recursion is essential. Consider even this grammar for simple arithmetic expressions:

expression := identifier
            | expression + expression

We cannot use substitution to eliminate the recursion, because each substitution of expression in the + arm creates a longer sentence — with more occurrences of the symbol that is being replaced. Repetition is not enough, because this is not a regular language.
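
For example, substituting the + arm for expression never removes the recursion; it only multiplies it:

expression => expression + expression                            using the + arm
           => expression + expression + expression               substituting again
           => expression + expression + expression + expression  and again...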

Or consider this little grammar:

expression := identifier
            | ( expression )

How can a DFA that accepts this grammar know that an expression has the same number of ) on the right as it has ( on the left? We can hardcode any specific number n by adding 2n+1 states to the DFA, but this grammar allows any number of parentheses. No finite state machine can be enough.

We saw in our study of lexical analysis that the set of languages recognized by a finite state machine is equivalent to the set of regular languages. So it shouldn't surprise us that, if a regular expression is insufficient for modeling abstract syntax, then a DFA is insufficient for implementing a recognizer for abstract syntax, too.

Recursion is the strength of the context-free languages. Can they serve as our theoretical model for abstract syntax, or do we need more power?

Context-Free Grammars Are Not Enough to Model Syntax, But...

BNF notation is the standard notation for describing languages, and the set of languages describable using BNF is equivalent to the set of context-free languages.

What does context-free mean here? The context in which a symbol appears does not affect how it can be expanded. In our first grammar above, each occurrence of expression in the second production rule can be replaced by the right hand side of any production rule for expression, regardless of which symbols come before or after it in the containing expression.

In our second grammar, each occurrence of expression in the second rule can be replaced with another parenthesized expression (or with an identifier), no matter how many parentheses already surround it.

This freedom makes context-free grammars a tempting solution for describing the syntax of a programming language. However, there are certain constraints we want in our languages that cannot be expressed in a context-free way. Here are two:

  1. an identifier may be used only if it has been declared
  2. a function must be called with the same number of arguments as its definition specifies

These rules are sensitive to the context in which the constrained token appears. For a call to a function with two arguments to be legal, the function must be declared elsewhere to accept two arguments. A grammar that captured such a constraint would have to have rules with more complicated left hand sides. Consider:

 S := sAt
    | xAy

sA := Sa
    | b

Ay := SaS
    | b

The left hand side of the second rule allows it to replace an A only when that A is preceded by an s. The third rule applies only when an A is followed by a y.

In a context-sensitive language, the left hand side of a rule can specify context information that restricts a substitution. This enables us to express more elements of a programming language than a context-free grammar can, including the practical constraints listed above.

Unfortunately, processing a context-sensitive grammar is considerably more complex and considerably less efficient than processing a context-free grammar. When writing compilers, these costs are not worth the benefits, because context-sensitive rules are needed for only a few isolated elements of programming language syntax.

So: we will use a two-part strategy to recognize programs at the syntactic level:

  1. use a context-free grammar, and a parser built from it, to check the structure of the token stream
  2. enforce the remaining context-sensitive constraints later, during semantic analysis

This means that, strictly speaking, the set of programs accepted by a context-free parser is a superset of the set of valid programs. An invalid sequence of tokens can pass the parser successfully.
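
Python offers a familiar analogy: its parser accepts a call with the wrong number of arguments, and the mistake is caught only later.

def double(n):
    return n + n

double(1, 2)    # no SyntaxError; a TypeError is raised only when the call runs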

Building a Machine that Recognizes Programs

[Image: xkcd No. 754, "Dependencies"]

When we move from describing lexical information to describing syntactic information, we generalize from regular languages to context-free languages.

Likewise, when we move from recognizing tokens to recognizing sentences, we generalize from a finite-state machine to a set of finite state machines. Each machine can call the others — and itself. That's recursion!

In order for this process to work, the state of the calling machine must be resumed when the called machine terminates. To do this, the parser must save the state of the calling machine. The machine calls are nested, which means that a stack can do the trick.

Just as regular languages find their counterpart in finite state machines, so, too, do context-free languages find their counterpart in pushdown automata. Because a pushdown automaton provides an arbitrarily deep stack, it is able to handle nested expressions of arbitrary depth.

Like FSMs, pushdown automata are understood so well that the theory of computation can help us write programs that recognize context-free languages. We call these programs parsers. The same theory that helps us write parsers also enables us to generate parsers automatically from grammars.
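
Here is a minimal sketch of that idea in Python, using the little parenthesized-expression grammar from above. Each non-terminal gets its own recognizer, and the language's call stack plays the role of the pushdown automaton's stack. The token spellings here are simplified for illustration.

def parse_expression(tokens, pos=0):
    # Recognize  expression := identifier | ( expression ).
    # Return the position just past the expression, or raise ValueError.
    if pos >= len(tokens):
        raise ValueError('unexpected end of input')
    if tokens[pos] == 'identifier':
        return pos + 1
    if tokens[pos] == '(':
        after = parse_expression(tokens, pos + 1)    # the machine calls itself
        if after < len(tokens) and tokens[after] == ')':
            return after + 1
        raise ValueError('expected )')
    raise ValueError('unexpected token ' + repr(tokens[pos]))

def accepts(tokens):
    try:
        return parse_expression(tokens) == len(tokens)
    except ValueError:
        return False

print(accepts(['(', '(', 'identifier', ')', ')']))    # True
print(accepts(['(', 'identifier', ')', ')']))         # False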

Deriving a Program Using a Grammar

We use BNF to write a context-free grammar for a language. The terminals of the grammar are tokens, and the non-terminals are the abstract components of programs.

You are working closely with such a context-free grammar as you build your compiler, that of the Klein programming language. As another example, take a look at the grammar for Python 3. Notice the format of this grammar: it can be read by a Python program! (You might consider creating a "readable grammar" for Klein...)
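
Here is a sketch of what a machine-readable grammar and its reader might look like in Python. The file format, one lhs := rhs rule per line with alternatives separated by |, is an assumption for illustration, not a required format.

def read_grammar(filename):
    # Return a dict mapping each non-terminal to a list of alternatives,
    # where each alternative is a list of symbols.
    grammar = {}
    with open(filename) as grammar_file:
        for line in grammar_file:
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            lhs, rhs = line.split(':=')
            alternatives = [alt.split() for alt in rhs.split('|')]
            grammar.setdefault(lhs.strip(), []).extend(alternatives)
    return grammar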

There are several ways to think about how a grammar defines the sentences in its language. One particularly useful way is derivation, in which we construct a sequence of BNF rule applications that leads to a valid sentence. Derivation is closely related to the substitution model used to evaluate expressions in functional languages such as Racket.

Derivation treats the grammar's production rules as rules for rewriting an expression in an equivalent form. Consider this more complete arithmetic expression grammar:

1  expression := identifier
2              | expression operator expression
3              | ( expression )
4              | - expression

5  operator   := + | - | * | / | ^

We can use this grammar to derive the sentence -(identifier + identifier) in this way:

expression => -expression                           using rule 4
           => -(expression)                         using rule 3
           => -(expression operator expression)     using rule 2
           => -(identifier operator expression)     using rule 1
           => -(identifier + expression)            using rule 5a
           => -(identifier + identifier)            using rule 1
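
We can even mechanize the process. This little Python sketch performs derivations with the grammar above; the dictionary encoding and the depth cutoff, which forces the identifier rule once an expression gets deep enough, are my own choices for illustration.

import random

GRAMMAR = {
    'expression': [
        ['identifier'],                                # rule 1
        ['expression', 'operator', 'expression'],      # rule 2
        ['(', 'expression', ')'],                      # rule 3
        ['-', 'expression'],                           # rule 4
    ],
    'operator': [['+'], ['-'], ['*'], ['/'], ['^']],   # rule 5
}

def derive(symbol='expression', depth=0, max_depth=4):
    # Expand non-terminals recursively until only terminals remain.
    if symbol not in GRAMMAR:
        return [symbol]
    if symbol == 'expression' and depth >= max_depth:
        rule = GRAMMAR[symbol][0]                      # force termination
    else:
        rule = random.choice(GRAMMAR[symbol])
    sentence = []
    for s in rule:                                     # leftmost symbol first
        sentence.extend(derive(s, depth + 1, max_depth))
    return sentence

print(' '.join(derive()))    # e.g.  - ( identifier + identifier )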

The theory underlying context-free grammars and derivation is understood well enough that we can do induction proofs to show that:

  1. every sentence we can derive from the grammar's start symbol is in the language the grammar defines
  2. every sentence in the language has a derivation from the grammar's start symbol

We will use these ideas to drive our construction of parsers. They will use derivation of a sentence as a model.

At each point in a derivation, we generally face two choices:

  1. which non-terminal to replace
  2. which rule to use when doing so

In the derivation of -(identifier + identifier) above, I always chose the leftmost non-terminal as the symbol to replace. The result is called a left derivation.

Similarly, I could have generated a right derivation by always choosing the rightmost non-terminal to replace:

expression => -expression                          using rule 4
           => -(expression)                        using rule 3
           => -(expression operator expression)    using rule 2
           => -(expression operator identifier)    using rule 1
           => -(expression + identifier)           using rule 5a
           => -(identifier + identifier)           using rule 1

Does it matter which of these selection rules we use, or if we use some other rule? Let's see.

Parse Trees

We can use derivation to generate a tree that shows the structure of the derived sentence. This is called a parse tree. The parse tree for -(identifier + identifier) looks like this:

[Figure: a parse tree of -(identifier + identifier)]

Parse trees are useful as intermediate data structures in syntax analysis and also as the motivation for parsing algorithms.

Quick Exercise: Draw the parse tree for a left derivation of this sentence:

identifier + identifier * identifier

Here's a solution:

                                    expression
                                    /    |     \
                          expression operator expression
                              |          |   /    |     \
                          identifier     +  /     |      \
                                            /      |       \
                                    expression operator expression
                                        |         |         |
                                    identifier    *     identifier

I created it by expanding the left expression using Rule 1 of the grammar.

expression => expression operator expression       using rule 2
           => identifier operator expression       using rule 1
           => identifier + expression              using rule 5a
           ...

But wait... Here is a parse tree for another left derivation of the same sentence:

                                    expression
                                    /    |     \
                          expression operator expression
                          /    |     \   |         |
                        /     |      \  *     identifier
                        /      |       \
                expression operator expression
                    |         |         |
                identifier    +     identifier

I created this one by expanding the left expression using Rule 2 of the grammar, instead of Rule 1.

expression => exp op exp                           using rule 2
           => exp op exp op exp                    using rule 2
           => identifier op exp op exp             using rule 1
           => identifier + exp op exp              using rule 5a
           ...

Recall that, at each point in a derivation, we may have two choices to make: which non-terminal to replace and which rule to use. The left/right derivation process fixes the choice of non-terminal, but we still have to choose a rule to use. Sometimes, more than one rule applies.

We now see that the choice of rule does matter. Different rules can produce different parse trees. We can also see the value of using a tree to record a derivation: it demonstrates the structure of a derivation more readily than a text derivation.

When the same sentence has two different but valid parse trees, we say that the grammar is ambiguous. Theoretically, this sort of ambiguity may not matter much, but practically it matters a lot.

In our next session, we will consider why ambiguity matters and then look at ways to eliminate ambiguity from a grammar before building a parser.