Session 2
A Simple Compiler
Where Are We?
Last time, we considered a compiler first as a black box and then as a two-stage process of analysis and synthesis. We then decomposed the process into six generic stages and walked through each component in just enough detail to get a sense of what sort of processing each stage performs. That discussion described what a compiler does, but not how it works.
Today we turn our attention to how a compiler does its job and look at the complete implementation of a simple compiler. In preparation for our discussion, you read the README for the compiler, which included
- the grammar for the source language,
- sample source programs and instructions for generating target programs, and
- a high-level view of the program's architecture.
and then studied the code for the compiler itself. Be sure to ask questions as we discuss the compiler now.
In today's discussion we may see ideas that are new to you. Don't worry if you don't understand them all yet. The goal of today, as last time, is to give you a broad sense of the ideas and issues we will deal with throughout the course. Next time, we begin our deep dive into the first stage of the compiler.
A Simple Compiler
Introduction
This compiler was motivated by an article called "The Fastest Fizzbuzz in the West" (no longer on the web), in which the author defined a new language, Fizzbuzz, that he could use to solve the mythical interview question of the same name:
The idea is this: why keep writing Fizzbuzz over and over? We write a program in the new language:
1...15 fizz=3 buzz=5
We compile this program to target code that generates the desired output. But, with a new programming language, we can solve any variation of the problem in just a few lines of code. In the mythical interview, you can also demonstrate just how much computer science you know.
I decided to write a compiler for Fizzbuzz in standard Python. The result is a compiler that is simple enough to read in one sitting but complex enough that it can also be used to demonstrate more sophisticated techniques throughout the course. For today, I asked you to study this compiler. The starting point for your study was the README file.
The source language, Fizzbuzz
A Fizzbuzz program consists of two parts: an integer range and a sequence of word/number pairs. As a result, the language has only a few features:
- two data types, integers and words, and
- two operators, = and ...
Here is the grammar of the language, as defined in the README file:
program ::= range assignments
range ::= number ELLIPSIS number
assignments ::= assignment assignments
| ε
assignment ::= word EQUALS number
word ::= WORD
number ::= NUMBER
The ε indicates nothing, an empty string. NUMBER, WORD, EQUALS, and ELLIPSIS are all terminals that correspond to the items described informally above.
The target language, Python
The compiler generates standard Python code. For now, this shields us from having to learn details of a lower-level target language, but it also hides from us some of the beauty of a more complex code generator. Fortunately, the semantics of the Fizzbuzz language make it possible for us to explore lower-level concepts later in the course.
Going Deeper with the Compiler
The Fizzbuzz program to play the canonical Fizzbuzz game looks like this:
1...15 fizz=3 buzz=5
We compile it in this way:
$ python3 compiler.py programs/default.fb
... to produce this Python program:
for i in range(1, 16):
    output = ""
    if (i % 3) == 0:
        output += "fizz"
    if (i % 5) == 0:
        output += "buzz"
    if output == "":
        output += str(i)
    print(output)
We execute the compiled program directly like this:
$ python3 compiler.py programs/default.fb | python3
Or save the generated code to a file and run that file:
$ python3 compiler.py programs/default.fb > fizzbuzz.py
$ python3 fizzbuzz.py
One way to study Fizzbuzz's syntax in more detail is to use its grammar to derive the program defined in the source code:
program
range assignments
number ELLIPSIS number assignments
NUMBER ELLIPSIS number assignments
NUMBER ELLIPSIS NUMBER assignments
NUMBER ELLIPSIS NUMBER assignment assignments
NUMBER ELLIPSIS NUMBER word EQUALS number assignments
NUMBER ELLIPSIS NUMBER WORD EQUALS number assignments
NUMBER ELLIPSIS NUMBER WORD EQUALS NUMBER assignments
NUMBER ELLIPSIS NUMBER WORD EQUALS NUMBER assignment assignments
...
NUMBER ELLIPSIS NUMBER WORD EQUALS NUMBER WORD EQUALS NUMBER assignments
NUMBER ELLIPSIS NUMBER WORD EQUALS NUMBER WORD EQUALS NUMBER
For the rest of this session, let's turn our attention to the implementation of the compiler. You can follow along in the Python source code. This implementation is proof that a programmer can hack up a compiler for a simple language in relatively short order. It is not the best coding example, but it is a simple example to get us started.
Some things to note about the structure of the compiler:
- It consists of four stages, not six.
- There is no optimization stage. (What would that look like for Fizzbuzz?)
- There is no preparation for code generation. (Fizzbuzz is simple, and Python is a high-level target.)
- It uses two supporting data structures (tokens and AST nodes), with another embedded in the code (the symbol table).
- This compiler is parser-driven: it speaks only to the parser and never to the scanner. The parser asks the scanner for tokens as it needs them.
- The compiler contains a set of custom error classes, which give it
  - the ability to provide typed error messages, and
  - the ability to extend errors with more content and behavior.
- Finally, there is a decent set of unit tests, using PyUnit.
  - "It's all talk until the tests run." — Ward Cunningham, a progenitor of JUnit, design patterns, and wiki.
  - They are executable documentation.
The Token Class
The job of the scanner is to transform the stream of raw characters that make up a Fizzbuzz program into a sequence of tokens. We define a Token data structure to support the interface between the scanner and the parser.
Our compiler has unique tokens for each of the two operators, both kinds of value, and an end-of-file marker. The last is a convenience for processing the input downstream from the scanner.
A token is implemented as a pair, or more generally a tuple, consisting of:
- type — a flag (a named enumeration)
- value — one of three possible values (an integer, a string, or None)
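To make this concrete, here is a minimal sketch of such a token type in Python. The enumeration members follow the terminal names in the grammar; the actual identifiers in the compiler may differ.

from collections import namedtuple
from enum import Enum, auto

class TokenType(Enum):
    NUMBER   = auto()
    WORD     = auto()
    EQUALS   = auto()
    ELLIPSIS = auto()
    EOF      = auto()

# A token is a (type, value) pair.
Token = namedtuple('Token', ['type', 'value'])

Token(TokenType.NUMBER, 15)      # value is an integer
Token(TokenType.WORD, 'fizz')    # value is a string
Token(TokenType.EQUALS, None)    # operators carry no value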
Here are some thoughts I have about this implementation:
- using enumerations for simple named types
- blurring identity and value
- the filename...
The Scanner Class
An instance of this class turns numbers and individual characters into tokens.
Why bother? My parser would be almost as simple without the scanner, and the data being processed would be more compact. One reason is separation of concerns. If we want to broaden the concrete syntax of Fizzbuzz — say, to allow more kinds of characters in identifiers — we can modify the scanner and token classes without having to modify the parser, which deals with syntax and is complex enough on its own.
Another is pluggability. Next week, we will learn how to write a cleaner, sounder scanner. When we do, we can plug our new scanner into the Fizzbuzz compiler, and it will work.
The public interface of the Scanner class consists of two methods. nextToken() is the primary method by which the parser requests the next token. The scanner also allows the parser to peek() ahead to the next token without consuming it, as a convenience. This implementation uses a Token instance variable to hold the "peeked" token and return it when requested via a nextToken() message.

Why might the parser want to peek() ahead? We'll see soon!
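Here is a sketch of that buffering, using the method names from the prose; getNextToken() is the helper discussed next, so it is elided here.

class Scanner:
    def __init__(self, text):
        self.text = text
        self.position = 0
        self.peeked = None      # a Token read ahead of time, or None

    def peek(self):
        # Read the next token, but buffer it instead of consuming it.
        if self.peeked is None:
            self.peeked = self.getNextToken()
        return self.peeked

    def nextToken(self):
        # Return the buffered token if we peeked; otherwise read anew.
        if self.peeked is not None:
            token, self.peeked = self.peeked, None
            return token
        return self.getNextToken()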
The helper method getNextToken() is the function that actually reads a token. Some implementation notes:
- Some of the tokens are literals and can be recognized as one or three specific characters. The rest need to be built one character at a time until we reach a delimiter.
- We need is_whitespace(char) because the standard methods in Python accept other non-visible characters, such as Unicode whitespace, that are not allowed by Fizzbuzz's specification.
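The predicate can be as simple as this sketch:

def is_whitespace(char):
    # Accept only plain ASCII whitespace. Python's str.isspace()
    # also accepts Unicode whitespace, such as '\u00a0'.
    return char in ' \t\n\r'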
getNextToken() and its helpers are written in an "ad hoc" fashion, that is, without following any formal mechanism. But they were written in a principled way, following the pattern of the regular expression for Fizzbuzz tokens. For more complex grammars, we will want to use the regular expression in a more formal way. Consider even the relatively simple case of the + and ++ operators...
Another reason to write a scanner is that we can use it independently of the parser. To help us visualize the token stream in a Fizzbuzz program, I wrote a little program called print_tokens.py. It does nothing more than create a scanner, ask it to produce the tokens in a program, and print them. This program helps us see that our scanner is doing what we expect it to do!
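In spirit, print_tokens.py amounts to a sketch like this, reusing the Scanner and TokenType names from above:

import sys

def main():
    with open(sys.argv[1]) as source_file:
        scanner = Scanner(source_file.read())
    token = scanner.nextToken()
    while token.type != TokenType.EOF:
        print(token)
        token = scanner.nextToken()

if __name__ == '__main__':
    main()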
The AbstractSyntaxTree Class
The job of the parser is to determine whether a stream of tokens satisfies Fizzbuzz's grammar. In doing so, it transforms the token stream into an abstract syntax tree. We define an AST_Node data structure to implement the output of the parser.
If you have studied programming languages, you may recall that we will define one element of abstract syntax for each arm of the language grammar that has semantic content. For Fizzbuzz, this includes a program, a range, and an assignment.
Quick Question: I use a list of assignment nodes to represent an "assignments" node, rather than make a separate class for the collection. What trade-off does this make?
Each of these elements captures only the essential information in the corresponding part of the source code. For example, an Assignment_Node does not need to record the character = that separates the word and its number, because that is implicit in the element itself. It does need to record the word being declared and its associated number, because they vary from one Assignment_Node to the next.
Some implementation notes:
- The public AST_Node class serves as a generic type for all ASTs. It is an abstract superclass that allows our compiler to operate on any AST_Node without knowing its concrete class.
- There is one (hidden) class for each element type. These classes record the essential information about each semantic element in the program.
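A sketch of these classes, using the node names that appear later in these notes (the constructor arguments are my assumptions):

class AST_Node:
    """Abstract superclass for all abstract syntax tree nodes."""

class Program_Node(AST_Node):
    def __init__(self, range_node, assignments):
        self.range = range_node          # a Range_Node
        self.assignments = assignments   # a list of Assignment_Nodes

class Range_Node(AST_Node):
    def __init__(self, first, last):
        self.first = first               # lower bound of the loop
        self.last = last                 # upper bound, inclusive

class Assignment_Node(AST_Node):
    def __init__(self, word, number):
        self.word = word                 # e.g., "fizz"
        self.number = number             # e.g., 3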
The Parser Class
The parser transforms individual tokens into meaningful elements of the program, represented as a tree.
In general, we construct parsers guided by the language grammar. There are a number of techniques for doing this, each working for a particular set of grammars. Fizzbuzz's grammar is quite simple, but it is representative of the grammars for many languages that are more complex. The Fizzbuzz compiler uses a standard technique for processing grammars of this type, known as recursive descent.
A recursive-descent parser descends through the derivation tree that it recognizes as it examines tokens, making recursive calls to recognize sub-elements. We write one parsing procedure for each non-terminal in the language grammar.
For Fizzbuzz, this means that we will write a parsing procedure for a program, a range, a set of assignments, and so on. Each of these procedures recognizes a non-terminal of the corresponding type by determining whether one of the non-terminal's rules matches the sequence of tokens being processed.
To implement a parsing procedure in this way, we can write:
- a conditional statement to determine which grammar rule to apply, and
- code to recognize the elements on the right-hand side of the selected grammar rule. This can match terminals directly but must recur on non-terminals.
The only grammar symbol in Fizzbuzz that requires a conditional choice is assignments, which can match another assignment or nothing at all, signaled by end-of-file.
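Here is a sketch of the corresponding parsing procedures, as methods of the Parser class; match() is a hypothetical helper that consumes one expected token or raises a ParseError.

def parse_assignments(self):
    # assignments ::= assignment assignments | ε
    # Peek ahead: a WORD begins another assignment; anything else
    # (end-of-file, in a legal program) selects the ε rule.
    assignments = []
    while self.scanner.peek().type == TokenType.WORD:
        assignments.append(self.parse_assignment())
    return assignments

def parse_assignment(self):
    # assignment ::= word EQUALS number
    word = self.match(TokenType.WORD)
    self.match(TokenType.EQUALS)
    number = self.match(TokenType.NUMBER)
    return Assignment_Node(word.value, number.value)

def match(self, expected):
    # Consume the next token if it has the expected type;
    # otherwise signal a syntax error.
    token = self.scanner.nextToken()
    if token.type != expected:
        raise ParseError(f"expected {expected}, found {token.type}")
    return token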
Our example program above illustrates all of the key features of recursive descent, including peeking ahead to see that a list of assignments has ended.
The reason the parser peeks ahead is that, in recursive descent, a parser must never backtrack, because it doesn't remember where it came from or what state it was in. Whenever two rules might apply, it must be able to predict correctly which rule does apply. For this reason, the ε arms of the grammar can create a challenge.
Fizzbuzz's parser solves the ambiguity problem by looking ahead one token, using the scanner's peek() method. A compiler might look ahead more than one token, but that is extremely rare in practice. Other parsing techniques support backtracking, with the performance trade-offs you might expect.
If the parser encounters an illegal or otherwise unexpected token in the stream produced by the scanner, it generates a ParseError. More generally, a parser throws an exception whenever it cannot find a production rule in the grammar that matches its input.
My approach to building the parser was somewhat ad hoc, which works fine for a small, simple grammar. In this course, we learn several principled approaches for building parsers that scale better to larger, more complex languages.
Semantic Analysis
The analysis phase of a compiler ensures that the program conforms to the language's definition. The various stages of analysis approach this problem at different levels.
Scanning verifies the program at the lexical level, in terms of characters and meaningful "words". Parsing verifies the program at the syntactic level, in terms of the program structures defined in the grammar. Each of these steps works locally, examining only adjacent elements as specified by the token definitions and the grammar of the language.
Semantic analysis verifies any element of the language definition that applies across structures. We can think of it as a global validation of the program. Semantic analysis typically deals with issues of declarations, scope, types, and type-specific operations, including issues with operator overloading. These all deal with consistency among parts of the program.
For Fizzbuzz programs, semantic analysis must verify that
- the range makes sense,
- the assignments use legal numbers, and
- the words are unique.
The first two of these tasks are done by the TypeChecker itself, while the third is assisted by the creation of a symbol table.
The Type Checker
The TypeChecker examines each node in the AST to ensure that it makes sense in the context of the other nodes.
The Symbol Table
The symbol table is a Python dictionary. It allows us to find out whether a word is used in more than one assignment statement. In more complex compilers, like the one you will build this semester, later stages use the symbol table to do substantial work, in particular during register allocation and code generation.
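A sketch of how the checker might build the table and catch duplicate words; TypeCheckError stands in for one of the compiler's custom error classes.

def check_assignments(self, assignments):
    # Build the symbol table, insisting that each word appear only once.
    symbol_table = {}
    for assignment in assignments:
        if assignment.word in symbol_table:
            raise TypeCheckError(f"duplicate word '{assignment.word}'")
        symbol_table[assignment.word] = assignment.number
    return symbol_table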
The CodeGenerator Class
After the type checker ensures that the program satisfies the type rules of the language and builds a symbol table, the CodeGenerator is ready to transform the annotated abstract syntax tree into code in the target language.
In many compilers, actual code generation is preceded by a step that allocates registers to specific variables and expressions and creates a memory map for program values. However, this compiler produces Python target code, which allows the code generator to work at a much higher level.
The public interface of a code generator consists of a single method, generate(ast, symtable). This method generates code in three parts:
- a prologue that produces the control line of a for statement,
- a "middle" that produces the various cases of the Fizzbuzz printer, and
- an epilogue that produces the print statement that finally generates Fizzbuzz's output.
The last of these is simple enough to do in place; the first two are implemented in the helper methods generate_prolog() and generate_cases().
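To make the three parts concrete, here is a sketch of the two helpers, with template strings patterned after the compiled program shown earlier; the compiler's actual templates may differ.

def generate_prolog(self, range_node):
    # Fizzbuzz ranges are inclusive, so add 1 for Python's range().
    return f"for i in range({range_node.first}, {range_node.last + 1}):"

def generate_cases(self, assignments):
    # One if statement per word/number pair, in source order.
    lines = ['    output = ""']
    for a in assignments:
        lines.append(f"    if (i % {a.number}) == 0:")
        lines.append(f'        output += "{a.word}"')
    return "\n".join(lines)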
Some implementation notes:
- generate_cases() generates the most lines of code, as it creates an if statement and iterates over the list of assignments to create its cases.
- The various generate_* functions use templates to create the target code they are responsible for. For a more complex target language, we might want a more sophisticated way to select and instantiate the code templates, or perhaps not use templates at all.
- In a more object-oriented implementation, the classes for our tree nodes, Program_Node, Range_Node, and Assignment_Node, might have generate() methods to generate code for themselves. This would decentralize code generation to the point that our code generator class would serve only as a bridge between the compiler and the objects that generate the code.
Wrap Up
There it is: an entire compiler, from beginning to end, in one session. It is simpler in almost every way than a compiler for a more complete source or target language, yet it is complex enough to reward real study. Please ask any questions about what this compiler does, how it does it, or why.
This concludes our introduction to the course. We are ready to move on to consider each of the stages of a compiler in much more detail — and how to write one. Next session, we begin at the beginning, with lexical analysis.
Note that Homework 1 will be available soon. A quick note about submission...
- I ask for two "plain text" files.
- The assignment will be due at 11:59 PM.
- Try to submit early, and feel free to re-submit at any time. The system will overwrite an existing file with the new one.
We will talk more about the project next week, including project teams. A quick note on implementation languages and on working with a team:
- teammates (confidence)
- commitment
- investment
- working together: meetings, (a)synchronous, pairing