CS 3540 Writing an Interpreter

More Thoughts on Building a Language Interpreter

This reading discusses some ideas that come up often when we discuss how to implement a language interpreter.

Processing Programs

Terminology

You began to write your own language interpreter on Homework 9, for a simple language for creating and manipulating numbers. An interpreter takes a program as input and produces as output the behavior of the program. At this point, our language is purely functional, so a program's behavior is its value.

But you did more than write just an interpreter. You wrote several other functions. Let's consider the full scope of what you have done.

We are now able to talk about multiple languages ...

the source language: Boom. This is the language that programmers use to write programs to solve their problems.
the implementation language: Racket. This is the language we use to implement the language processor (the interpreter, the compiler, the IDE, etc.).

... and multiple kinds of program:

a source program: a Boom program written to solve a problem.
the program processor: your Boom evaluator, including the preprocessor. It processes a source program written in Boom.

We have not yet written a read-eval-print loop (REPL) for Boom, but when we do, it will use your other program processors to implement a more complete interpreter. The stages of a REPL include:

read a source program
  - parse the program into some structure
  - preprocess the structure to eliminate syntactic abstractions
evaluate the program
print its result

This is a pipeline of sorts. Pipelines are the way we do functional programming in a Unix shell.
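Here is a minimal sketch of how those stages might fit together in Racket. The names preprocess, eval-exp, run, and repl are assumptions for illustration, and the toy language (numbers and <exp> + <exp>) is only a stand-in for Boom. The match form is used just to keep the stand-in evaluator short.

```racket
#lang racket

;; Hypothetical stand-ins so that the sketch runs on its own.
;; The toy language here has only numbers and (<exp> + <exp>).
(define (preprocess exp) exp)           ; nothing to desugar in the toy

(define (eval-exp exp)
  (match exp
    [(? number? n) n]
    [(list left '+ right) (+ (eval-exp left) (eval-exp right))]
    [_ (error 'eval-exp "illegal expression: ~a" exp)]))

;; One trip through the pipeline, minus the reading and printing.
(define (run exp)
  (eval-exp (preprocess exp)))

;; read -> preprocess -> evaluate -> print, until end of input.
;; Racket's read does the parsing into a list for us.
(define (repl)
  (display "boom> ")
  (let ([exp (read)])
    (unless (eof-object? exp)
      (displayln (run exp))
      (repl))))
```

Calling (repl) starts the loop. Keeping run separate from repl lets us test the pipeline without typing at a prompt.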

Parse

This stage converts a program from its concrete syntax to an abstract syntax. In most languages, even Racket, the internal representation of a program is different from the surface syntax we see in the code.

We do not need to write a parser. We use parenthesized expressions and values that Racket knows about, so we can rely on Racket's reader to read and parse an expression into a Racket list. That simplifies our job considerably.

However, our syntax procedures simulate abstract syntax. Our Boom-processing functions do not have to care about the concrete syntax of Boom, only the essential parts of the expressions. There should be no firsts or rests in your language processing code except in the syntax procedures themselves.
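As a sketch of what this looks like, suppose a binary expression is written (<exp> <op> <exp>). That form and all of the procedure names below are assumptions for illustration, not necessarily the ones from your homework. Only the syntax procedures know where the parts live in the list.

```racket
#lang racket

;; ASSUMPTION: a binary expression is written (<exp> <op> <exp>).
;; The names below are illustrative, not the ones from your homework.

(define (binary-exp? exp)                ; type predicate
  (and (list? exp)
       (= (length exp) 3)
       (symbol? (second exp))))

(define (make-binary-exp left op right)  ; constructor
  (list left op right))

(define (binary-exp->left exp)  (first exp))    ; accessors
(define (binary-exp->op exp)    (second exp))
(define (binary-exp->right exp) (third exp))

;; A client that never mentions first, second, or third.
(define (count-operators exp)
  (if (binary-exp? exp)
      (+ 1
         (count-operators (binary-exp->left exp))
         (count-operators (binary-exp->right exp)))
      0))
```

If we later moved the operator to the front of the list, only the accessors and the constructor would change. count-operators would not.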

Preprocess

This component takes an input program from the full language grammar and translates it into a program that uses only the core features of the language. Writing a preprocessor allows programmers to use syntactic sugar without cluttering the rest of the interpreter with syntactic abstractions.

We implement the preprocessor using structural recursion. The task is to translate expressions from one grammar into expressions from a smaller grammar. We have to handle all sub-expressions, so the preprocessor is recursive. If one or more of the cases is long or complicated on its own, we can write a helper function to do the work.
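A small sketch may make the shape concrete. It assumes a core language of numbers and (<exp> <op> <exp>), plus one piece of sugar, (sq <exp>), that stands for <exp> * <exp>. Both the sugar and the procedure names are hypothetical. Notice that every case with sub-expressions preprocesses them, too.

```racket
#lang racket

;; ASSUMPTIONS: the core language has numbers and (<exp> <op> <exp>);
;; the full language adds the sugar (sq <exp>), meaning <exp> * <exp>.
;; Both the sugar and the procedure names are hypothetical.

;; syntax procedures
(define (binary-exp? exp) (and (list? exp) (= (length exp) 3)))
(define (make-binary-exp left op right) (list left op right))
(define (binary-exp->left exp)  (first exp))
(define (binary-exp->op exp)    (second exp))
(define (binary-exp->right exp) (third exp))

(define (sq-exp? exp)
  (and (list? exp) (= (length exp) 2) (eq? (first exp) 'sq)))
(define (sq-exp->arg exp) (second exp))

;; preprocess : full-exp -> core-exp
(define (preprocess exp)
  (cond [(number? exp) exp]
        [(sq-exp? exp)
         ;; desugar the argument first, then build the core expression
         (let ([arg (preprocess (sq-exp->arg exp))])
           (make-binary-exp arg '* arg))]
        [(binary-exp? exp)
         (make-binary-exp (preprocess (binary-exp->left exp))
                          (binary-exp->op exp)
                          (preprocess (binary-exp->right exp)))]
        [else (error 'preprocess "illegal expression: ~a" exp)]))
```

For example, (preprocess '((sq 3) + 4)) produces the list ((3 * 3) + 4), which contains no sq expressions for the evaluator to worry about.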

Evaluate

We have several choices to make when writing the evaluator.

the structure of the evaluator: a familiar friend. Follow the BNF!
the decision whether to create types for numbers and other base types that our implementation language already has
the decision of how to implement the functionality of operators, under the same conditions

There are at least three ways to implement the behavior of Boom operators, with varying trade-offs.

embed the behavior directly into the eval-exp function
write a Racket function for each operator at the top level, and call these from the evaluator
write a Racket function for each in a local namespace, and call these from the evaluator

We should try to balance the forces of our situation, with a bit of an eye on the future. How closely do Racket semantics match Boom semantics? Will we want to add new operators? Will we want to change the implementation of existing operators? For example, we may want to add local variables to the language, or static data types, or functions, or ....
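To make one of these options concrete, here is a sketch of the second approach, with operator behavior defined at the top level and looked up by the evaluator. The operator names, the table, and the (<exp> <op> <exp>) form are assumptions for illustration. The match form stands in for syntax procedures only to keep the sketch short.

```racket
#lang racket

;; ASSUMPTION: binary expressions look like (<exp> <op> <exp>), and the
;; operator names here are only examples.

;; Each operator's behavior lives outside the evaluator.
(define operators
  (list (cons '+ +)
        (cons '- -)
        (cons '* *)
        (cons '% remainder)))   ; a source-language name need not match Racket's

(define (operator->procedure op)
  (let ([entry (assq op operators)])
    (if entry
        (cdr entry)
        (error 'eval-exp "unknown operator: ~a" op))))

(define (eval-exp exp)
  (match exp
    [(? number? n) n]
    [(list left op right)
     ((operator->procedure op) (eval-exp left) (eval-exp right))]
    [_ (error 'eval-exp "illegal expression: ~a" exp)]))
```

With this design, adding an operator means adding one entry to the table, and eval-exp does not change. The cost is a layer of indirection that the first option, embedding the behavior in eval-exp, does not have.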

We will be extending our language and its interpreter for the next two weeks. Show care in creating good code, not just code that scores homework points. You will be glad you did, both for future assignments and for the sense of accomplishment you will have when you have created a real program.

Next Steps

Homework 10 adds local variables to the language. To do this, we need an environment. Each new scope creates a new environment that extends (and shadows) the existing environment. This requires a change to the evaluator: the interface procedure must create an initial environment, and every call to the evaluator requires that we pass the environment to be used to evaluate the expression. (This may remind you of lexical-address and its initial variable table.)
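Here is one way the idea might be sketched. It assumes that an environment is an association list and that a local variable is introduced with (let (<var> <exp>) <exp>). Neither assumption is necessarily what Homework 10 specifies, and names such as eval-exp-with are made up for this example.

```racket
#lang racket

;; ASSUMPTIONS: an environment is an association list of (name . value)
;; pairs, and the local-variable form is (let (<var> <exp>) <exp>).
;; Neither is necessarily what Homework 10 specifies.

(define (make-initial-env) '())

(define (extend-env var val env)         ; new binding shadows old ones
  (cons (cons var val) env))

(define (lookup var env)
  (let ([binding (assq var env)])
    (if binding
        (cdr binding)
        (error 'lookup "unbound variable: ~a" var))))

;; interface procedure: creates the initial environment
(define (eval-exp exp)
  (eval-exp-with exp (make-initial-env)))

;; every recursive call passes along the environment to use
(define (eval-exp-with exp env)
  (match exp
    [(? number? n) n]
    [(? symbol? var) (lookup var env)]
    [(list 'let (list var val-exp) body)
     (eval-exp-with body
                    (extend-env var (eval-exp-with val-exp env) env))]
    [(list left '+ right)
     (+ (eval-exp-with left env) (eval-exp-with right env))]
    [_ (error 'eval-exp "illegal expression: ~a" exp)]))
```

In (eval-exp '(let (x 5) ((let (x 1) x) + x))), the inner x shadows the outer one only inside the inner let, so the value is 1 + 5, or 6. Because extend-env builds a new list rather than changing the old one, the outer environment is still intact when the inner scope ends.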

On Homework 11, we will use the evaluator's environment to add mutable data to Boom, which will allow us also to add sequences of statements. Time permitting in class, we will show how we can use the environment to add functions to the language. One idea can give us a lot of power.

Note. This is the end of the reading. What follows are two footnotes that were linked to above.

Footnote: Concrete Syntax

Throughout this course, we have been writing most of our programs in a particular style, which is sometimes called syntax-directed programming. This name comes from the idea that the BNF description of a data structure's syntax guides us in structuring our code. In syntax-directed programming, we associate a particular behavior with each part of a grammar.

This should sound familiar! It is the whole basis for the technique of structural recursion that we first learned about in Session 9. As you might guess, this style of programming is especially useful for implementing programs such as compilers and interpreters.

Up to now, we have focused our attention on concrete syntax — that is, the program form designed for the human programmers who use the language. Concrete syntax worries about the correct placement of semicolons and periods and parentheses, the exact format of the if statement, and so on.

For example, in Session 12 we began to work with a little language to which we later added if and let expressions. This little language has a concrete syntax based on Racket's, down to the parentheses. My decision to use a Racket-like concrete syntax was purely pragmatic: it allows us to use Racket's primitive procedures for lists when processing our expressions. You might have preferred a different concrete syntax.

While the issues involved with the language's concrete syntax are quite important to programmers when they write programs, the concrete syntax can get in the way when we are trying to understand the meaning of a program. For example, consider the form of the if statements in a few languages:

if <test>                         ;; Pascal
   then <consequent>              ;;   statements are
   else <alternate>               ;;   semicolon-separated

(if <test>                        ;; Racket
   <consequent>                   ;;   expression
   <alternate>)

if (<test>)                       ;; Java/C/C++
   <consequent>                   ;;   statements are
else                              ;;   semicolon-terminated
   <alternate>                    ;; 

<test> ifTrue:  [<alternative1>]  ;; Smalltalk
       ifFalse: [<alternative2>]  ;;   expression

Even though these statements are written in different ways, they are all the same at a deeper level — at the semantic level, in terms of what they mean. When we devote all of our attention to the concrete form of expressions, we can easily lose sight of their content. This is true when we are writing a program in the language, and it is also true when we are trying to implement a language processor.

As in any programming endeavor, we would like for our language processors to exhibit a separation of concerns. The part of our code that must know about the form of expressions should be distinct from the part that knows about the meanings of expressions. That way, we can change the concrete syntax of a language — even add syntactic abstractions! — without having to modify the interpreter.

Further, with proper separation of concerns, we can also change the semantics of a language without changing the parts of our programs that process concrete syntax. While this is much less likely to happen (why?), it does happen. Consider the addition of auto-boxing to Java 1.5....


Footnote: Abstract Syntax

Abstract syntax expresses the content of an expression in a language, minus the syntactic details of the concrete syntax. To create an abstract syntax from a concrete syntax, we associate a name with each arm of the concrete syntax definition and with each of the non-terminals. Each arm is called a production rule, because it defines a rule for producing a legal expression.

As an example, consider the little language that we have studied over the last few weeks. We might express the abstract syntax for this version of our lambda-based language as follows. The names to the right are the production names and the values in parentheses are the names of the non-terminals in the order of their appearance.

CONCRETE SYNTAX                          ABSTRACT SYNTAX
---------------                          ---------------
<exp> ::= <number>                       literal-exp  (datum)
        | <varref>                       variable-exp (var)
        | (lambda (<var>) <exp>)         lambda-exp   (formal body)
        | (<exp> <exp>)                  app-exp      (operator operand)

Notice that I have extended the grammar that we've used in the past to include numeric literals.

The most common method for representing the abstract syntax of an expression is the abstract syntax tree (AST), also sometimes called, inaccurately, a parse tree. (You can learn the difference — and much more — if you take the compiler course next semester.) The interior nodes of an AST are labeled with production names, and the leaves correspond to the terminals that make up the expression. For example, the abstract syntax tree for (lambda (f) (f (f 10))) is:

[Figure: the abstract syntax tree for (lambda (f) (f (f 10))), a tree with a 'lambda' node at the root]

ASTs are n-ary trees, because some expressions have more than 2 parts. For example, the expression (if f x y) has three elements in its abstract syntax: the variable expressions f, x, and y.
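One way to see such a tree is to build it. The sketch below uses Racket structs, one per production in the table above, with fields named after the abstract syntax. Using structs and a parse function is an assumption made for this example. In this course we simulate abstract syntax with syntax procedures over lists instead.

```racket
#lang racket

;; One struct per production; field names come from the abstract syntax.
;; #:transparent lets Racket compare and print the trees by content.
(struct literal-exp  (datum)            #:transparent)
(struct variable-exp (var)              #:transparent)
(struct lambda-exp   (formal body)      #:transparent)
(struct app-exp      (operator operand) #:transparent)

;; parse : concrete syntax (a Racket datum) -> abstract syntax tree
(define (parse exp)
  (match exp
    [(? number? n) (literal-exp n)]
    [(? symbol? v) (variable-exp v)]
    [(list 'lambda (list (? symbol? formal)) body)
     (lambda-exp formal (parse body))]
    [(list rator rand) (app-exp (parse rator) (parse rand))]
    [_ (error 'parse "illegal expression: ~a" exp)]))
```

Parsing '(lambda (f) (f (f 10))) gives a lambda-exp whose formal is f and whose body is an app-exp. That app-exp has the variable-exp f as its operator and another app-exp as its operand, which applies f to the literal-exp 10. The parentheses of the concrete syntax are gone, and only the production names and their parts remain.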
