Session 16
Semantic Analysis
Opening Exercise
Here is a Klein function, used in two "excellent number" programs [ 1 | 2 ], that occasionally comes in handy:
function length(n: integer): integer
if (n < 10)
then 1
else 1 + length(n / 10)
It is syntactically and semantically correct. The parser tells us that it is syntactically correct, but last time we saw four different ways that a function like this can be wrong:
- gives the wrong type of value to a primitive operator
- passes the wrong type of argument to a function
- refers to a variable that doesn't exist
- returns a value of the wrong type
If you can't think of four, consider issues that can arise when there are multiple formal parameters or when a file contains multiple functions.
Each of the things on your list points to a test case: we can change one token or one grammatical unit of the program and have a new program that is semantically incorrect but still passes the parser.
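To make this concrete, each item can become an automated test: mutate one token of a known-good program and assert that the semantic analyzer rejects the result. Here is a minimal sketch in Python, assuming a hypothetical analyze(source) entry point that returns a list of semantic errors; your compiler's actual interface will differ.

from my_compiler import analyze   # hypothetical entry point

CORRECT = """
function length(n: integer): integer
  if (n < 10)
  then 1
  else 1 + length(n / 10)
"""

# One-token mutations that still parse but are semantically wrong.
UNDEFINED_VARIABLE = CORRECT.replace("length(n / 10)", "length(m / 10)")
BAD_OPERAND_TYPE = CORRECT.replace("n < 10", "n < true")

def test_correct_program_has_no_errors():
    assert analyze(CORRECT) == []

def test_undefined_variable_is_reported():
    assert analyze(UNDEFINED_VARIABLE) != []

def test_bad_operand_type_is_reported():
    assert analyze(BAD_OPERAND_TYPE) != []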
Semantic Features to Check
Including the four kinds of error we saw last time, here are a few errors that we could find in a function definition like this one:
- a function returns the wrong type of value
- an expression uses a variable that doesn't exist
- an expression gives the wrong type of value to an operator
- a function call passes the wrong types of arguments
- a function call passes the wrong number of arguments
- the condition on an if expression is not a boolean
- the then and else clauses of an if expression have different types (?)
(We will consider the last item again later.)
If we step up to functions with multiple parameters and programs with multiple functions, we also have:
- two or more formal parameters on the same function have the same name
- an expression calls a function that does not exist
- two or more functions have the same name
- there is no function named main
- there is a function named print
If we loosen our definition of "wrong", there can be even more:
- presence of a variable that is never used
- presence of a function that is never used
- presence of an unnecessary function
- presence of a code path that can never execute
- presence of code that never terminates
(That last one can be hard to recognize...)
Where Are We?
The parser can guarantee a syntactically correct program, but it cannot guarantee a semantically correct one. You may recall at least two reasons for this from our discussion of syntactic analysis. First, programs are context-sensitive objects, but we use a context-free grammar to model the language. Second, we occasionally leave even some context-free properties out of the grammar in order to simplify the parsing process.
So: we must check the abstract syntax tree produced by the parser to verify that it is semantically correct. We refer to this step as semantic analysis. The same sort of analysis can also add other kinds of value by helping the programmer to make the program better.
Last time, we briefly considered the task of semantic analysis. We saw that this stage of the compiler has two primary goals:
- to ensure semantic correctness, and
- to prepare the abstract syntax tree for code generation.
and can provide a third service:
- to inform the programmer of things that are legal but perhaps unintended.
We then took a quick look at some of the issues involved in semantic analysis to ensure correctness.
Today, we will consider briefly the other kinds of semantic analysis that a compiler can do and return our attention to ensuring correctness, specifically checking type correctness.
As we discuss semantic analysis, you may want occasionally to ask yourself, "How many of these features does Klein have?" The answer will give you an idea of what your semantic analyzer for Klein will need to be able to do.
Beyond Program Correctness
Semantic analysis can achieve more than simply ensuring that a program is semantically correct.
Semantic Analysis to Help the Code Generator
The second goal of semantic analysis is pragmatic: to prepare the abstract syntax tree for code generation. To satisfy this goal, the semantic analysis phase usually produces two kinds of output:
- It adds information to the AST that makes it easier to optimize the program and generate target code. A common annotation to the AST is the addition of type information about identifiers and expressions to the corresponding nodes in the tree.
- It produces other artifacts that can support the rest of the compiler. One such artifact is a symbol table that records information about the identifiers used throughout the program.
This analysis is not necessary in order to ensure that the program satisfies the language specification, but it can make later stages of the compiler more effective.
For example:
- If the semantic analyzer can determine that a value will be an integer rather than a floating-point number, then the code generator can select more efficient assembly language instructions.
- If the semantic analyzer can determine that a value is constant rather than variable, then the code generator can store the value in a register and reuse it.
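To make the two kinds of output concrete, here is a minimal sketch in Python of an annotated AST node and a symbol table. The names and representations here are illustrative assumptions, not Klein's actual implementation.

from dataclasses import dataclass
from typing import Optional

@dataclass
class BinaryOp:
    """An AST node for an expression such as '1 + length(n / 10)'."""
    op: str
    left: object
    right: object
    type: Optional[str] = None   # filled in by the semantic analyzer

# Before analysis, the parser knows only the structure...
node = BinaryOp("+", 1, 2)

# ...after analysis, the node also carries its type, so the code
# generator can pick integer instructions without re-deriving them.
node.type = "integer"

# A symbol table is the other common artifact: a map from each
# identifier to what the analyzer has learned about it.
symbol_table = {
    "length": {"kind": "function", "params": ["integer"], "returns": "integer"},
    "n":      {"kind": "parameter", "type": "integer"},
}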
Semantic Analysis to Help Programmers
In addition to these two primary goals, semantic analysis can help programmers in other ways. Consider this Klein function:
function MOD(m: integer, n: integer): integer m - m*(m/m)
This program passes the parser and a type checker, but is it
semantically correct? Perhaps the programmer intended to use
the n but made a mistake. Perhaps the
programmer would like to delete the argument's second
function. Or perhaps this is exactly what the programmer
intended! In any case, a semantic analyzer can recognize
anomalous code and let the programmer know about it.
Semantic analysis can check features that are desirable but not strictly necessary to a program's correct execution. For example, it might:
- identify "dead code", which can never be reached
- identify variables that are never used
- point out more idiomatic usage, such as the use of i++ instead of i = i + 1
The last of these gives rise to an entire class of tools: static analyzers that check style, portability, and idiom. The first and best known program of this sort is Lint, which was created at Bell Labs in the 1970s to flag "suspicious" C code and report potential portability problems. Lately, I have been applying the Python linter pycodestyle to most of my new Python programs in an effort to learn standard Python style (and to break my mind out of stylistic blinders).
Tools such as Lint and pycodestyle can be built for any language, even Klein. They can identify bad style and other kinds of non-standard code. This sort of semantic analysis can be built right into a compiler, but it is often built into editors and IDEs or done by stand-alone tools.
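As a sketch of how such a check might be implemented, here is a tiny lint-style pass over a simplified, hypothetical AST that reports formal parameters a function never uses, which is exactly the anomaly in the MOD example above.

from dataclasses import dataclass

@dataclass
class Var:
    name: str

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

@dataclass
class Function:
    name: str
    params: list    # formal parameter names
    body: object    # an expression tree of Var and BinaryOp nodes

def used_names(node, acc):
    """Collect every identifier referenced in an expression tree."""
    if isinstance(node, Var):
        acc.add(node.name)
    elif isinstance(node, BinaryOp):
        used_names(node.left, acc)
        used_names(node.right, acc)
    return acc

def unused_params(fn):
    """Report formal parameters that the body never mentions."""
    used = used_names(fn.body, set())
    return [p for p in fn.params if p not in used]

# The MOD function above: n is declared but never used.
mod = Function("MOD", ["m", "n"],
               BinaryOp("-", Var("m"),
                        BinaryOp("*", Var("m"),
                                 BinaryOp("/", Var("m"), Var("m")))))
print(unused_params(mod))   # ['n']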
A Moment in History
Speaking of Lint: October 8 was the anniversary of Dennis Ritchie's death. Ritchie created the C programming language and wrote the first C compiler. As we have discussed a few times, C is the foundation for much of the work in the world of compilers, if only because most compilers are initially written in, or compile to, C. In addition, C and Unix (which Ritchie co-created with his lab partner, Ken Thompson) set the stage for open systems and portable software. Lint was written by one of Ritchie's colleagues at Bell Labs.
We can take this idea one step further. One of the common uses of semantic analyzers in contemporary programming environments is in tools that support automated refactoring. Even the simplest refactorings — say, renaming a variable or a method — require semantic analysis to ensure that the new name does not create a conflict with an existing name in the same scope. A semantic analyzer can check for conflicts that human programmers might miss, especially in large code bases.
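Here is a sketch of the conflict check at the heart of a rename refactoring, assuming a single scope's symbol table is just a dictionary. Real tools must also look at enclosing scopes and at references that would be captured by shadowing, which is why they rely on full semantic analysis rather than textual search.

def can_rename(scope_symbols, old_name, new_name):
    """A rename is safe in this scope only if the old name exists
    here and the new name is not already bound here."""
    return old_name in scope_symbols and new_name not in scope_symbols

scope = {"m": "integer", "n": "integer"}
print(can_rename(scope, "n", "divisor"))   # True: no conflict
print(can_rename(scope, "n", "m"))         # False: m already exists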
Implementing Semantic Analysis
In this course, we focus our semantic efforts on verifying type information and building the symbol table. Both can be done using straightforward structural recursion over the abstract syntax tree, in either a pre-order or post-order traversal.
Many other static checks can be implemented using the same technique and can even be done at the same time as checking types or building the symbol table. For example, the compiler can verify the uniqueness of names at the time each entry is made in the symbol table.
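For instance, a single structural pass can enter each function into the symbol table and report duplicates (and a missing main) at the same time. A minimal sketch, reusing the hypothetical Function nodes from the lint sketch above:

def build_symbol_table(functions):
    """One structural pass over a program's Function nodes: record
    each signature and check uniqueness as each entry is made."""
    table = {}
    errors = []
    for fn in functions:
        if fn.name in table:
            errors.append(f"duplicate function name: {fn.name}")
        else:
            table[fn.name] = fn
        # The same idea one level down: duplicate formal parameters.
        if len(set(fn.params)) != len(fn.params):
            errors.append(f"duplicate parameter name in {fn.name}")
    if "main" not in table:
        errors.append("there is no function named main")
    return table, errors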
Let's now explore some of the key ideas and techniques behind type checking. We will use the length function:
function length(n: integer): integer
if (n < 10)
then 1
else 1 + length(n / 10)
and its AST to illustrate the ideas.
Quick review question: Now that the parser has produced an abstract syntax tree, we apply semantic analysis to the AST, not the source code or sequence of tokens. Why?
Type Checking
A type checker verifies that the type of some program component matches what the program expects where the component occurs. Here are some examples of expectations that must be verified:
- Operators expect arguments of a specific type.
- Function calls require a specific number of arguments, each of a specific type.
- Assignment statements require that the value match the type of the variable being assigned.
- Only an array variable can be indexed.
- Only a pointer can be dereferenced.
Type checking can also be of assistance to the code generator. Many target languages, including assembly languages, support different operations for similar but different types, such as integers and real numbers. Knowing that a particular expression is an integer means that the compiler can generate code using the more efficient integer operation.
When an operator such as + can be used with arguments of different types, we say that the operator is overloaded. You may be familiar with overloading from languages such as Java and C++. For example, in Java, + works on strings as well as numbers.
Not only do these languages include overloaded built-in operators, but we can also write methods of the same name that take different types and numbers of arguments. For example, in Java, a class can have multiple constructors, as long as each has a unique argument signature.
In C++, programmers can even overload built-in operators by writing methods for their classes. For instance, a class for rational numbers might support addition using the same + operator as all other numbers. Or a list class might support + for concatenation in the same way as a string. Semantic analysis can identify the context in which an operator or function operates and record that information for later use.
To build a type checker, we use:
- information from the language's grammar, which specifies the syntactic constructs that can appear in a program, and
- rules for assigning types to each construct.
For example, the Java language specification says that, when both operands to a binary arithmetic operator are integers, the type of the result is also an integer. This kind of rule points out that every expression has a type associated with it.
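Such a rule translates almost directly into code. Here is a sketch of the integer case, with types represented as plain strings:

def type_of_binary_arith(op, left_type, right_type):
    """Java-style rule: integer op integer yields integer.
    (A real checker handles more operand-type combinations.)"""
    if left_type == "integer" and right_type == "integer":
        return "integer"
    return "error"   # a special type expression; more on this below

print(type_of_binary_arith("+", "integer", "integer"))   # integer
print(type_of_binary_arith("+", "integer", "boolean"))   # error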
The Java spec also tells us that we can create an array of values by following a type name T with []. The result is a new type, array of T. This kind of rule points out that types have structure, because they can be constructed out of other types.
Type Expressions
In many languages, a type can be basic or constructed.
- A basic type is one provided as a primitive in the language, such as int, char, and boolean.
- A constructed type is one created by the programmer, either implicitly or explicitly, out of other types.
For example, an array is typically a homogeneous aggregate of other values, and its type reflects that. In languages that support explicit pointers, such as C and Ada, a pointer is a type constructed from another type, too. Just as we can create an array of T for some T, we can also create a pointer to T.
We can also think of the signature of a user-defined function as specifying a type. For the Klein length function:

function length(n: integer): integer

A call to length produces an integer value for use in the calling expression. The function header also creates an expectation for the call: that it passes an integer as its only argument. We might think of this function as having the type:

integer → integer
Compilers that need to reason about higher-order functions, and even verify their types, use function types of this sort. Haskell and Scala are languages that do amazing things with function types, including inferring automatically the types of untyped expressions. But even compilers for more conventional languages such as Java or even Klein can use function types effectively to verify that calls to a function are legal.
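As a sketch of how a checker might represent and use such a function type (representations vary; this is not Klein's required design):

from dataclasses import dataclass

@dataclass(frozen=True)
class FunctionType:
    param_types: tuple   # e.g., ("integer",)
    return_type: str     # e.g., "integer"

LENGTH_TYPE = FunctionType(("integer",), "integer")   # integer -> integer

def check_call(fn_type, arg_types):
    """A call is well typed if arity and argument types match;
    the type of the call is then the function's return type."""
    if tuple(arg_types) != fn_type.param_types:
        return "error"
    return fn_type.return_type

print(check_call(LENGTH_TYPE, ["integer"]))              # integer
print(check_call(LENGTH_TYPE, ["boolean"]))              # error
print(check_call(LENGTH_TYPE, ["integer", "integer"]))   # error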
Because a type can now be more than just a name, we need to think more generally about type expressions. We will associate a type expression with each language construct that can have a type: identifiers and expressions.
We can specify type expressions more explicitly using this inductive definition:
- A basic type is a type expression. Basic types are those provided as primitives in the language. Different kinds of languages offer different kinds of basic types. Typical basic types include integer, char, and boolean.
- A type created by applying a type constructor to one or more types is a type expression. Arrays and pointers are examples of constructed types with which you are likely familiar. We explore type constructors in more detail in a reading assignment and in the next session.
- A type name given to a type expression is a type expression. In C, we can create a struct consisting of parts and name it account, so account is a type expression. The typedef below does the same for Point:

typedef struct { int x, y; } Point;

In Ada, we can create new type names explicitly in a program, such as naming an array of numbers Hours. This is a type expression, too.
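This inductive definition maps naturally onto the data structures of a checker, with one class per case. A sketch:

from dataclasses import dataclass

@dataclass(frozen=True)
class BasicType:          # case 1: a basic type is a type expression
    name: str             # "integer", "char", "boolean", ...

@dataclass(frozen=True)
class ArrayType:          # case 2: a constructor applied to a type
    element: object       # "array of T"

@dataclass(frozen=True)
class PointerType:        # case 2 again: "pointer to T"
    target: object

@dataclass(frozen=True)
class NamedType:          # case 3: a name given to a type expression
    name: str             # e.g., "account" or "Point"
    definition: object

INTEGER = BasicType("integer")
HOURS = NamedType("Hours", ArrayType(INTEGER))   # Ada-style named array type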
Much more is possible. For example, a type expression might contain variables whose values are type expressions. This is true for generic data types in Ada and Java, and in C++ templates. Consider this C++ template:
template <class T>
T max( T a, T b )
{
return (a > b) ? a : b;
}
The type of max is (T, T) → T, where T is a type variable. In this expression, T behaves as a variable ordinarily does: it has the same value in all three occurrences. This sort of definition specifies a family of types that can be instantiated at compile time. In languages other than C++, Java, and Ada, we could imagine instantiating the type expression at run-time. Consider the type of a Racket function like map...
Finally, when we implement a compiler, we often use a special basic type expression, error, to indicate mismatches that arise in type checking.
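One benefit of a distinguished error type is avoiding a cascade of messages: once a subexpression has type error, enclosing expressions can propagate it silently. A sketch:

ERROR = "error"   # the special basic type expression

def type_of_plus(left_type, right_type, report):
    """Check '+': report a mismatch only once, propagating silently
    when either operand is already in error."""
    if ERROR in (left_type, right_type):
        return ERROR                      # already reported deeper in the tree
    if left_type == "integer" and right_type == "integer":
        return "integer"
    report("operands of + must be integers")
    return ERROR

errors = []
t = type_of_plus("integer", "boolean", errors.append)   # reported here
t = type_of_plus(t, "integer", errors.append)           # propagated silently
print(errors)   # ['operands of + must be integers']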
Type Systems
A type checker assigns a type expression to each expression in a program. The set of rules the checker uses to assign these types is called a type system.
A compiler or another program can implement any type system, even one different from the one specified by the language definition itself. Consider:
- Some compilers provide non-standard features. For example, some Pascal compilers allow a program to pass an array to a function without specifying its index set. This is less strict than the Pascal language definition, so we call such a compiler permissive.
- Some tools impose style filters on code. As we discussed earlier, lint checks a C program for potential bugs and non-idiomatic style. This can be more limiting than C's language definition. We call such a tool strict.
Of course, we know that a compiler may not do any type checking. However, the fact that the programs in a language are not type-checked at compile time does not mean that the language does not have a type system.
Any feature that can be verified statically can also be verified dynamically, at run time, as long as the target code carries with it the information needed to perform the check.
For example, each object in a program might use a few bits to record its type. At run-time, the interpreter can check types before applying operators, calling functions, or sending messages. This is how languages such as Racket, Scheme, and Smalltalk work. They are strongly typed, but dynamically typed.
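Here is a sketch of what the run-time version of such a check looks like, with each value's type tag consulted before an operator is applied. This is illustrative only, not how any particular Racket or Smalltalk runtime is implemented.

def dynamic_add(a, b):
    """Apply '+' only after checking type tags at run time, as a
    dynamically but strongly typed language would."""
    if not (isinstance(a, int) and isinstance(b, int)):
        raise TypeError(f"+ expects integers, got {type(a).__name__} "
                        f"and {type(b).__name__}")
    return a + b

print(dynamic_add(1, 2))     # 3
try:
    dynamic_add(1, "two")    # the check fails at run time...
except TypeError as e:
    print(e)                 # ...with a good error message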
This is one reason that a well-known computer scientist once defined a virtual machine as "a tool for giving good error messages". He might well have said the same thing about a compiler!
The converse is not true. There are some checks that can be done only dynamically in many languages. Consider this Java-like example:
char[] table = new char[256];
int i;
// ... later:
foo( table[i] );
The compiler cannot guarantee that the attempt to access the array table[i] will succeed at run-time, because it cannot guarantee that i will lie in the range [0..255]. A similar problem arises if we fix the range of i but allow the program to assign an arbitrary array to the variable (which is true of Java arrays). In a language such as Ada, programmers can specify data types much more rigorously, which enables the compiler to enforce the definition strictly.
The compiler may be able to provide some help by doing data-flow analysis, another form of static analysis, to infer more about the values that a variable might take. Data-flow analysis can uncover a lot of information about a program, but it cannot check every case that we might like.
What Next?
Our definition of type expressions and our catalog of type constructors give us the tools we need to do type checking. As we will see, most complications in type checking result from constructed and named types. We will pick up our discussion of these type-checking issues next session.