Session 27
Introduction to Optimization
Where Are We?
For the last few weeks, we have been talking about the synthesis phase of a compiler, which converts a semantically-valid abstract syntax tree into an equivalent program in the target language. This involves building a run-time system to support the execution of a program in target code, perhaps translating the AST into an intermediate representation such as three-address code along the way, and generating code in the target language.
Most recently, we learned a basic algorithm for generating target code, including register selection and backpatching jump targets. Last time, we learned about a way to improve register selection using next use information. Then we looked at code for a complete code generator, Louden's code generator for "Tiny C".
In this session, we move on to the final topic of the course, optimization. But first, let's take a quick look at a practical matter: giving boolean command-line arguments to a TM program.
Klein Command Line Arguments and TM
Now that you are writing
a code generator
for the entire Klein language, you face the practical matter
of testing boolean arguments to main(). The
challenge is that, while Klein supports both integer and
boolean values, TM supports only integers.
I wrote this TM program that echoes its argument:
0:  LD    0,1(0)
1:  OUT   0,0,0
2:  HALT  0,0,0
How does it behave with non-integer arguments?
$ tm-cli-go echo.tm 2
OUT instruction prints: 2

$ tm-cli-go echo.tm s
(null) is not a valid command line argument
true, false or integer values are valid arguments

$ tm-cli-go echo.tm true
OUT instruction prints: 1

$ tm-cli-go echo.tm false
OUT instruction prints: 0
So: We can use Klein's boolean literals true
and false as command-line arguments to TM.
Thanks again to Mike Volz, CS 4550 Class of 2007, for this
extension to TM.
On to Optimization
This semester, we have learned basic techniques for building a compiler that converts a semantically-valid program written in a source language into an abstract syntax tree, and then into an equivalent program in a target language. While a source program is usually written best when it communicates well with human readers, an executable program is usually written best when it meets the needs of the target machine, is small, and runs fast. How can we produce such a target program?
In this session, we consider briefly some of the key ideas in the area of code optimization. Optimization can happen at any stage of a compilation:
- optimizing the AST before generating IR
- optimizing the IR before generating target code
- optimizing target code directly
You may be surprised at how easily we can optimize your Klein-to-TM compiler using these ideas. Even if you do not have time to implement them in your compiler, you can appreciate the ideas and the process.
Optimizing Code
As we have learned TM assembly language and code generation this semester, we've often seen that the TM programs we wrote by hand were shorter and more efficient than the TM programs produced by our code generation techniques. Why is that?
The main reason is that our code generation techniques focus on a single IR statement or a single AST node at a time, while we could keep the entire program in our head at one time and reason about it as a whole.
Ideally, we would like for our compilers to generate target code that is as good as or better than the code we could write by hand. A good optimization stage can help our compiler do better.
What Is Optimization?
The term optimization is a misnomer, really, because showing that a piece of code is optimal is undecidable. More accurately, an optimizer simply improves the code in some objective way.
Actually, we have already improved the quality of target code
in an objective way several times during the course. For
example, in Session 25, we saw that
implementing a better getregister() utility
and using a memory map could help
the basic code generation algorithm
produce more efficient code. However, these kinds of
improvement do not change the content of the program itself;
they change the internal mechanisms of the compiler. So we
would not call them "optimizations".
That said, there is no bright line between the techniques we call optimizations and the techniques we think of as simply writing a better compiler. Perhaps when a technique becomes so standard that "everyone does it", it is no longer considered an optimization.
Some optimizations are independent of the target machine, as they manipulate the computation being done by the program in some way. Others take advantage of the target machine's features to generate code that runs more efficiently on that machine only. Many machine-independent transformations can be done on the intermediate representation, prior to code generation. Machine-dependent transformations are best accomplished when generating target code.
Whatever the nature of an optimization:
- It must preserve the meaning of the input program. The modified program should behave in essentially the same way as the original program.
- It should improve the speed, the size, or some other feature of the resulting program.
- It should be cost-effective. That is, the total value gained from the improvement to the code should exceed the total cost of implementing the optimizer and executing it on programs.
Optimizing Through Profiling
A common approach to designing an optimizer is to identify a set of typical target programs, find the most frequently executed code structures in them, and focus efforts to improve the compiler's performance on these common structures. This echoes the conventional wisdom of the 80/20 Rule: in this setting, roughly 80% of a program's running time is spent in roughly 20% of its code.
To implement such an approach, we might use a profiler to execute a program on an accepted suite of data and identify the hot spots in the program.
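To make that concrete, here is a sketch of one way to do this with the GNU toolchain, using gprof; program.c and typical-input.txt are placeholder names, not files from our course:

$ gcc -pg program.c -o program        # compile with profiling hooks
$ ./program < typical-input.txt       # running the program writes gmon.out
$ gprof program gmon.out              # report time spent in each function

The report lists each function with the time spent in it and the number of times it was called, which points us at the hot spots worth optimizing.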
Finding or creating a widely accepted suite of input programs to use with a compiler can be difficult, though it has been done for many languages, including C, TeX, and Scheme. We have built a suite of Klein programs that can serve as a benchmark for compiler performance.
Optimizing Through Analysis
Another common approach to implementing an optimizer is to:
- Do static analysis of the code generated by the compiler, focusing on the flow of data or control through the program. Such an analysis builds a graph that 'chunks' the intermediate representation into basic blocks defined by transfers of control or changes in data objects. (See the sketch after this list.)
- Apply one or more transformations to the code, motivated by the features of these chunks.
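To make the idea of these chunks concrete, here is a small, hypothetical C function written in a three-address style. A basic block begins at the first statement, at any statement that is the target of a jump, or at any statement that follows a jump, and it runs until the next jump or leader. The comments mark the blocks.

int sum_of_squares(void)
{
    int i, t1, t2;

    t1 = 0;                   /* B1: the entry block                */
    i  = 1;
L1: t2 = i * i;               /* B2: leader, the target of goto L1  */
    t1 = t1 + t2;
    i  = i + 1;
    if (i < 10) goto L1;      /* the conditional jump ends B2       */
    return t1;                /* B3: follows a jump                 */
}

The flow graph for this function has edges from B1 to B2, from B2 back to itself (the loop), and from B2 to B3.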
Common Optimizations
Common program transformations for optimization include eliminating common subexpressions, propagating copies, eliminating dead code, and folding constant expressions at compile time; a small example appears below. Code elements that commonly benefit from optimization include loops, function calls, and address calculations.
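As a quick illustration of two of these transformations, consider this hypothetical fragment before and after optimization. The variable names are made up for this sketch.

/* before */
x = (a + b) * 2;
y = (a + b) + 4 * 16;

/* after constant folding and common subexpression elimination */
t = a + b;          /* the common subexpression is computed once  */
x = t * 2;
y = t + 64;         /* 4 * 16 is evaluated at compile time        */

The optimized fragment computes a + b only once and never evaluates 4 * 16 at run time.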
In our time together, let's consider three common optimizations in some detail:
- unrolling loops,
- inlining function calls, and
- compiling tail calls, especially recursive calls, differently.
We'll look at loops today and functions next time.
Optimizing Loops
Loops are a common target for optimization. A few simple transformations to a loop can produce code that runs much faster on large inputs. The cost of such transformations is that they usually produce larger target programs. One example of this is unrolling a loop.
Consider this counted loop in C:
for (i = 0; i < 1000000; i++) {
s = s + i*i;
printf("%d\n", i*i);
}
The loop increments the counter i 1000000 times,
branches back to the top of the loop 1000000 times, and checks
the exit condition i < 1000000 1000001 times.
All of these operations are overhead to the actual work
being done: the computing and printing of a running sum.
We can reduce the overhead if we write a longer program, "unrolling" some of the loop iterations into explicit statements:
for (i = 0; i < 1000000; i += 5) {
s = s + i*i;
printf("%d\n", i*i);
s = s + (i+1)*(i+1);
printf("%d\n", (i+1)*(i+1));
s = s + (i+2)*(i+2);
printf("%d\n", (i+2)*(i+2));
s = s + (i+3)*(i+3);
printf("%d\n", (i+3)*(i+3));
s = s + (i+4)*(i+4);
printf("%d\n", (i+4)*(i+4));
}
The unrolled loop increments the counter i only
200000 times, branches back to the top of the loop only 200000
times, and checks the exit condition only 200001 times.
Not surprisingly, the unrolled loop runs noticeably faster; in these two typical runs on my MacBook Pro, it uses roughly a third of the user time:
> time for-loop > /dev/null

real    0m0.235s
user    0m0.082s
sys     0m0.009s

> time for-loop-unrolled > /dev/null

real    0m0.170s
user    0m0.027s
sys     0m0.005s
Redirecting standard output to /dev/null saves us
from having to see all of the output. It also saves the time
of writing all of the output to the screen. It does not affect
the magnitude of the speed-up.
Ordinarily, the tradeoff for this optimization is an increase in the size of the generated code. But check this out:
-rwxr-xr-x  1 wallingf  staff  33432 Dec  2 11:22 for-loop
-rwxr-xr-x  1 wallingf  staff  33440 Dec  2 11:22 for-loop-unrolled
No change! Can you imagine why?
What happens, though, if we unroll the for loop
all the way?
Try it! If you do, compile with gcc -w to suppress compiler warnings.
On my machine, the executable runs slower than the original loop, and the compiled code is over 4000 times larger. Can you imagine why the code runs slower?
Modern pipelined processors make the benefits of unrolling a loop harder to pin down. gcc has flags to control its loop-unrolling behavior and many other optimizations, but its default behavior is often what most programmers want most of the time.
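For instance, gcc's -funroll-loops and -fno-unroll-loops flags turn this transformation on and off. Assuming the source for our loop program is in for-loop.c, we might compare the two behaviors like this; the output names are just placeholders:

$ gcc -O2 -funroll-loops    for-loop.c -o for-loop-gcc-unrolled
$ gcc -O2 -fno-unroll-loops for-loop.c -o for-loop-gcc-rolled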
The two smaller for-loop programs are in
today's zip file,
along with a Python program that generates the statements of
the completely unrolled loop. (I did not type that one by
hand!) If you want that program, too, download it separately:
for-loop-unrolled-all.c.
It's big enough — nearly 58 MB — that I won't
saddle you with it unless you want it.
Wrap Up
Klein doesn't have a loop construct, so this optimization can't help us. Next time, we turn our attention to Klein's most important feature, the function call, and learn how we might optimize the code our compiler generates — which, ironically, might involve generating a loop!