Session 29
Optimization and Programming
And You Thought Module 4 Was Bad
TypeScript has a 33,836-line type checker. At 1.92 megabytes, it is so big that GitHub refuses to display it.
Optimization Redux
Last week, we considered a few ways in which a compiler can optimize the code it generates. As we learned, an optimization is nothing more than a program transformation that trades one cost for another, while preserving the meaning of the program. A compiler can make these improvements at multiple levels: source language, AST, IR, or even target language. Most optimizations work best at the AST level or in a well-designed IR.
Over the last two sessions, we have considered three classes of optimization:
- optimizing loops, in particular by unrolling for loops into repeated statements and by modifying how while loops branch,
- optimizing function calls by inlining the function body in place of the call, and
- optimizing tail-recursive calls by converting the call into a goto.
Handling recursive calls in tail position properly offers an inordinate payoff in languages such as Klein, which depend so heavily on function calls for iteration.
The run-time advantage of properly handled tail-recursive calls is so great that programmers sometimes choose to rewrite programs that are not tail-recursive so that they are.
Consider an old standard, factorial:
function factorial(n: integer): integer
if (n < 2)
then 1
else n * factorial(n-1)
This function recurses until n reaches its base case and only then starts a chain of multiplications, beginning at 1, that unwinds all of the recursive calls. In a language with loops, we would implement a much more efficient solution that iterates from n down to 0. We can do the same thing in a functional language by making factorial tail recursive:
function fact_aps(n: integer, acc: integer): integer
if n = 0
then acc
else fact_aps(n-1, n*acc)
Instead of saving the multiplication to be done after the called function returns, this function does the multiplication right away and passes the partial result along as a second argument. Calling fact_aps(n, 1) computes the same value as factorial(n).
As we saw last time, this function is, or can be, a loop. It's a simple conversion:
L1: IF n != 0 THEN GOTO L2
    T1 := acc
    GOTO L3
L2: T2 := n - 1
    T3 := n * acc
    PARAM T2                →  n = T2
    PARAM T3                →  acc = T3
    T1 := CALL fact_aps 2   →  GOTO L1
L3: RETURN T1
Writing this version of the function requires the programmer to think differently, but in languages like Klein we often find ourselves writing tail-recursive functions as a matter of course. Consider this function, which I wrote as part of a program to compute the average digit of a Klein integer:
function average_digit(n: integer, sum: integer, i: integer): integer
if n < 10
then print_and_return(sum+n, i+1)
else average_digit(n/10, sum + MOD(n,10), i+1)
It implements something like a while loop to count
the digits in a number and sum them up along the way. I didn't
try to write a tail-recursive function; it's the most natural
way to write this code in Klein.
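To make that hidden loop explicit, here is a rough Python analogue of average_digit. It is my own sketch, not part of the Klein program; where the Klein code calls print_and_return(sum+n, i+1), the sketch simply computes the integer average of the digits directly.

def average_digit(n):
    # The while loop that the tail recursion above corresponds to.
    digit_sum, count = 0, 0
    while n >= 10:
        digit_sum += n % 10    # add the low-order digit to the running sum
        count += 1
        n //= 10               # drop the low-order digit
    return (digit_sum + n) // (count + 1)   # last digit and final count, as in the Klein code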
Sometimes our compilers optimize code for us, and sometimes we write code that runs quickly already.
Non-recursive Tail Calls
But it gets even better... Consider this Klein function from the sieve of Eratosthenes, which is not tail-recursive:
function sieveAt(current: integer, max: integer): boolean
if max < current
then true
else doSieveAt(current, max)
Though sieveAt is not recursive, its "tail
position" consists of a function call. The result of that
call will be returned as the value of the call to
sieveAt.
In this case, though, the called function's activation record does add value to the computation. It is the calling function's activation record that is no longer needed. It is simply a placeholder used to pass through the result of the called function to the caller's caller.
In this case, it is possible to pop sieveAt's
stack frame before the invocation of
doSieveAt, rather than after
sieveAt returns. This generalizes the idea of
"eliminating tail recursion" to
the proper handling of all tail calls.
In calling doSieveAt, the target code for
sieveAt can:
- prepare the arguments for doSieveAt,
- create doSieveAt's activation record, setting its control and access links to point back to sieveAt's caller,
- discard its own activation record,
- place doSieveAt's activation record on top of the stack, and
- invoke doSieveAt with an unconditional jump rather than a procedure call.
When doSieveAt returns, control will pass
directly to sieveAt's caller.
If a function invokes itself in a tail position, then tail-call optimization becomes a tail-recursion optimization.
There are two steps to implementing this technique:
- detecting the tail call and
- implementing the more efficient code.
Both are relatively straightforward to implement. This optimization is remarkably valuable for programs written in a functional language, or for programs written in an imperative language using a functional style.
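To make the first step concrete, here is a minimal sketch, in Python, of how a compiler might mark the calls that sit in tail position while walking a function body. The node classes (Call, If, BinaryOp) are hypothetical stand-ins for whatever your own AST defines; the key idea is that a function's body is in tail position and that an if expression passes tail position down to both of its branches, but not to its test.

from dataclasses import dataclass

# Hypothetical AST node classes, standing in for whatever your compiler defines.
@dataclass
class Call:
    callee: str
    args: list
    is_tail_call: bool = False

@dataclass
class If:
    test: object
    then_branch: object
    else_branch: object

@dataclass
class BinaryOp:
    op: str
    left: object
    right: object

def mark_tail_calls(expr, in_tail_position=True):
    # Mark every Call node that sits in tail position of a function body.
    if isinstance(expr, Call):
        expr.is_tail_call = in_tail_position
        for arg in expr.args:                       # arguments are never in tail position
            mark_tail_calls(arg, False)
    elif isinstance(expr, If):
        mark_tail_calls(expr.test, False)           # the test is not in tail position
        mark_tail_calls(expr.then_branch, in_tail_position)
        mark_tail_calls(expr.else_branch, in_tail_position)
    elif isinstance(expr, BinaryOp):
        mark_tail_calls(expr.left, False)
        mark_tail_calls(expr.right, False)
    # literals and variable references need no marking

When the code generator later reaches a Call marked as a tail call, it can emit the jump sequence described above instead of a normal procedure call.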
Some functions that don't look like they make tail calls actually can. Consider that this Klein function:
function horner(x: integer, n: integer, value: integer): integer
if (GE(n, 0))
then horner(x, n - 1, (value * x) + coefficient(n))
else value
is equivalent to this one:
function horner(x: integer, n: integer, value: integer): integer
if n < 0
then value
else horner(x, n - 1, (value * x) + coefficient(n))
Very nice. How easy do you imagine it is for static analysis of a function to detect such a case and convert it to the equivalent tail-call function?
If this all sounds so good, why would anyone not optimize tail calls? First of all, these transformations may impose trade-offs that the programmer does not want to make. Handling tail calls in this way saves space on the run-time stack, but for a variety of reasons it can lead to slower programs at run time.
Second, and more important, there are issues that complicate the proper handling of tail calls and can make it difficult or impossible to implement correctly. For example, in languages such as Java and C++, these include:
- the order in which arguments are evaluated. What happens if overwriting one argument changes the computation of another?
- the invocation of constructors and destructors of objects in dynamic memory.
Interested in learning more about tail call optimization?
There are several articles about tail call optimization in ES6, the most recent version of JavaScript.
This discussion thread includes the source of one of my examples and gives some examples of why imperative languages such as C and C++ cannot eliminate tail calls, due to side effects in argument calculations, constructors, and destructors.
The Other Side of Optimization
The last two examples show that programmers sometimes implement
optimizations by hand. In the old days, programmers
occasionally unrolled loops by hand in order to improve
performance. In my work with Klein, I have frequently converted
if statements to boolean expressions, folded two
function calls into one by substituting values by hand, and so
on. I've also gone the other direction. But I would certainly
like for my compiler to optimize the code it generates so that
I don't have to worry about such matters.
Or do I? Can we take optimization too far?
Consider the case of
the C strlen() function
we encountered earlier. In order to make that call constant
and lift the invariant code out, we would have to tighten C's
language spec in several ways. Not everyone wants to give up
the freedom that C gives us. The point is more general than C,
though, as
this Hacker News comment
says:
And here is where the "optimizing compiler" mindset actually starts to get downright harmful for performance: in order for the compiler to be allowed to do a better job, you typically need to nail down the semantics much tighter, giving less leeway for application programmers. So you make the automatic %-optimizations that don't really matter easier by reducing or eliminating opportunities for the order-of-magnitude improvements.
The writer of that paragraph is not a low-level C programmer: he is a hard-core Smalltalk programmer.
As one compiler writer says, implementers of optimizing compilers face a lot of challenges dealing with programmers (emphasis added):
It's important to be realistic: most people don't care about program performance most of the time.
This author suggests that we keep our compilers and optimizers simple and powerful — in the small.
But let's not lose sight of how good we programmers have it these days. Most compilers, from LLVM and the standard Java compiler to compilers for languages such as Rust, Haskell, and Racket, optimize code so well that we often need not think much about the speed or size of our compiled programs: they are already amazing. And that is a tribute to the code generators and optimizers in the tools we use, and to the researchers and implementers who have helped create them.
Examining Code Efficiency in TM
How can we tell if an optimized TM program is better than the original?
One way is to examine the size efficiency of our compiler by looking at the size of the code it generates:
- the number of statements in the generated .tm file. Statements in TM assembly language include line numbers, so we don't even have to count.
- the number of bytes in the compiled code. Before doing this, though, we would want to remove all the comments from the generated code. This does not give us a whole lot of new information over the number of statements, because all TM instructions have basically the same length.
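As a small, hedged illustration, here is a Python sketch that computes both measures for a generated .tm file. It assumes comments appear on lines of their own beginning with '*' and that every instruction line begins with its line number and a colon; adjust the pattern if your generator's output differs.

import re
import sys

# Instruction lines in a .tm file start with a line number and a colon;
# full-line comments start with '*'.  (Adjust if your generator differs.)
INSTRUCTION = re.compile(r'^\s*\d+\s*:')

def code_size(tm_path):
    statements, size_in_bytes = 0, 0
    with open(tm_path) as tm_file:
        for line in tm_file:
            if INSTRUCTION.match(line):               # skip comment and blank lines
                statements += 1
                size_in_bytes += len(line.encode())   # bytes of code only
    return statements, size_in_bytes

if __name__ == "__main__":
    count, size = code_size(sys.argv[1])
    print(count, "statements,", size, "bytes")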
We can also examine the time efficiency of our compiled code. Two possible metrics are:
- the number of statements executed at run time. The TM simulator reports the number of statements executed on each program run. (You can use the p command to toggle that option off and on.)
- the amount of time needed to run a program. To examine our generated code's run time at this level of granularity, we need a way to capture the clock time of a program's execution.
To support the last of these, former students and I have extended the TM simulator with a clocking mechanism that reports actual run time in the simulator, in milliseconds. You can download this version of the simulator from the project page.
In general, some machine instructions take longer to execute than others. Capturing simulator run time provides a proxy for that. For a more accurate representation of execution time, we would want to count execution time differently for the different kinds of instruction, say, RM instructions versus RO instructions and MUL/DIV instructions versus ADD/SUB instructions. We could build profiles for these instruction types based on an actual machine and then create a TM simulator capable of reporting that data.
This would be a nice project!
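As a rough illustration of that idea (not the project itself), here is a minimal Python sketch of a weighted cost model. The weights are invented purely for illustration; real ratios would come from profiling an actual machine, and the sketch assumes a simulator extended to log the opcode of each instruction it executes.

# Hypothetical cycle weights, invented for illustration.
CYCLE_WEIGHTS = {
    "MUL": 3, "DIV": 3,      # multiply/divide, assumed slowest
    "LD": 2, "ST": 2,        # memory loads/stores, assumed slower than register ops
    # everything else (ADD, SUB, jumps, ...) defaults to 1 below
}

def weighted_cost(executed_opcodes):
    # Sum a weighted cost over a trace of executed opcodes.
    return sum(CYCLE_WEIGHTS.get(op, 1) for op in executed_opcodes)

# For example, a trace of ["LDC", "MUL", "ADD", "ST"] costs 1 + 3 + 1 + 2 = 7.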
Fortunately, we have a suite of benchmark programs...
Extending Klein
As I've mentioned in class this semester, I enjoy programming in Klein. My last two forays, over Thanksgiving break, were a program to compute the series 1/4 + 1/16 + 1/64 + ... for a given number of terms and a program to demonstrate a feature of certain fractions with repeating decimals.
I like to program.
Many programs would be easier to write if Klein had a
for loop, though
writing tail-recursive code
becomes a matter of habit. But some problems really
demand an array.
Klein is small, designed specifically for a one-semester course in writing a compiler. To use it for a wider range of problems, we would need to extend it.
Consider this list of features we might add to Klein:
- a for loop or a while loop
- an array data type
- local variables
- an assignment statement
What would you have to change in your compiler to implement each of these features? Do we need more knowledge, or do we need only more time?
The language feature I most covet, I think, is an
import feature, such as:
from [file] import [function]
How hard would such a feature be to implement? We have all
the machinery we need: open the imported file; scan, parse,
and semantic-check it; copy the requested function's AST into
the AST of the importing file. We could create a simple
preprocessor to do this. Or we could create a preprocessor
to find and substitute the text of a function for the text of
the import statement, and then let the usual
compilation process take place.
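For the text-substitution version, here is a naive Python sketch. It rests on simplifying assumptions that are not part of Klein proper: each import appears on a line of its own in exactly the form shown above, and every function definition begins a line with "function name(" and runs until the next such header or the end of the file.

import re

# Assumed import syntax: "from FILE import FUNCTION", one per line.
IMPORT = re.compile(r'^\s*from\s+(\S+)\s+import\s+(\w+)\s*$')

def function_text(source, name):
    # Extract the text of one function definition from a Klein source file.
    lines = source.splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.lstrip().startswith(f"function {name}("))
    end = next((i for i in range(start + 1, len(lines))
                if lines[i].lstrip().startswith("function ")), len(lines))
    return "\n".join(lines[start:end])

def preprocess(source):
    # Replace each import line with the text of the imported function.
    out = []
    for line in source.splitlines():
        m = IMPORT.match(line)
        if m:
            file_name, func_name = m.groups()
            with open(file_name) as imported:
                out.append(function_text(imported.read(), func_name))
        else:
            out.append(line)
    return "\n".join(out)

The AST-based version would do the same splicing one level later, copying a checked AST instead of raw text.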
Time or knowledge? It may be surprising to you just how useful the knowledge you already have is. Time is the more limiting factor for adding many new language features.
We will revisit this question in an exercise at the beginning of our next session.
Finishing the Project
Module 7 is the final stage of the project. It asks you to put your compiler into a final professional form, suitable for users outside of this course.
Make sure your system meets all functional requirements. Check all previous project feedback and address any issues. Be sure that your compiler catches all errors, whatever their source.
Make sure your project directory presents a professional
compiler for Klein programmers to use. The README
file should be suitable for new users unfamiliar with this course.
It should describe the tools it provides and explain how to
build and run them. It should also provide a high-level
explanation of the contents of the subdirectories. You do
not have to list or explain every file in the folder.
There should not be any extraneous files in the directory:
no config files for VS Code or any other tools you used as
developers. This software is for Klein programmers. The files
in the doc/ folder should have clear, consistent
names.
You also have an opportunity to improve the system in some way, such as an optimization or a better component for one of the existing modules. This step is optional and will earn extra credit.
Evaluation
Recall how grading works for the project:
6 stages x 20 points each = 120
1 project x 130 points = 130
So, 250 points for the compiler itself, plus 25 points for the presentation and 25 points for project evaluations = 300 points total.
The 130 points for Project 7 looks a bit odd, but keep in mind that this involves evaluating the entire compiler again, from the scanner through the code generator, at 20 points per stage. All of the bugs you fixed in previous stages are now gone, so you have an opportunity to score (close to) 20 on those earlier stages. The extra ten points are for the state of the project directory.
If you submit an optimization after the deadline, you will still receive extra credit. I'm a nice guy, and I love optimizations. Early in finals week is the latest we can go.