Session 22
Data Abstraction and Variety

The Infinite Variety of Implementations

Usually, we think of data structures as having alternative implementations. But even atomic types can be represented in a variety of ways.

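Consider one of the simplest data types of all: the non-negative integer. Non-negative integers can be defined with an "interface" of four parts:
  • zero, the value that represents the number 0
  • is-zero?, a predicate that tells us whether a number is zero
  • next, which returns the successor of a number
  • previous, which returns the predecessor of a number

Every integer has a successor; zero doesn't have a predecessor.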

We can implement this interface directly in Racket using Racket's own numbers:

(define zero     0)
(define is-zero? zero?)
(define next     add1)
(define previous sub1)

Using this interface, we can express the number two as (next (next zero)).

But we can also implement this interface just as easily using a Racket list:

(define zero     '())
(define is-zero? null?)
(define next     (lambda (n) (cons 1 n)))
(define previous rest)

With this implementation, we still express the number two as (next (next zero)), just as we did in the number-based implementation. The underlying value is represented differently, as (1 1) rather than 2, but its meaning relative to the other operations is the same.
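
Because both implementations satisfy the same four-part interface, client code written only against that interface runs unchanged on either one. A minimal sketch, using a hypothetical client function plus that is not part of the interface above:

```racket
#lang racket
;; The list-based representation from above
(define zero     '())
(define is-zero? null?)
(define next     (lambda (n) (cons 1 n)))
(define previous rest)

;; plus is a hypothetical client: it uses only the four interface names
(define (plus m n)
  (if (is-zero? m)
      n
      (next (plus (previous m) n))))

(define two   (next (next zero)))
(define three (next two))
(plus two three)   ; '(1 1 1 1 1), the list-based representation of five
```

Swapping in the number-based definitions of the four names changes the result to 5 without touching plus.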

Check out the code for these implementations — and maybe even try to create your own. There are many more ways. We computer programmers are an ingenious lot. Computation is a flexible medium!

An Opening Exercise

As a warm-up exercise, I would like for you to brainstorm as many different implementations as possible for another simple data type: the pair.

List all the different ways you can think of implementing pairs in Python.

A pair consists of two values. We define a pair with an interface consisting of three functions:
  • The operation MAKE-PAIR is a constructor. It takes any two arguments and returns a new pair.
  • The access procedure FIRST takes a pair as an argument and returns the first part of that pair.
  • The access procedure SECOND takes a pair as an argument and returns the second part of that pair.
For example:
> (define pair1 (MAKE-PAIR 2 3))
> (define pair2 (MAKE-PAIR 1 pair1))
> (FIRST  pair2)
1
> (FIRST (SECOND pair2))
2
> (SECOND (SECOND pair2))
3
You should be able to come up with two or three based on data types you used in your Intro course, and maybe more.

If you run out of ideas, list ways that you might do this in Java or some other language — even Racket!

Some Possible Implementations of Pairs

The number of ways to implement a pair in Racket is probably larger than you first imagine. I can think of two ways using data types that you have been using all semester:
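
Two natural candidates are Racket's own pair (cons) and a two-element list. A minimal sketch, assuming those are the two ways meant here; the /cons and /list suffixes are hypothetical and exist only so that both versions can be loaded from one file:

```racket
#lang racket
;; 1. Racket's built-in pair
(define (MAKE-PAIR/cons a b) (cons a b))
(define (FIRST/cons  aPair)  (car aPair))
(define (SECOND/cons aPair)  (cdr aPair))

;; 2. A two-element list
(define (MAKE-PAIR/list a b) (list a b))
(define (FIRST/list  aPair)  (first aPair))
(define (SECOND/list aPair)  (second aPair))

(SECOND/cons (MAKE-PAIR/cons 2 3))   ; 3
(SECOND/list (MAKE-PAIR/list 2 3))   ; 3
```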

Back in Session 4, you learned about Racket vectors and had a reading assignment on them. We can implement a pair as a vector with two slots:

(define (MAKE-PAIR a b) (vector a b))
(define (FIRST  aPair)  (vector-ref aPair 0))
(define (SECOND aPair)  (vector-ref aPair 1))

If you started thinking about data structures in other languages, you might have listed a Python dictionary or a Java map. Back in Session 4, I also mentioned that Racket has a hash table:

(define (MAKE-PAIR a b) (hash 'first a 'second b))
(define (FIRST  aPair)  (hash-ref aPair 'first))
(define (SECOND aPair)  (hash-ref aPair 'second))

Or you may have thought of a C struct. Racket has a struct datatype, too:

(struct pair (one two))       ; a structure with two fields
(define MAKE-PAIR pair)       ; a Racket-generated constructor
(define FIRST     pair-one)   ;   and Racket-generated accessors
(define SECOND    pair-two)   ;   named by struct and field

If you thought of Java, you might have thought of using a class, which is pretty similar to a struct. Racket has classes and objects, too, so we could use the same idea in Racket.
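
A minimal sketch of the class-based idea, using the class system that comes with #lang racket. The class name pair% and the method names get-one and get-two are assumptions made for illustration:

```racket
#lang racket
;; pair% is a hypothetical class with two fields and two getter methods
(define pair%
  (class object%
    (init-field one two)
    (super-new)
    (define/public (get-one) one)
    (define/public (get-two) two)))

(define (MAKE-PAIR a b) (new pair% [one a] [two b]))
(define (FIRST  aPair)  (send aPair get-one))
(define (SECOND aPair)  (send aPair get-two))

(define pair1 (MAKE-PAIR 2 3))
(define pair2 (MAKE-PAIR 1 pair1))
(FIRST (SECOND pair2))   ; 2
```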

But Wait, There's More...

What other values have we used this semester?

Functions. Lots and lots of functions. In Racket, functions are values, too. Is it possible to implement a pair as a function?

What would this mean? The constructor MAKE-PAIR would have to return a function as its value. The accessors FIRST and SECOND, which operate on pairs, would receive a function as their argument.

Indeed we can implement a pair as a function! Here are three ways.

We could make the pair a selector function.

(define (MAKE-PAIR a b) (lambda (selector)
                        (if selector a b)))
(define (FIRST  aPair)  (aPair #t))
(define (SECOND aPair)  (aPair #f))

This approach uses boolean values in addition to functions, as well as an if expression.

We could use message passing to simulate how objects work. This generalizes the idea of a selector function to allow different (and more) arguments.

(define (MAKE-PAIR a b) (lambda (selector)
                        (cond ((eq? selector 'first ) a)
                              ((eq? selector 'second) b))))
(define (FIRST  aPair)  (aPair 'first))
(define (SECOND aPair)  (aPair 'second))

This approach uses symbols and symbol equality, in addition to functions, booleans, and a cond expression.

Both of these solutions use functions in conjunction with another data type to implement a pair. Can we implement a pair using only functions?

We can. This implementation uses nothing but functions:

(define (MAKE-PAIR a b) (lambda (proc) (proc a b)))
(define (FIRST  aPair)  (aPair (lambda (x y) x)))
(define (SECOND aPair)  (aPair (lambda (x y) y)))

I love this last solution. Whenever I see it, I smile. It hints at how much one can do with nothing but functions.
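
To see why the pure-function version works, it helps to trace one call by hand. The sketch below is an informal substitution trace: each expression in the list is one step further along than the one before it, and all of them evaluate to the same value.

```racket
#lang racket
(define (MAKE-PAIR a b) (lambda (proc) (proc a b)))
(define (FIRST  aPair)  (aPair (lambda (x y) x)))

(define steps
  (list (FIRST (MAKE-PAIR 2 3))                          ; the original call
        ((MAKE-PAIR 2 3) (lambda (x y) x))               ; expand FIRST
        ((lambda (proc) (proc 2 3)) (lambda (x y) x))    ; expand MAKE-PAIR
        ((lambda (x y) x) 2 3)))                         ; apply the pair to the selector
steps   ; '(2 2 2 2)
```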

The lambda calculus underlies most programming language theory and inspired the creators of Lisp, Scheme, Racket, and many other languages. It relies solely on function definition, function application, and variable substitution to do all of its computation. It does not even use boolean values or an if statement, which seem to be at the core of every programming language we know. Maybe those things aren't really essential after all?
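
As a hint of how that can work, booleans and selection can themselves be encoded as functions, in the same spirit as the pure-function pair. A sketch, with the names TRUE, FALSE, and IF chosen here for illustration; the branches are wrapped in zero-argument lambdas because Racket evaluates arguments eagerly:

```racket
#lang racket
;; A boolean is a function that chooses one of its two arguments
(define TRUE  (lambda (x y) x))
(define FALSE (lambda (x y) y))

;; IF lets the boolean pick a branch, then calls the chosen thunk
(define (IF condition then-thunk else-thunk)
  ((condition then-thunk else-thunk)))

(IF TRUE  (lambda () 'yes) (lambda () 'no))   ; 'yes
(IF FALSE (lambda () 'yes) (lambda () 'no))   ; 'no
```

TRUE and FALSE are the same two selector functions that FIRST and SECOND pass to a pure-function pair.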

Run the Code

This file contains all eight Racket implementations of pairs shown above. Try them out!

And, so that you know this isn't just a strange phenomenon available only in Racket, here are two implementations of the pair in Python, including the pure function implementation... (Remember, Python has lambdas, too.)

PS: There Is Always More

Of course, if we think a little harder, we can probably find cool ways to use Racket's other primitive types, such as numbers, strings, and symbols, to encode a pair. Those implementations will take a little bit more effort — and code. They also might not be as general.
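
As one illustration of the extra effort and the loss of generality, a pair of non-negative integers can be packed into a single number as 2^a * 3^b. This is a sketch under that assumption: the helper count-factor is a hypothetical name, and the encoding does not work when either part is anything other than a non-negative integer.

```racket
#lang racket
;; Assumption: a and b are non-negative integers
(define (MAKE-PAIR a b) (* (expt 2 a) (expt 3 b)))

;; count-factor (hypothetical helper): how many times p divides n
(define (count-factor n p)
  (let loop ((n n) (k 0))
    (if (zero? (remainder n p))
        (loop (quotient n p) (add1 k))
        k)))

(define (FIRST  aPair) (count-factor aPair 2))
(define (SECOND aPair) (count-factor aPair 3))

(MAKE-PAIR 2 3)            ; 108
(FIRST  (MAKE-PAIR 2 3))   ; 2
(SECOND (MAKE-PAIR 2 3))   ; 3
```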

Indeed, in this session three years ago, one of the students (thank you, Henry!) asked if we could implement a pair using the set data type that we implemented in Homework 7 and used in Sessions 19-20.

I did not know... At first, it seemed impossible, because a pair is ordered and sets are unordered. Even so, while the students worked on Quiz 3, I worked on this challenge as my quiz. It turns out that it is possible! If you'd like to see how, see this implementation.
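
One well-known way to do this, which may or may not match the implementation referred to above, is the Kuratowski encoding from set theory: represent the pair (a, b) as the set {{a}, {a, b}}. A sketch using Racket's built-in sets rather than the Homework 7 set type:

```racket
#lang racket
;; The pair (a, b) becomes the set {{a}, {a, b}}
(define (MAKE-PAIR a b) (set (set a) (set a b)))

;; The first part is the one element common to every member set
(define (FIRST aPair)
  (set-first (apply set-intersect (set->list aPair))))

;; The second part is whatever is left over; if nothing is, a equals b
(define (SECOND aPair)
  (let* ((members (set->list aPair))
         (common  (apply set-intersect members))
         (others  (set-subtract (apply set-union members) common)))
    (if (set-empty? others)
        (set-first common)
        (set-first others))))

(define pair1 (MAKE-PAIR 2 3))
(define pair2 (MAKE-PAIR 1 pair1))
(SECOND (SECOND pair2))   ; 3
```

The order is recovered from the structure of the sets, not from any ordering of their elements.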

One moral of this story is:

Do not assume anything about the implementation of an interface — even the simplest interface!

Some of these implementations might be outside the scope of your imagination just yet. The pure functional implementation probably is. I hope that our study of data abstraction will stretch our minds to a point where these don't seem so strange. Note in particular that we will use the 'message passing' approach above to implement object-oriented programming in Racket.

Setting The Stage

For the last six sessions, we have been exploring the idea of syntactic abstractions, those features of a language that are convenient to have but not essential to the language. We considered several examples: local variables, local functions, non-if selection statements, logical connectives, and — most recently — variable names. Our goal in studying these syntactic abstractions was not to study Racket per se but to see why and how language interpreters provide such abstractions. Indeed, you can identify many abstractions in other languages you know.

Beginning with this session, we move on to another sort of abstraction that all languages provide: data abstraction. We will introduce the idea of data abstraction by returning to an idea you know well: data types and their implementations.

Data Abstraction

Programming requires two kinds of abstraction.

A syntactic abstraction offers a different way to express behaviors. It does not add to what can be expressed in the language, but it does add to what can be expressed conveniently.
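
For instance, a let expression that binds local variables can always be rewritten as the application of a lambda. In the sketch below, the two definitions are equivalent; the second is the kind of expression a translator could produce from the first. The names with-let and without-let are chosen here for illustration.

```racket
#lang racket
;; Local variables expressed with let
(define with-let
  (let ((x 2) (y 3))
    (+ (* x x) y)))

;; The same computation with the let translated away
(define without-let
  ((lambda (x y) (+ (* x x) y)) 2 3))

(list with-let without-let)   ; '(7 7)
```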

A data abstraction offers a different way to express values. Usually, a data abstraction allows you to both represent and manipulate the data. In practice, these data are often aggregate values, but that is not necessarily true, as we saw at the beginning of the session.

As with syntactic abstractions, a data abstraction does not add to the set of problems that can be solved in a language. It makes some solutions easier or more convenient to create.

We have already used one data abstraction extensively in this course: Racket's list, which is constructed out of the more fundamental type, the pair. Racket lists are implemented in terms of another data type, so they are a data abstraction. Technically, we don't need lists, but they make our jobs more convenient. Each list is built out of a sequence of pairs (primitive cons cells) and the empty list (the null pointer). The language provides an interface that, for the most part, hides from us the underlying data representation.

When we build a list incorrectly, Racket occasionally reminds us that lists are built out of pairs, by showing "dotted pair" notation when displaying the structure. We can even use dotted pair notation ourselves to express lists and other structures:

> '(1 . (2 . (3 . ())))
'(1 2 3)

> (cons 1 (cons 2 3))
'(1 2 . 3)

Of course, the idea of constructing one type out of another is not peculiar to Racket. Can you think of an example from some other language you know?

The one that comes immediately to my mind from your core CS courses is the Java ArrayList. An ArrayList is constructed out of an array, which is a more fundamental type. Java compilers know about arrays, but they don't have to know much about ArrayLists. They do know how to manipulate classes as abstractions, though, and so they can compile and manipulate ArrayLists.

Indeed, every user-defined class in Java is a data abstraction implemented in terms of other values. This holds as well for the classes defined as a part of the Java programming language. Object-oriented programming is a style in which programmers create data abstractions that make it more convenient to write some solutions — and to maintain and extend them over time.

Another prominent data abstraction from another language is C++'s class construct. As designed, C++ programs were to be compiled by C compilers, even though C compilers do not recognize classes. Instead, the C++ pre-processor would translate any code that creates and uses classes and their instances into equivalent C code. The pre-processor translates all classes into equivalent C structs. So, while the primitive term class is a syntactic abstraction, it is also a data abstraction.

The Racket list is a powerful data structure, due largely to its flexibility, but it is also rather inefficient when it comes to accessing elements. We cannot easily or efficiently access any item directly; instead, we must step through items one at a time. Knowing that a list is really a linked structure of cons cells gives this fact away, because we know from our data structures course that linked structures provide O(n) access time.
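
A sketch of why that is: to reach position k in a list, a function has to follow k links from the front. The name nth is hypothetical; Racket's built-in list-ref produces the same results.

```racket
#lang racket
;; nth (hypothetical): walk down the list one cons cell at a time
(define (nth lst k)
  (if (zero? k)
      (first lst)
      (nth (rest lst) (sub1 k))))

(nth '(a b c d) 2)        ; 'c, after following two links
(list-ref '(a b c d) 2)   ; 'c
```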

Racket provides another data aggregate, the vector, which we have used only occasionally to date. Vectors are a primitive data type. We will soon put vectors to more frequent use. You should find that vectors feel rather familiar, based on your programming experiences in other languages. At this point, though, you will want to refresh your memory about vectors by reviewing the Racket Guide, paying particular attention to:

In your data structures course, you learned about hash tables and how to implement them using arrays. We could do the same thing using Racket vectors, but the language creators have saved us the effort by providing this data abstraction as a primitive.
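
As a quick refresher on these two built-in aggregates, here is a short sketch of their most basic operations. The variable names v and h are arbitrary.

```racket
#lang racket
;; A mutable vector with three slots, each initially 0
(define v (make-vector 3 0))
(vector-set! v 1 'x)     ; store into slot 1 directly, no stepping
(vector-ref v 1)         ; 'x
(vector-length v)        ; 3

;; A mutable hash table
(define h (make-hash))
(hash-set! h 'apple 3)
(hash-ref h 'apple)      ; 3
(hash-ref h 'pear 0)     ; 0, the default for a missing key
```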

Data Abstractions as Sugar

In our discussion of syntactic abstractions, we saw that we could define a syntactic abstraction such as let, write a program that translates expressions in a language containing let expressions into expressions in a language without let expressions, and our interpreter wouldn't know the difference. We might like to be able to do that with new data abstractions, too. Sometimes, we can.

A great example of programmers adding their own data abstraction to a language is Generic Java. Those of you who know Java or Ada have learned about generic data types. C++ offers the same capabilities with its template facility. In the beginning, Java did not provide generic class definitions in any way. This caused programmers a lot of grief, in particular all the downcasting we had to do when retrieving objects from a container, such as a Vector.

Well, some programmers solved the problem for themselves by adding generics to the language as a data abstraction. They defined new syntax for expressing generic classes such as Vector<String>, and then wrote a preprocessor that translated code containing generic classes into regular Java.

Eventually, the team in charge of Java decided to add generic classes to the language. When Java 1.5 arrived in September 2004, it included generic types as a part of the language, based on the Generic Java extension. The programmer-added feature became a language feature. Java's implementation of generic types is a great example of data and syntactic abstraction.

Wrap Up