Session 22
Data Abstraction and Variety
The Infinite Variety of Implementations
Usually, we think of data structures as having alternative implementations. But even atomic types can be represented in a variety of ways.
Consider one of the simplest data types of all: the non-negative integer. Non-negative integers can be defined with an "interface" of four parts:
- the value zero
- a predicate for determining if an integer is zero or not
- a function for finding an integer's successor
- a function for finding an integer's predecessor
Every integer has a successor; zero doesn't have a predecessor.
This interface is a subset of Peano's axioms. They are part of an ambitious project to define number theory in terms of sets and logic.
We can implement this interface directly in Racket using Racket's own numbers:
(define zero 0)
(define is-zero? zero?)
(define next add1)
(define previous sub1)
Using this interface, we can express the number two as (next (next zero)).
But we can also implement this interface just as easily using a Racket list:
(define zero '())
(define is-zero? null?)
(define next (lambda (n) (cons 1 n)))
(define previous rest)
With this implementation, we still express the number two as (next (next zero)), just as we would in the number-based implementation. The underlying value is represented differently, as (1 1) rather than 2, but its meaning relative to the other operations would be the same.
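To see why the representation does not matter to client code, here is a minimal sketch of an addition function written only in terms of the four-part interface. The name plus is hypothetical, not part of the course code. The sketch uses the list-based definitions, but swapping in the number-based ones should leave plus unchanged.

```racket
#lang racket
;; The list-based implementation of the interface, as given above.
(define zero '())
(define is-zero? null?)
(define next (lambda (n) (cons 1 n)))
(define previous rest)

;; plus is a hypothetical client function. It uses only the interface:
;; zero, is-zero?, next, and previous.
(define (plus m n)
  (if (is-zero? m)
      n
      (next (plus (previous m) n))))

(define two (next (next zero)))
(define three (next two))

(plus two three)   ; five, however the current representation displays it
```

Because plus never inspects the underlying value, it cannot tell a list from a number.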
Check out the code for these implementations — and maybe even try to create your own. There are many more ways. We computer programmers are an ingenious lot. Computation is a flexible medium!
An Opening Exercise
As a warm-up exercise, I would like for you to brainstorm as many different implementations as possible for another simple data type: the pair.
A pair consists of two values. We define a pair with an interface consisting of three functions:
- The operation MAKE-PAIR is a constructor. It takes any two arguments and returns a new pair.
- The access procedure FIRST takes a pair as an argument and returns the first part of that pair.
- The access procedure SECOND takes a pair as an argument and returns the second part of that pair.
> (define pair1 (MAKE-PAIR 2 3))
> (define pair2 (MAKE-PAIR 1 pair1))
> (FIRST pair2)
1
> (FIRST (SECOND pair2))
2
> (SECOND (SECOND pair2))
3
You should be able to come up with two or three based on data types you used in your Intro course, and maybe more.
If you run out of ideas, list ways that you might do this in Java or some other language — even Racket!
Some Possible Implementations of Pairs
The number of ways to implement a pair in Racket is probably larger than you first imagine. I can think of two ways using data types that you have been using all semester:
- ... as a Racket pair, of course.
  (define MAKE-PAIR cons)
  (define FIRST car)
  (define SECOND cdr)
- ... as a Racket list, which is built from pairs.
  (define (MAKE-PAIR a b) (list a b))
  (define FIRST first)
  (define SECOND second)
Back in Session 4, you learned about and had a reading assignment on Racket vectors. We can implement a pair as a vector with two slots:
(define (MAKE-PAIR a b) (vector a b))
(define (FIRST aPair) (vector-ref aPair 0))
(define (SECOND aPair) (vector-ref aPair 1))
If you started thinking about data structures in other languages, you might have listed a Python dictionary or a Java map. Back in Session 4, I also mentioned that Racket has a hash table:
(define (MAKE-PAIR a b) (hash 'first a 'second b))
(define (FIRST aPair) (hash-ref aPair 'first))
(define (SECOND aPair) (hash-ref aPair 'second))
Or you may have thought of a C struct. Racket has a struct datatype, too:
(struct pair (one two))     ; a structure with two fields
(define MAKE-PAIR pair)     ; a Racket-generated constructor
(define FIRST pair-one)     ; and Racket-generated accessors
(define SECOND pair-two)    ; named by struct and field
If you thought of Java, you might have thought of using a class, which is pretty similar to a struct. Racket has classes and objects, too, so we could use the same idea in Racket.
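As a rough sketch of the class-based idea, the following uses Racket's class system. The class name pair% and the method names get-first and get-second are illustrative choices, not taken from the course code.

```racket
#lang racket
;; pair% is a hypothetical class with two fields and two accessor methods.
(define pair%
  (class object%
    (init-field one two)
    (super-new)
    (define/public (get-first) one)
    (define/public (get-second) two)))

;; The pair interface, implemented by creating and messaging objects.
(define (MAKE-PAIR a b) (new pair% [one a] [two b]))
(define (FIRST aPair) (send aPair get-first))
(define (SECOND aPair) (send aPair get-second))

(define pair1 (MAKE-PAIR 2 3))
(define pair2 (MAKE-PAIR 1 pair1))
(FIRST (SECOND pair2))   ; 2
```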
But Wait, There's More...
What other values have we used this semester?
Functions. Lots and lots of functions. In Racket, functions are values, too. Is it possible to implement a pair as a function?
What would this mean? The constructor MAKE-PAIR would have to return a function as its value. The accessors FIRST and SECOND, which operate on pairs, would receive a function as their argument.
Indeed we can implement a pair as a function! Here are three ways.
We could make the pair a selector function.
(define (MAKE-PAIR a b)
  (lambda (selector)
    (if selector a b)))
(define (FIRST aPair) (aPair #t))
(define (SECOND aPair) (aPair #f))
This approach uses boolean values in addition to functions, as well as an if expression.
We could use message passing to simulate how objects work. This generalizes the idea of a selector function to allow different (and more) arguments.
(define (MAKE-PAIR a b)
  (lambda (selector)
    (cond ((eq? selector 'first ) a)
          ((eq? selector 'second) b))))
(define (FIRST aPair) (aPair 'first))
(define (SECOND aPair) (aPair 'second))
This approach uses symbols, and symbol equality, in addition to functions, booleans, and a cond expression.
Both of these solutions use functions in conjunction with another data type to implement a pair. Can we implement a pair using only functions?
We can. This implementation uses nothing but functions:
(define (MAKE-PAIR a b)
  (lambda (proc)
    (proc a b)))
(define (FIRST aPair) (aPair (lambda (x y) x)))
(define (SECOND aPair) (aPair (lambda (x y) y)))
I love this last solution. Whenever I see it, I smile. It hints at how much one can do with nothing but functions.
The lambda calculus underlies most programming language theory and inspired the creators of Lisp, Scheme, Racket, and many other languages. It relies solely on function definition, function application, and variable substitution to do all of its computation. It does not even use boolean values or an if statement, which seem to be at the core of every programming language we know. Maybe those things aren't really essential after all?
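To make that question concrete, here is a small sketch of booleans and selection built from nothing but functions, in the style of the lambda calculus. The names TRUE, FALSE, IF, and NOT are illustrative. Because Racket evaluates arguments eagerly, this sketch assumes the two branches are passed as zero-argument functions (thunks), so that only the chosen branch runs.

```racket
#lang racket
;; A boolean is a function that chooses one of its two arguments.
(define TRUE  (lambda (x y) x))
(define FALSE (lambda (x y) y))

;; IF lets the boolean choose a thunk, then calls the chosen thunk.
(define (IF condition then-thunk else-thunk)
  ((condition then-thunk else-thunk)))

;; NOT swaps the choices.
(define (NOT b) (b FALSE TRUE))

(IF TRUE
    (lambda () 'yes)
    (lambda () 'no))    ; 'yes
```

Notice that TRUE and FALSE are the same two functions that FIRST and SECOND pass to a pair in the pure-function implementation.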
Run the Code
This file contains all eight Racket implementations of pairs shown above. Try them out!
And, so that you know this isn't just a strange phenomenon available only in Racket, here are two implementations of the pair in Python, including the pure function implementation... (Remember, Python has lambdas, too.)
PS: There Is Always More
Of course, if we think a little harder, we can probably find cool ways to use Racket's other primitive types, such as numbers, strings, and symbols, to encode a pair. Those implementations will take a little bit more effort — and code. They also might not be as general.
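As one illustration of a number-based encoding, the sketch below stores a pair of non-negative integers as the single number 2^a * 3^b and recovers each part by counting factors. The helper count-factor is a hypothetical name. This is exactly the kind of less general solution mentioned above: it is assumed to work only when both parts are non-negative integers (or pairs built the same way), and the numbers grow very quickly.

```racket
#lang racket
;; Encode the pair (a, b) as the integer 2^a * 3^b.
(define (MAKE-PAIR a b)
  (* (expt 2 a) (expt 3 b)))

;; count-factor (hypothetical helper): how many times does p divide n?
(define (count-factor n p)
  (if (zero? (remainder n p))
      (add1 (count-factor (quotient n p) p))
      0))

(define (FIRST aPair) (count-factor aPair 2))
(define (SECOND aPair) (count-factor aPair 3))

(define pair1 (MAKE-PAIR 2 3))       ; 108
(define pair2 (MAKE-PAIR 1 pair1))   ; a very large integer
(FIRST (SECOND pair2))               ; 2
```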
Indeed, in this session three years ago, one of the students (thank you, Henry!) asked if we could implement a pair using the set data type that we implemented in Homework 7 and used in Sessions 19-20.
I did not know... At first, it seemed impossible, because a pair is ordered and sets are unordered. Even so, while the students worked on Quiz 3, I worked on this challenge as my quiz. It turns out that it is possible! If you'd like to see how, see this implementation.
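For readers who want a hint before following the link, here is one well-known way to do it, sketched with Racket's built-in sets rather than the set type from Homework 7. It is not necessarily the implementation linked above. The idea, due to Kuratowski, is to represent the ordered pair (a, b) as the set {{a}, {a, b}}: the first part is the element common to every member set, and the second part is whatever else appears.

```racket
#lang racket
;; Represent (a, b) as the set { {a}, {a, b} }.
(define (MAKE-PAIR a b)
  (set (set a) (set a b)))

;; The first part is the one element shared by all member sets.
(define (FIRST aPair)
  (set-first (apply set-intersect (set->list aPair))))

;; The second part is the element that is not shared. If nothing is
;; left over, then both parts were equal and the shared element is it.
(define (SECOND aPair)
  (let* ([members (set->list aPair)]
         [common  (apply set-intersect members)]
         [extra   (set-subtract (apply set-union members) common)])
    (if (set-empty? extra)
        (set-first common)
        (set-first extra))))

(define pair1 (MAKE-PAIR 2 3))
(define pair2 (MAKE-PAIR 1 pair1))
(FIRST (SECOND pair2))   ; 2
```

The order of the pair is recovered from set membership alone, so the unordered nature of sets is not an obstacle.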
One moral of this story is:
Do not assume anything about the implementation of an interface — even the simplest interface!
Some of these implementations might be outside the scope of your imagination just yet. The pure functional implementation probably is. I hope that our study of data abstraction will stretch our minds to a point where these don't seem so strange. Note in particular that we will use the 'message passing' approach above to implement object-oriented programming in Racket.
Setting The Stage
For the last six sessions, we have been exploring the idea of syntactic abstractions, those features of a language that are convenient to have but not essential to the language. We considered several examples: local variables, local functions, non-if selection statements, logical connectives, and, most recently, variable names. Our goal in studying these syntactic abstractions was not to study Racket per se but to see why and how language interpreters provide such abstractions. Indeed, you can identify many abstractions in other languages you know.
Beginning with this session, we move on to another sort of abstraction that all languages provide: data abstraction. We will introduce the idea of data abstraction by returning to an idea you know well: data types and their implementations.
Data Abstraction
Programming requires two kinds of abstraction.
A syntactic abstraction offers a different way to express behaviors. It does not add to what can be expressed in the language, but it does add to what can be expressed conveniently.
A data abstraction offers a different way to express values. Usually, a data abstraction allows you to both represent and manipulate the data. In practice, these data are often aggregate values, but that is not necessarily true, as we saw at the beginning of the session.
As with syntactic abstractions, a data abstraction does not add to the set of problems that can be solved in a language. It makes some solutions easier or more convenient to create.
We have already used one data abstraction extensively in this course: Racket's list, which is constructed out of the more fundamental type, the pair. Racket lists are implemented in terms of another data type, so they are a data abstraction. Technically, we don't need lists, but they make our jobs more convenient. Each list is built out of a sequence of pairs (primitive cons cells) and the empty list (the null pointer). The language provides an interface that, for the most part, hides from us the underlying data representation.
When we build a list incorrectly, Racket occasionally reminds us that lists are built out of pairs, by showing "dotted pair" notation when displaying the structure. We can even use dotted pair notation ourselves to express lists and other structures:
> '(1 . (2 . (3 . ())))
'(1 2 3)
> (cons 1 (cons 2 3))
'(1 2 . 3)
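The same distinction can be checked with predicates. In this short sketch, the names proper and improper are illustrative. Both values are chains of pairs, but only the chain that ends in the empty list counts as a list.

```racket
#lang racket
;; A chain of pairs ending in the empty list is a list.
(define proper (cons 1 (cons 2 (cons 3 '()))))

;; A chain of pairs ending in anything else is not.
(define improper (cons 1 (cons 2 3)))

(equal? proper (list 1 2 3))   ; #t
(list? proper)                 ; #t
(pair? proper)                 ; #t
(list? improper)               ; #f
(pair? improper)               ; #t
```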
Of course, the idea of constructing one type out of another is not peculiar to Racket. Can you think of an example from some other language you know?
The one that comes immediately to my mind from your core CS courses is the Java ArrayList. An ArrayList is constructed out of an array, which is a more fundamental type. Java compilers know about arrays, but they don't have to know much about ArrayLists. They do know how to manipulate classes as abstractions, though, and so they can compile and manipulate ArrayLists.
Indeed, every user-defined class in Java is a data abstraction implemented in terms of other values. This holds as well for the classes defined as a part of the Java programming language. Object-oriented programming is a style in which programmers create data abstractions that make it more convenient to write some solutions — and to maintain and extend them over time.
Another prominent data abstraction from another language is C++'s class construct. As designed, C++ programs were to be compiled by C compilers, even though C compilers do not recognize classes. Instead, the C++ pre-processor would translate any code that creates and uses classes and their instances into equivalent C code. The pre-processor translates all classes into equivalent C structs. So, while the primitive term class is a syntactic abstraction, it is also a data abstraction.
The Racket list is a powerful data structure, due largely to its flexibility, but it is also rather inefficient when it comes to accessing elements. We cannot easily or efficiently access any item directly; instead, we must step through items one at a time. Knowing that a list is really a linked structure of cons cells gives this fact away, because we know from our data structures course that linked structures provide O(n) access time.
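A small sketch makes the cost visible. The function nth below is a hypothetical stand-in for Racket's built-in list-ref, written directly against cons cells. It assumes the index is within range. Reaching position k takes k steps down the chain of cdrs.

```racket
#lang racket
;; nth (hypothetical): walk k cons cells, then take the car.
;; Each recursive call follows one cdr, so access is O(k).
(define (nth lst k)
  (if (zero? k)
      (car lst)
      (nth (cdr lst) (sub1 k))))

(nth '(a b c d) 2)   ; 'c
```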
Racket provides another data aggregate, the vector, which we have used only occasionally to date. Vectors are a primitive data type. We will soon put vectors to more frequent use. You should find that vectors feel rather familiar, based on your programming experiences in other languages. At this point, though, you will want to refresh your memory about vectors by reviewing the Racket Guide, paying particular attention to:
- the print form of vectors, #( ... ), and
- the basic predefined functions
  - vector and vector?
  - vector-length and vector-ref
  - vector->list and list->vector
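As a quick refresher, here is a short sketch that exercises each of the functions listed above. The name v is illustrative.

```racket
#lang racket
(define v (vector 'a 'b 'c))

(vector? v)                 ; #t
(vector? '(a b c))          ; #f
(vector-length v)           ; 3
(vector-ref v 1)            ; 'b, because indexing starts at 0
(vector->list v)            ; '(a b c)
(list->vector '(1 2 3))     ; a vector holding 1, 2, and 3
```

Unlike stepping through a list, vector-ref reaches any slot directly.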
In your data structures course, you learned about hash tables and how to implement them using arrays. We could do the same thing using Racket vectors, but the language creators have saved us the effort by providing this data abstraction as a primitive.
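For comparison, here is a minimal sketch of what building a hash table on top of a vector might look like. All of the names (make-table, bucket-index, table-put!, table-get) are hypothetical, and this is not how Racket's own hash tables are implemented. The sketch assumes a fixed number of buckets, each holding a list of key-value pairs.

```racket
#lang racket
;; A table is a vector of buckets. Each bucket is a list of (key . value) pairs.
(define (make-table size)
  (make-vector size '()))

;; Map a key to a bucket index. modulo keeps the result non-negative.
(define (bucket-index table key)
  (modulo (equal-hash-code key) (vector-length table)))

;; Add or replace the entry for key.
(define (table-put! table key value)
  (let* ([i      (bucket-index table key)]
         [bucket (vector-ref table i)]
         [others (filter (lambda (entry) (not (equal? (car entry) key)))
                         bucket)])
    (vector-set! table i (cons (cons key value) others))))

;; Look up key, returning default if it is absent.
(define (table-get table key default)
  (let ([entry (assoc key (vector-ref table (bucket-index table key)))])
    (if entry (cdr entry) default)))

(define t (make-table 8))
(table-put! t 'first 2)
(table-put! t 'second 3)
(table-get t 'second #f)   ; 3
```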
Data Abstractions as Sugar
In our discussion of syntactic abstractions, we saw that we could define a syntactic abstraction such as let, write a program that translates expressions in a language containing let expressions into expressions in a language without let expressions, and our interpreter wouldn't know the difference. We might like to be able to do that with new data abstractions, too. Sometimes, we can.
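As a reminder of what such a translation looks like, here is a minimal sketch of a translator for a single let expression. The name let->app is hypothetical, and the sketch assumes a let with exactly one body expression and no nested let expressions to translate.

```racket
#lang racket
;; let->app (hypothetical): rewrite
;;   (let ((var val) ...) body)
;; as the equivalent application
;;   ((lambda (var ...) body) val ...)
(define (let->app exp)
  (let ([bindings (second exp)]
        [body     (third exp)])
    (cons (list 'lambda (map first bindings) body)
          (map second bindings))))

(let->app '(let ((x 2) (y 3)) (+ x y)))
; '((lambda (x y) (+ x y)) 2 3)
```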
A great example of programmers adding their own data abstraction to a language is Generic Java. Those of you who know Java or Ada have learned about generic data types. C++ offers the same capabilities with its template facility. In the beginning, Java did not provide generic class definitions in any way. This caused programmers a lot of grief, in particular all the downcasting we had to do when retrieving objects from a container, such as a Vector.
Well, some programmers solved the problem for themselves by adding generics to the language as a data abstraction. They defined new syntax for expressing generic classes such as Vector<String>, and then wrote a preprocessor that translated code containing generic classes into regular Java.
Eventually, the team in charge of Java decided to add generic classes to the language. When Java 1.5 arrived in September 2004, it included generic types as a part of the language, based on the Generic Java extension. The programmer-added feature became a language feature. Java's implementation of generic types is a great example of data and syntactic abstraction.
Wrap Up
- Reading
  - Review these lecture notes, especially the sections on data abstraction, which we covered briefly in class.
  - Read this short refresher on making data abstractions, especially the distinction between interface and implementation. We will be using these ideas throughout the next unit of the course.
  - Read this short refresher on mathematical functions. We will be running with those ideas in our next session.
- Homework
  - Nothing new yet. Homework 9 will be available next time. It will be the first of a three-part adventure in which you implement a working interpreter for a new language.
- Quiz
  - Quiz 3, over syntactic abstractions, is today at the end of class.