Organizing data
The Concise Oxford Dictionary defines statistics as "numerical facts
systematically collected" and statistic as "statistical fact or item." Data
(singular: datum) is facts or information.
Statistics entails all aspects of information: collecting, organizing,
comprehending, communicating, and interpreting. This course will begin with
descriptive statistics (organizing, comprehending, and communicating data),
then spend
a few weeks on probability, which lays a foundation for inferential statistics
(interpreting, extrapolating). The manner of collection of data (experimental
design) is important
and nontrivial, but will not be a focus of this course.
Categorical (also called qualitative, and sometimes further specified
as nominal or ordinal) data is data where what is being recorded cannot be
readily identified with
the real numbers. Examples include colors of cars (red, green, black), size
of eggs (small, medium, large), sex (male, female). One can count the number
of cars of various colors, and display that information in a bar chart or pie
chart, but one cannot combine the various cars and conclude that the average
car is brown. We shall not devote much attention to categorical data because
we cannot manipulate it, but we shall return to it in the context of
the binomial and multinomial distributions.
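The counting described above can be sketched in a few lines of Python. This
is only an illustration: the list of car colors is made up and is not data
from the course.

```python
from collections import Counter

# Hypothetical sample of car colors (illustrative, not from the course).
cars = ["red", "green", "black", "red", "black", "red", "green", "red"]

# Counting is the one operation categorical data supports directly.
color_counts = Counter(cars)

# A crude text bar chart: one '#' per car of that color.
for color, count in sorted(color_counts.items()):
    print(f"{color:<6}| {'#' * count}")
```

Note that there is nothing sensible to compute in place of an average here:
Python itself refuses sum(cars) with a TypeError.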
Quantitative (also further specified as interval and ratio, the
distinction between which is not of interest for our purposes) data is data
where what is being recorded can be identified with the real numbers.
Examples include age, I.Q.,
weight, height. Identification with the real numbers facilitates organizing,
comprehending,
and communicating this data. In particular we can combine it using algebraic
operations.
N.B.: We can count all data, whether categorical or
quantitative; the terms categorical and quantitative refer to the essence of
the individual items which we are counting. The statement 'there are 3
red-haired men' entails categorical (qualitative) data (red-haired); the
statement 'there are 3 men who are 74 inches tall' entails quantitative
data (74).
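A small Python sketch of this distinction, assuming a made-up list of (hair
color, height in inches) records: both columns can be counted, but only the
quantitative one can be combined algebraically.

```python
from collections import Counter
from statistics import mean

# Hypothetical records (hair color, height in inches); made up for illustration.
men = [("red", 74), ("brown", 70), ("red", 74), ("black", 68), ("red", 74)]

hair = [color for color, _ in men]
heights = [inches for _, inches in men]

# Both kinds of data can be counted.
red_haired = Counter(hair)["red"]      # number of red-haired men
tall_74 = Counter(heights)[74]         # number of men 74 inches tall

# Only the quantitative column supports arithmetic such as an average.
average_height = mean(heights)

print(red_haired, tall_74, average_height)
```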
A natural way to organize quantitative data is with the order property of the
real numbers, i.e., arrange the data from least to greatest. For example, the
30 weights: 185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113, 155, 175, 145,
130 are more easily comprehended in order: 105, 110, 112, 113, 120, 125, 125,
130, 132, 135, 140, 145, 145, 155, 155, 155, 158, 160, 165, 168, 170, 170,
175,
175, 180, 185, 185, 185, 190, 235. Note that each weight has been listed as
many times as it occurs. This information can be visually presented with a
stem-and-leaf plot. A position (e.g., between the tens and units places) is chosen to break the numbers into
a stem and a leaf. The leaf will always be one digit. The stems are listed
on
the left, and the corresponding leaves (if any) on the right. Visually a
stem-and-leaf plot looks like a bar chart; the categories are defined by the
decimal structure of the numbers. A stem-and-leaf plot for the above data
is presented below:
10 | 5
11 | 023
12 | 055
13 | 025
14 | 055
15 | 5558
16 | 058
17 | 0055
18 | 0555
19 | 0
20 |
21 |
22 |
23 | 5
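The ordering and the plot above can be reproduced with a short Python sketch.
The function name stem_and_leaf is my own, and the sketch assumes
non-negative integers broken between the tens and units places.

```python
from collections import defaultdict

# The 30 weights (in pounds) from the text, in their original order.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
           158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113,
           155, 175, 145, 130]

def stem_and_leaf(data):
    """Return plot lines, breaking each number between the tens and units places."""
    leaves = defaultdict(list)
    for x in sorted(data):              # least to greatest
        stem, leaf = divmod(x, 10)      # e.g. 158 -> stem 15, leaf 8
        leaves[stem].append(str(leaf))
    lines = []
    # List every stem from smallest to largest, including stems with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        lines.append(f"{stem} | {''.join(leaves.get(stem, []))}".rstrip())
    return lines

plot = stem_and_leaf(weights)
print("\n".join(plot))
```

Listing the empty stems 20 through 22 is deliberate: the gap before 235 is
part of what the plot communicates.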
Sometimes to enhance visual presentation of data, stems will be split (e.g.,
repeat each stem on the left, once for the digits 0-4, once for the digits
5-9).
Sometimes data are truncated (rightmost digits dropped) in order to have
an informative plot with single-digit leaves.
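Both techniques can be sketched in Python as follows. The helper names
split_stem_and_leaf and truncate are hypothetical, the four-digit readings
are made up, and non-negative integer data is assumed.

```python
def split_stem_and_leaf(data):
    """Stem-and-leaf lines with each stem listed twice: leaves 0-4, then 5-9."""
    data = sorted(data)
    lines = []
    for stem in range(data[0] // 10, data[-1] // 10 + 1):
        leaves = [x % 10 for x in data if x // 10 == stem]
        low = "".join(str(d) for d in leaves if d <= 4)
        high = "".join(str(d) for d in leaves if d >= 5)
        lines.append(f"{stem} | {low}".rstrip())
        lines.append(f"{stem} | {high}".rstrip())
    return lines

def truncate(data, digits=1):
    """Drop the rightmost `digits` digits (truncation, not rounding)."""
    return [x // 10 ** digits for x in data]

# The seven smallest weights from the text, with split stems.
split_plot = split_stem_and_leaf([105, 110, 112, 113, 120, 125, 125])
print("\n".join(split_plot))

# Hypothetical four-digit readings, truncated so that leaves are single digits.
truncated = truncate([1052, 1109, 1123])
print(truncated)
```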
The use of the decimal structure of the numbers sometimes constrains the
ability of a stem-and-leaf plot to visually display where the data lies. Even
the techniques of splitting stems or truncating (dropping) digits may not be
satisfactory. Histograms allow arbitrary sizes for the categories, but the
categories (classes) must be contiguous and must all be the same size.
N.B.: stem-and-leaf plots are a good preliminary way to organize data
prior to representing it with a histogram.
The essence of a histogram is best illustrated by the method of its
construction.
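- Choose the number of classes; this will be an aesthetic judgement based
on the data. Generally you will want between 5 and 20 classes: your goal is
to communicate where the data lies. The choice of the number of classes is
important, although subtle: too many classes leave too few data in each class
to show where the data lies, while too few classes hide the shape of the data.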
- Choose the class size. Divide the range by the target number of classes
above, then round off aesthetically. If you do not end up with the number of
classes above, it will not matter since that number was a rough aesthetic
guess.
- Choose the class marks (centers of the classes) or class boundaries (endpoints of the classes). Again, do this
aesthetically. Since classes are all the same size, if you know the class
marks (which are at the center of the classes),
you know the class boundaries, and vice-versa. It is important that each
datum lies in
exactly one class.
- Draw the histogram. This requires that you count the number of data
which
lie in each class, and make the heights (hence areas) of the bars proportional
to the number of data in each class. Do not forget to label the histogram,
since it will convey no information if it is not labelled.
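The four steps above can be sketched in Python using the 30 weights listed
earlier. The function histogram_counts is a hypothetical helper, and the
first class mark of 100 and class size of 25 are one possible aesthetic
choice rather than the only correct one.

```python
import math

# The 30 weights (in pounds) listed earlier in the text.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
           158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113,
           155, 175, 145, 130]

def histogram_counts(data, first_mark, class_size):
    """Return (lower boundary, upper boundary, count) for equal-size classes."""
    lower = first_mark - class_size / 2      # left boundary of the first class
    if min(data) < lower:
        raise ValueError("the first class starts above the smallest datum")
    n_classes = math.floor((max(data) - lower) / class_size) + 1
    counts = [0] * n_classes
    for x in data:
        # A datum on a boundary is counted in the class to its right.
        counts[math.floor((x - lower) / class_size)] += 1
    return [(lower + i * class_size, lower + (i + 1) * class_size, c)
            for i, c in enumerate(counts)]

# Steps 1 and 2: a target of about 5 classes suggests a size near range / 5.
rough_size = (max(weights) - min(weights)) / 5

# Steps 2 and 3: round the rough size to 25 and take 100 as the first mark.
classes = histogram_counts(weights, first_mark=100, class_size=25)

# Step 4, as text: one '#' per datum, labelled by the class boundaries.
for lo, hi, count in classes:
    print(f"{lo:6.1f}-{hi:<6.1f} {'#' * count}")
```

With these choices the sketch produces six classes rather than five, which
is harmless since the target number was only a rough guess.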
For the above data set, I would want about 5 classes, since with only 30 data
points there would be too few data per class to accurately portray where the
data lies with more classes. My range is 235-105=130; 130/5=26. Hence I
want approximately 26
pounds per class. Therefore, I will choose 25 for my class size. I prefer
class marks to class boundaries, hence I will choose 100, 125, 150, 175, 200,
and 225 as my class marks. This gives me six classes: 87.5-112.5, 112.5-137.5,
137.5-162.5, 162.5-187.5, 187.5-212.5, 212.5-237.5. Note that by choosing an
odd class size and integer class marks my class boundaries are fractional,
hence
no datum lies on a class boundary. If class boundaries such as 100-120, 120-140, etc. are chosen, left endpoint inclusion is used so that a 120 pound individual would be counted as in the 120-140 class, not the 100-120 class.
10_|                           _______
   |                          |       |
   |                   _______|       |
   |           _______|       |       |
   |          |       |       |       |
 5_|          |       |       |       |
   |          |       |       |       |
   |   _______|       |       |       |
   |  |       |       |       |       |
   |  |       |       |       |       |_______ _______
 __|__|_______|_______|_______|_______|_______|_______|____
          |       |       |       |       |       |
         100     125     150     175     200     225
                weights of students in pounds
          Weights of Students in Statistics Course
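The left endpoint inclusion rule mentioned above for boundaries such as
100-120, 120-140 can be sketched with Python's bisect module. The boundary
list and the helper class_index are assumptions for illustration; they are
not the classes used in the histogram above.

```python
from bisect import bisect_right

# The 30 weights (in pounds) from the text.
weights = [185, 160, 235, 165, 125, 175, 185, 132, 168, 112, 170, 155, 105,
           158, 120, 190, 140, 185, 125, 180, 145, 110, 155, 135, 170, 113,
           155, 175, 145, 130]

# Hypothetical integer class boundaries: 100-120, 120-140, ..., 220-240.
boundaries = [100, 120, 140, 160, 180, 200, 220, 240]

def class_index(x, boundaries):
    """Index of the class containing x, with each class including its left endpoint only."""
    i = bisect_right(boundaries, x) - 1
    if i < 0 or i >= len(boundaries) - 1:
        raise ValueError(f"{x} lies outside the classes")
    return i

# Count the data in each class.
counts = [0] * (len(boundaries) - 1)
for w in weights:
    counts[class_index(w, boundaries)] += 1

print(class_index(120, boundaries))  # the 120-pound individual: class 120-140
print(counts)
```

Because every datum lands in exactly one class, the counts always add up to
the number of data.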
Note that all the original data can be recovered from a stem-and-leaf plot
but you only know the approximate value of the data when it is presented in a
histogram.
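A brief Python sketch of this difference, using the first three rows of the
stem-and-leaf plot and the first histogram class from the example above. The
function name is hypothetical, and the plot lines are assumed to have the
form 'stem | leaves'.

```python
def recover_from_stem_and_leaf(lines):
    """Rebuild the data from lines such as '11 | 023' (each leaf is a units digit)."""
    data = []
    for line in lines:
        stem, _, leaves = line.partition("|")
        data.extend(int(stem) * 10 + int(d) for d in leaves.strip())
    return data

# First three rows of the stem-and-leaf plot: every datum comes back exactly.
recovered = recover_from_stem_and_leaf(["10 | 5", "11 | 023", "12 | 055"])
print(recovered)

# First histogram class (87.5-112.5, mark 100, 3 data): only the count survives,
# so the best available stand-in for each datum is the class mark.
approximate = [100] * 3
print(approximate)
```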
Competencies: Give examples of categorical (qualitative) data and
quantitative data.
Present the following weights: {132, 180, 200, 150, 165, 144, 194, 125, 160,
130, 140, 140, 160, 170, 150, 155, 135, 165, 120, 185, 141, 210, 105, 115,
125, 162, 215, 235, 170, 200, 125, 125, 225, 170, 140, 135, 185, 230, 269,
130, 220, 198, 285, 140, 173, 180, 210, 148, 115, 205, 130} as a stem-and-leaf
plot.
Present the above weights as a histogram.
Reflection: When would you violate the above rules for making a
histogram, and how would you do it?
Challenge:
July 2007