Measures of spread
The five-number summary provides extensive information about where the data
in a data set lies, but sometimes it is nice to have just two numbers
characterizing a data set. In that case, complementing a measure of location
with a measure of spread or variation is an appropriate choice. The data sets
{10, 30, 50, 70, 90} and {40, 45, 50, 55, 60} both have
mean = median = midrange = 50, but they differ in how much the data is spread
out. There are several statistics which characterize the amount of spread:
The range is the extent of the data set; it is defined as the maximum minus
the minimum. For data sets A and B the range is 7 and 8, respectively. For the
weights of students the range is 235-105=130. Note that the range is a single
number, the difference between the maximum and minimum, *not* an ordered pair
specifying the minimum and maximum. Note also that the range is defined from
the extreme individuals, hence does not measure how far a typical data value
is from the middle.
You only need to know the maximum and minimum to calculate the midrange and range; if you know both the midrange and range, you can calculate the maximum and minimum.
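The relationship between the extremes, the range, and the midrange can be checked with a short Python sketch. The function names here are illustrative, not part of any library, and the two data sets are the ones with mean = median = midrange = 50 given earlier.

```python
def range_and_midrange(data):
    """Return (range, midrange), computed from the extremes only."""
    lo, hi = min(data), max(data)
    return hi - lo, (hi + lo) / 2


def extremes_from(rng, midrange):
    """Invert the calculation: recover (minimum, maximum)."""
    return midrange - rng / 2, midrange + rng / 2


wide = [10, 30, 50, 70, 90]
narrow = [40, 45, 50, 55, 60]

wide_range, wide_mid = range_and_midrange(wide)        # 80, 50.0
narrow_range, narrow_mid = range_and_midrange(narrow)  # 20, 50.0

# Round trip: the range and midrange together give back the extremes.
wide_min, wide_max = extremes_from(wide_range, wide_mid)          # 10.0, 90.0
narrow_min, narrow_max = extremes_from(narrow_range, narrow_mid)  # 40.0, 60.0
```

Both data sets share the midrange 50, but the ranges 80 and 20 show how differently they are spread out.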
The inter-quartile range is defined as Q3-Q1.
For the weights of students the inter-quartile range is
175-130 = 45. Note that the inter-quartile range is a single number and not
the ordered pair consisting of the quartiles. [Different definitions for the
quartiles will produce different inter-quartile ranges.] Since Q3 is the
middle of the data above the median, and Q1 is the middle of the data below
the median, Q3-Q1=(Q3-Q2)+(Q2-Q1) is roughly twice the typical distance of a
datum from the median. (The semi-interquartile range has been defined as half
the interquartile range, but is rarely used.) Note that knowing the median and
the inter-quartile range does not let you calculate the first or third
quartile (or the minimum or maximum).
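As a rough illustration, the following Python sketch computes the inter-quartile range by taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data. It assumes that, when the number of values is odd, the median itself is left out of both halves; that is only one of the possible quartile definitions, and others give different answers. The function names and the third data set (`shifted`) are made up for the example.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: with an odd number of values, the median itself is
    excluded from both halves. Other conventions exist.
    """
    values = sorted(data)
    if len(values) < 2:
        raise ValueError("need at least two values")
    half = len(values) // 2
    lower = values[:half]
    upper = values[len(values) - half:]
    return median(lower), median(upper)


def interquartile_range(data):
    """Return Q3 - Q1, a single number."""
    q1, q3 = quartiles(data)
    return q3 - q1


wide = [10, 30, 50, 70, 90]
narrow = [40, 45, 50, 55, 60]
shifted = [0, 20, 50, 60, 80]  # hypothetical data, same median as `wide`

iqr_wide = interquartile_range(wide)        # 80.0 - 20.0 = 60.0
iqr_narrow = interquartile_range(narrow)    # 57.5 - 42.5 = 15.0
iqr_shifted = interquartile_range(shifted)  # 70.0 - 10.0 = 60.0
```

Under this convention `wide` and `shifted` have the same median (50) and the same inter-quartile range (60), yet their quartiles are (20, 80) and (10, 70), which is why the median and inter-quartile range alone do not determine Q1 or Q3.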
There are several approaches to measuring the average distance from the mean.
A first notion might be (1/n)*sum*(x(i) - x-bar), where there are n data
points. However, it is readily verified that this quantity is always 0, hence
it is not of much use. Negative distances cancelling out positive distances
can be avoided by employing the absolute value:
(1/n)*sum*|(x(i) - x-bar)|; this quantity is called the mean deviation (or the
mean absolute deviation). It is a nice concept, but the absolute value is
awkward in many mathematical manipulations, hence it is not widely employed.
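A small Python check of both quantities, taking data set A to be {2, 3, 5, 8, 9} (the values that appear in the variance calculation for data set A). The variable names are illustrative only.

```python
from statistics import mean

data_a = [2, 3, 5, 8, 9]
n = len(data_a)
x_bar = mean(data_a)  # 5.4

# Signed deviations from the mean: negatives cancel positives.
deviations = [x - x_bar for x in data_a]
mean_signed = sum(deviations) / n  # 0, up to floating-point rounding

# Taking absolute values first gives the mean (absolute) deviation.
mean_abs = sum(abs(d) for d in deviations) / n  # about 2.48
```

The signed average vanishes for any data set, since the sum of the x(i) equals n times x-bar; the mean deviation of 2.48 says a value in data set A lies about 2.5 units from the mean on average.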
Another way to avoid negative summands is to square them.
(1/(n-1))*sum*((x(i) - x-bar)^2) is called the variance, which is denoted by
s^2. [The reason for dividing by n-1 rather than n is that this is the
estimate for the variance of a population based on a sample; if we had divided
by n we would have still called it the variance, but denoted it with *sigma*^2
where *sigma* is lower case sigma.] Evaluating this expression for data set
A yields ((2-5.4)^2 + (3-5.4)^2 + (5-5.4)^2 + (8-5.4)^2 + (9-5.4)^2)/4 = 9.3.
Because the variance is in squared units, it is not a good measure of the
average distance from the mean, but its square root 3.05 is (taking the square
root essentially undoes the previous squaring). The square root of the
variance is called the standard deviation, and denoted by s. For the weights
of students the variance is 881.77, and the standard deviation is 29.69.
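The calculation for data set A can be reproduced in Python, both directly from the formula and with the standard library's `statistics` module, whose `variance` and `stdev` divide by n-1 and whose `pvariance` divides by n. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
from math import sqrt
from statistics import pvariance, stdev, variance

data_a = [2, 3, 5, 8, 9]
n = len(data_a)
x_bar = sum(data_a) / n  # 5.4

# Sum of squared deviations from the mean.
sum_sq = sum((x - x_bar) ** 2 for x in data_a)  # about 37.2

s2 = sum_sq / (n - 1)  # sample variance s^2, about 9.3
s = sqrt(s2)           # standard deviation s, about 3.05
sigma2 = sum_sq / n    # dividing by n instead gives about 7.44

# The library functions agree with the hand calculation.
lib_s2 = variance(data_a)
lib_s = stdev(data_a)
lib_sigma2 = pvariance(data_a)
```

Dividing by n-1 rather than n makes a noticeable difference for a data set this small (9.3 versus 7.44); the gap shrinks as n grows.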
Competencies: For the data set {2, 5, 9, 4, 6, 7, 6, 8, 8}, calculate the
variance, standard deviation, range, and inter-quartile range.
Reflection: For the above data set, which of the above statistics best
describes the spread of the data?
Challenge: Is the variance always greater than the standard deviation?
Is the interquartile range always greater than the standard deviation? When
will the variance, standard deviation, range, and interquartile range be equal?