It’s been a while since we’ve written about statistics, so I want to start a short series about that. Here, we’ll look into **stem-and-leaf plots** (also called stemplots).

## Creating and using a stem-and-leaf plot

We’ll start with a question from 1997:

Stem-and-leaf Graph or Stemplot

Hi! I was doing a math-a-thon and I got a problem about a stem leaf graph. I am in the advanced math class. My math teacher said it would take two days to teach his advanced class how to do it. Can you help?

Doctor Chita answered:

Hi Scott: Sure, I can try to help. A stem-and-leaf graph, also called a stemplot, is a way to represent the distribution of numeric data. It was invented by John Tukey, a mathematician, and is a quick way to picture data for numbers that are greater than 0. I'll explain using an example.

Tukey’s “exploratory data analysis” is used to visualize data by hand, when there are not too many numbers; the plot looks much like a histogram, showing the “shape” of the data at a glance, but includes the actual data values. It can also be used as a trick for sorting data, as we’ll see. (It *can* actually be used with negative data, but we rarely see that.)

### Making a stemplot

Suppose you have the following set of numbers (they might represent the number of home runs hit by a major league baseball player during his career).

32, 33, 21, 45, 58, 20, 33, 44, 28, 15, 18, 25

The stem of a stemplot can have as many digits as needed, but the leaves should contain only one digit. To create a stemplot to display the above data, you must first create the stem. Since all of the numbers have just two digits, start by arranging the tens digits from smallest to largest.

1

2

3

4

5

Usually we will be dealing with two-digit numbers; sometimes we need to round in order to have only two digits, and often we need to work around a decimal point, as we’ll see. We think of each number as consisting of a “leaf”, the last digit, that identifies the individual number, and the rest of the number as a “stem”, by which the numbers are grouped.

To create the leaves, draw a vertical bar after each of the tens digits and arrange the ones digits from each number in the data set in order from smallest to largest. If there are duplicate numbers, like 33, list each one.

1|58

2|0158

3|233

4|45

5|8

The shape of the resulting display looks something like a bar graph oriented vertically. By examining the stemplot, you can determine certain properties of the data.

For example, to plot the first number, 32, we put its leaf, 2, to the right of its stem, 3. Commonly we will initially place the leaves in the order they arrive, which for our example of 32, 33, 21, 45, 58, 20, 33, 44, 28, 15, 18, 25 will produce this:

1|58

2|1085

3|233

4|54

5|8

For some purposes, it can be left unsorted like this; but for the uses to which we will put it, we need to sort the leaves on each stem, as he did above:

1|58

2|0158

3|233

4|45

5|8

In doing this, we have sorted all the numbers, which we can read back out as

15, 18,

20, 21, 25, 28,

32, 33, 33,

44, 45,

58
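The construction described above can be sketched in a few lines of Python. This is only an illustration, not part of the original answer: the function names `make_stemplot` and `read_back` are made up here, and the sketch assumes non-negative whole numbers whose last digit is the leaf. It lists only stems that actually occur, which is enough for this data set because every stem from 1 to 5 has at least one leaf.

```python
from collections import defaultdict

def make_stemplot(data):
    """Group non-negative integers by stem and sort the leaves on each stem."""
    rows = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)  # stem = leading digits, leaf = last digit
        rows[stem].append(leaf)
    # Sort the stems, then sort the leaves on each stem
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}

def read_back(plot):
    """Recover the sorted data values from a sorted stemplot."""
    return [stem * 10 + leaf for stem, leaves in plot.items() for leaf in leaves]

data = [32, 33, 21, 45, 58, 20, 33, 44, 28, 15, 18, 25]
plot = make_stemplot(data)
for stem, leaves in plot.items():
    print(f"{stem}|{''.join(str(leaf) for leaf in leaves)}")
print(read_back(plot))
```

Reading the rows back in order gives the same list as sorting the data directly, which is why the plot doubles as a sorting trick.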

Now we can put the plot to use.

### Finding the median and mode

You can find the median by counting from either end of the stemplot until you find its center. Here, since there are 12 numbers, the center lies between 28 and 32. The median is the average of the two data points: (28+32)/2 = 30.

Here I have crossed out the leaves in pairs, working inward from each end, and ending with the two middle numbers in **bold**:

1|~~58~~

2|~~015~~**8**

3|**2**~~33~~

4|~~45~~

5|~~8~~

Here we can see the middle numbers, 2**8** and 3**2**. The median is their average. (If there had been an odd number of values, we would have found one middle number, which would be the median.)

We can also just count the total number of data values, \(2+4+3+2+1=12\), and count 6 from one end (left to right, top to bottom) and 6 from the other end (right to left, bottom to top) to find the middle:

```
---->
1|58
2|0158
      |
3|233
4|45
5|8
<----
```

Both approaches use the fact that the leaves represent all the data values listed in order, making this a shorthand for the complete sorted list.
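As a rough sketch of the counting approach, the following Python reads the leaves out in order and picks the middle position (or averages the two middle values). The function name `median_from_stemplot` and the dictionary representation of the plot are assumptions made for this illustration.

```python
def median_from_stemplot(plot):
    """Find the median by counting to the middle of the ordered leaves."""
    # Reading stems top to bottom and leaves left to right gives the sorted data
    values = [stem * 10 + leaf for stem in sorted(plot) for leaf in sorted(plot[stem])]
    n = len(values)
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                       # one middle value
    return (values[mid - 1] + values[mid]) / 2   # average of the two middle values

plot = {1: [5, 8], 2: [0, 1, 5, 8], 3: [2, 3, 3], 4: [4, 5], 5: [8]}
print(sum(len(leaves) for leaves in plot.values()))  # total count of leaves
print(median_from_stemplot(plot))
```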

You can also determine if there is a mode in the data set by looking at the plot. Here, the number 33 is the mode since it is the only value that occurs more than once.

We can determine this simply by looking in each row for duplicate digits:

1|58

2|0158

3|2**33**

4|45

5|8

In general, there could be no mode, or several. See Three Kinds of “Average”.
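A minimal Python sketch of this row-by-row search follows. The name `modes_from_stemplot` is hypothetical, and the sketch adopts one common convention as an assumption: if no value occurs more than once, it reports no mode at all (an empty list).

```python
from collections import Counter

def modes_from_stemplot(plot):
    """Return every value that occurs most often, or [] if nothing repeats."""
    # Count each (stem, leaf) pair, so equal digits on different stems stay separate
    counts = Counter((stem, leaf) for stem, leaves in plot.items() for leaf in leaves)
    top = max(counts.values(), default=0)
    if top < 2:
        return []  # assumption: no repeated value means no mode
    return sorted(stem * 10 + leaf for (stem, leaf), c in counts.items() if c == top)

plot = {1: [5, 8], 2: [0, 1, 5, 8], 3: [2, 3, 3], 4: [4, 5], 5: [8]}
print(modes_from_stemplot(plot))
```

Counting pairs rather than bare digits is the point of the design: the leaf 5 appears on three stems here, but 15, 25, and 45 are three different values.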

### Handling larger numbers

If your data contain three-digit numbers (like batting averages, for example), you can use the same technique. For example, let's assume the data are

298, 303, 285, 311, 225, 315, 250, 305

Ignore the ones digits in each number (these will be the leaves) and look at the remaining two digits in each number (the hundreds and tens digits). The stem will begin at 22 because the smallest number in the data set is 225. The stem will end at 31 because the largest number is 315. Include the two-digit numbers between 22 and 31 in the body of the stem.

It’s important to note that even stems with *no* leaves are to be included (see below), in order to accurately reflect the shape of the entire distribution. This is why we first find the smallest and largest numbers and list all stems between them, rather than just writing them as we find them.

Once you have the stem, then list the ones digits in each number after the corresponding two-digit number before it. The stemplot will look like this, with no leaves after the numbers without a corresponding value.

22|5

23|

24|

25|0

26|

27|

28|5

29|8

30|35

31|15

If these data represent the batting averages for a particular player, this display indicates that he has had a very successful career - most of his averages are clustered between 280 and 320.

If the numbers were more widely scattered (e.g. from 225 to 791, with 58 stems from 22 to 79, rather than just ten), this method would not work well, and we would probably round to the nearest ten, so that the stems would have only one digit.

One thing not mentioned here is that we often find a decimal point in the data, which we ignore; the plot above could just as well have represented the data 0.298, 0.303, 0.285, 0.311, 0.225, 0.315, 0.250, 0.305, or the data 2.98, 3.03, 2.85, 3.11, 2.25, 3.15, 2.50, 3.05. For this reason, it is common to include a “key” to explain the interpretation. For the original set of data, this might look like

Key: 29|8 = 298

For the others, it might be

Key: 29|8 = 0.298

Key: 29|8 = 2.98
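Here is one way the three-digit example might be sketched in Python, keeping the empty stems and printing a key. The helper names `stemplot_lines` and `key_line`, and the `places` parameter (how many decimal places the original data had), are assumptions introduced for this illustration.

```python
def stemplot_lines(data):
    """Build stemplot rows for whole numbers, including stems with no leaves."""
    rows = {}
    for value in data:
        stem, leaf = divmod(value, 10)  # last digit is the leaf
        rows.setdefault(stem, []).append(leaf)
    lines = []
    # List every stem from smallest to largest, even those without leaves
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in sorted(rows.get(stem, [])))
        lines.append(f"{stem}|{leaves}")
    return lines

def key_line(stem, leaf, places=0):
    """Explain how to read one entry; `places` shifts the decimal point."""
    value = (stem * 10 + leaf) / 10 ** places
    return f"Key: {stem}|{leaf} = {value:.{places}f}"

data = [298, 303, 285, 311, 225, 315, 250, 305]
print("\n".join(stemplot_lines(data)))
print(key_line(29, 8))      # data taken as 298
print(key_line(29, 8, 3))   # data taken as 0.298
```

The plot itself is identical for all three versions of the data; only the key changes.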

## Finding the mean

A 1996 question fills in a little gap:

Stem and Leaf Plots

Dear Dr. Math, I am in the Math Counts math competition, and when doing practice problems we came across this problem: Use the stem-and-leaf plot of the recent art project scores to find the mean score. Express as a decimal.

5 | 0 0

4 | 9 7 3 3 1

3 | 8 7

2 | 9

What in the world is a stem-and-leaf plot?

Thank you very much, Molly

Here, rather than starting with data and *making* a stemplot, we are given one and asked to *interpret* it. (Note that the stems here are given in reverse order.) Doctor Robert answered, not giving a full explanation, but focusing on how to find the mean:

Stem and leaf plots are a way that statisticians can look at the distribution of numbers given to them to analyze. For example, in the stem-and-leaf plot you show, there were two scores in the 50's (they were both 50), five scores in the forties (49, 47, 43, 43, 41), two scores in the thirties (38, 37) and one score in the twenties (29). So all of the art scores were 50, 50, 49, 47, 43, 43, 41, 38, 37, and 29. You can find the average score by adding them and dividing by 10.

So the mean is just $$\frac{50+50+49+47+43+43+41+38+37+29}{10}=\frac{427}{10}=42.7$$

The mean doesn’t fit as well into this format as the median and mode; here we are just extracting the original data and finding their mean, rather than using the numbers as displayed. I’ll suggest a possible alternative below.
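Interpreting a given plot, as Doctor Robert did, amounts to gluing each leaf back onto its stem. A small Python sketch of that reading step follows. The function name `parse_stemplot` is made up here, and the sketch assumes each row is written as a stem, a vertical bar, and leaves separated by spaces.

```python
def parse_stemplot(text):
    """Turn rows like '4 | 9 7 3 3 1' back into the list of data values."""
    values = []
    for line in text.strip().splitlines():
        stem, _, leaves = line.partition("|")
        for leaf in leaves.split():
            values.append(int(stem) * 10 + int(leaf))  # reattach leaf to stem
    return values

plot_text = """
5 | 0 0
4 | 9 7 3 3 1
3 | 8 7
2 | 9
"""
scores = parse_stemplot(plot_text)
mean = sum(scores) / len(scores)
print(scores)
print(mean)
```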

## Finding the mode, mean, and median

One last question, from 2002, will provide a useful review.

Mode, Mean, and Median in Stemplots

I'm trying to help my 6th grader do homework. How do I find a "mode," "mean," and "median" using a stem/leaf plot? Problem:

stem leaf

1 889

2 035579

3 138

4 235

Doctor TWE answered:

Hi Linda - thanks for writing to Dr. Math. Each stem-and-leaf combination represents a data point in our set. So to find the mode, mean, and median of the set, we have to figure out how to interpret their definitions for this type of representation.

Presumably the student this time knows how to make and read a stemplot, which in this example represents the data $$18,18,19,20,23,25,25,27,29,31,33,38,42,43,45$$

### Mode

The mode is defined as the data value that occurs most often. So we are looking for the leaf (number) that occurs the most often on one stem of the diagram. In your example, there are two 8 leaves on the 1 stem (i.e. two data points of value 18), and two 5 leaves on the 2 stem (i.e. two data points of value 25). So the data set is "bi-modal" with modes of 18 and 25. Note that I did not count the 5 leaf on the 4 stem because it represents a different value (45) - it just happens to have the same last digit as my mode of 25. I similarly did not count the 8 leaf on the 3 stem, nor the three different 3 leaves.

This is important: Digits on different stems represent different numbers, so we are not counting identical digits, but identical digits *on the same stem*. The two 9’s do not represent the same number, so we ignore them. Here, the two modes are in red and in green:

1 **88**9

2 03**55**79

3 138

4 235

$$\mathbf{{\color{Red}{18,18}}},19,20,23,\mathbf{{\color{DarkGreen}{25,25}}},27,29,31,33,38,42,43,45$$

### Mean

The mean is the conventional "average," and perhaps the best way to find this is to do it the conventional way - add the values and divide by the number of numbers. With the stem-and-leaf plot, that means that we'll have to "read" each stem-and-leaf as a conventional number. For your example we'll get:

(18+18+19+20+23+25+25+27+29+31+33+38+42+43+45) / 15 = 436/15 ≈ 29.1

(Do you see how I got the numbers I added?)

We could, instead, add all the leaves, then add the sum of each stem digit multiplied by its number of leaves, in order to more directly use the stemplot format: $$(8+8+9+0+3+5+5+7+9+1+3+8+2+3+5)+3(10)+6(20)+3(30)+3(40)=\\76+[30+120+90+120]=76+360=436$$ I haven’t seen this done, though!
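To check that this alternative agrees with the conventional sum, here is a short Python sketch. The variable names are my own, and the plot is stored as a dictionary from each stem to its list of leaves, which is just one convenient assumption.

```python
plot = {1: [8, 8, 9], 2: [0, 3, 5, 5, 7, 9], 3: [1, 3, 8], 4: [2, 3, 5]}

# Sum of all the leaves, taken as single digits
leaf_sum = sum(sum(leaves) for leaves in plot.values())

# Each stem contributes ten times its digit, once for every leaf it carries
stem_sum = sum(stem * 10 * len(leaves) for stem, leaves in plot.items())

count = sum(len(leaves) for leaves in plot.values())
total = leaf_sum + stem_sum
mean = total / count

print(leaf_sum, stem_sum, total)
print(round(mean, 1))
```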

We can also observe that the mean is located in the middle of the data, as indicated by the asterisk:

```
1 889
2 035579
        *
3 138
4 235
```

### Median

The median is the middle value in the set. This is relatively simple. Start crossing off pairs of high and low leaves. Start with the leftmost leaf on the top stem (the lowest value) and the rightmost leaf on the bottom stem (the highest value). When you only have one (or two) leaves left that have not been crossed out, that value (or the average of the two values) is the median. In your example (I'm using matching symbols to show which two were crossed out as a pair):

stem leaf

1 X*#

2 -+=@7@

3 =+-

4 #*X

The one I'm left with is the 7 leaf on the 2 stem, so the median is 27.

That is, crossing out the leaves in pairs and leaving the middle value in **bold**,

1 ~~889~~

2 ~~0355~~**7**~~9~~

3 ~~138~~

4 ~~235~~

In real life we would just mark digits in the order I did here, crossing them off or underlining. And the process is just what we do when the data are all written out: $$18,18,19,20,23,25,25,\mathbf{27},29,31,33,38,42,43,45$$
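The crossing-off procedure can be imitated directly in Python, as a sketch: repeatedly drop the lowest and highest remaining values until one or two are left. The function name `median_by_crossing_off` is hypothetical, and the sketch assumes the list is not empty.

```python
def median_by_crossing_off(values):
    """Cross off the lowest and highest values in pairs until 1 or 2 remain."""
    remaining = sorted(values)
    while len(remaining) > 2:
        remaining = remaining[1:-1]  # cross off one low value and one high value
    # One value left: that is the median; two left: take their average
    return sum(remaining) / len(remaining)

data = [18, 18, 19, 20, 23, 25, 25, 27, 29, 31, 33, 38, 42, 43, 45]
print(median_by_crossing_off(data))
```

With 15 values, seven pairs are crossed off and 27 is left. With the 12 values from the first example, five pairs are crossed off and the average of 28 and 32 is returned.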