Mean and Standard Deviation of Grouped Data

Two of our most-viewed posts deal with Mode and Median of Grouped Data: how to calculate these statistics for data that is supplied in the form of frequencies for classes of data (bins), rather than the individual data values. Here we’ll complete that topic with a look at the less troublesome cases of Mean and Standard Deviation, including some issues that arise in the grouping process itself. Next time I’ll look at another grouping issue that was raised in a recent question.

Grouped data (continuous)

Here is a question from 1999:

Statistics of Grouped Data

I am preparing to take a statistics course after many years of being able to avoid doing stats. I am doing some preparatory work before the course starts. Can you please give me answers to the following questions?

                      Grouped Data

  Income (*$1000)  Midpoint(x)  Number of Purchasers
  ---------------  -----------  --------------------
    20 - 29.99         25               50
    30 - 39.99                          20
    40 - 49.99                          31
    50 - 59.99                          39
    60 - 69.99                          35
    70 - 79.99                          30
    80 - 89.99                          25
    90 - 99.99                          18

The above table is data from a survey of recent purchasers of superannuation plans.

1. Find the mean and standard deviation of the income of people purchasing superannuation plans.

    Find the mean
    Find the variance
    Find the standard deviation

2. Find the median class.

3. Choose a suitable graph and display the frequency distribution.

4. Summarize the findings.

Tony is asking for basic instruction in calculating the mean, variance, and standard deviation of a frequency distribution. The table (a frequency distribution) shows that, for instance, 50 people in the survey had incomes from $20,000 through $29,999.99 (assuming that 29.99 doesn’t mean, literally, $29,990, but really means “anything less than $30,000”; some authors would write “20 – <30”). These numbers are called “class boundaries”, and are relevant when the data are continuous, allowing in effect any real number from 20,000 up to (but not including) 30,000 in this class.

The midpoint column is not filled in except the first line; it represents the average of the high and low values for the class, in this case \((20 + 30)\div 2 = 25\). The idea is that if the people in this class are uniformly distributed across this interval, their average income would be $25,000. In effect, we will be pretending that there are 50 people with that income, and 20 with $35,000 (the midpoint of the next class), and so on.

Let’s fill in the rest of the midpoints:

  Income (*$1000)  Midpoint(x)  Number of Purchasers
  ---------------  -----------  --------------------
    20 - 29.99         25               50
    30 - 39.99         35               20
    40 - 49.99         45               31
    50 - 59.99         55               39
    60 - 69.99         65               35
    70 - 79.99         75               30
    80 - 89.99         85               25
    90 - 99.99         95               18

Mean

Doctor Mitteldorf replied, answering each question in turn:

To start: find the mean of the distribution as follows.

First, find the total number of buyers. Do this by adding up the column with the numbers from each income category. Second, find the total of all their incomes. This you can do approximately, since you have an estimate of the incomes of the people in the group. On each line, multiply the midpoint income times the number of people in the group. Add up the products. This should give a reasonable estimate of the total income. Divide total income by total buyers to give the mean income.

There are 50 people in the first class, 20 in the second, and so on; so the total is 248. We’ll write this sum at the bottom of that column.

If there are 50 people with 25 thousand dollars, their total income is \(50\times 25 = 1250\) thousand dollars. We do that for each row, and add them up:

  Income (*$1000)  Midpoint(x)  Number   N*x
  ---------------  -----------  ------  ----
    20 - 29.99         25         50    1250
    30 - 39.99         35         20     700
    40 - 49.99         45         31    1395
    50 - 59.99         55         39    2145
    60 - 69.99         65         35    2275
    70 - 79.99         75         30    2250
    80 - 89.99         85         25    2125
    90 - 99.99         95         18    1710
    ------                       ---   -----
    Total:                       248   13850

So the 248 people have about $13,850 thousand total, and the mean income is \(13850\div 248 = 55.85\) (thousand dollars).
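If you'd like to check this sort of arithmetic by machine, here is a short Python sketch (my own, not part of the original exchange) that builds the midpoints from the class boundaries and carries out the calculation:

```python
# Grouped-data mean: treat every purchaser in a class as earning the class midpoint.
classes = [(20, 30), (30, 40), (40, 50), (50, 60),
           (60, 70), (70, 80), (80, 90), (90, 100)]   # class boundaries, in $1000s
freqs = [50, 20, 31, 39, 35, 30, 25, 18]              # purchasers per class

midpoints = [(lo + hi) / 2 for lo, hi in classes]     # 25, 35, ..., 95
n = sum(freqs)                                        # total purchasers
total = sum(f * x for f, x in zip(freqs, midpoints))  # estimated total income
mean = total / n                                      # estimated mean income
```

This reproduces the numbers in the table: 248 purchasers, an estimated total of 13,850 (thousand dollars), and a mean near 55.85.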

Standard deviation

To find the variance, you should create another column in which you are squaring the midpoint incomes before multiplying by the number of people. Add up those numbers, and divide, as before, by the total number of people to obtain the <mean squared income>. This is not the same as the <mean income> squared - that quantity is just the number you calculated at first, multiplied by itself. In fact: you can subtract the <mean income> squared from the <mean squared income> to give the variance of the income distribution.

Here is our table with two new columns, the square of the midpoint, x, and the product with the number, \(Nx^2\):

  Income (*$1000)  Midpoint(x)  Number   N*x   x^2   N*x^2
  ---------------  -----------  ------  ----   ---  ------
    20 - 29.99         25         50    1250   625   31250
    30 - 39.99         35         20     700  1225   24500
    40 - 49.99         45         31    1395  2025   62775
    50 - 59.99         55         39    2145  3025  117975
    60 - 69.99         65         35    2275  4225  147875
    70 - 79.99         75         30    2250  5625  168750
    80 - 89.99         85         25    2125  7225  180625
    90 - 99.99         95         18    1710  9025  162450
    ------                       ---   -----        ------
    Total:                       248   13850        896200

The mean squared income is \(896200\div 248 = 3613.71\).

The variance is then \(3613.71 - 3118.86 = 494.85\), where \(3118.86\) is the square of the unrounded mean, \((13850\div 248)^2\). (Squaring the rounded mean, \(55.85^2 = 3119.22\), would give the slightly different 494.49; it's best to keep full precision until the end.)

There are other formulas you may see, but this is the easiest.

The standard deviation is the square root of the variance.

I hope that gives you a start. Dive in, try it, and report back what you think you understand. Explain as much as you feel comfortable with, and a little more. We'll try to help with a "mid-course correction" if you're not getting the ideas 100%.

So our standard deviation is \(\sqrt{494.85} = 22.25\) (thousand dollars). Tony didn’t answer back.
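Before moving on, the whole shortcut calculation can be scripted in a few lines of Python (again my own check, not anything from the exchange), using the unrounded mean throughout:

```python
import math

midpoints = [25, 35, 45, 55, 65, 75, 85, 95]  # class midpoints ($1000s)
freqs = [50, 20, 31, 39, 35, 30, 25, 18]      # purchasers per class
n = sum(freqs)                                # 248 people

mean = sum(f * x for f, x in zip(freqs, midpoints)) / n            # 13850/248
mean_square = sum(f * x**2 for f, x in zip(freqs, midpoints)) / n  # 896200/248
variance = mean_square - mean**2   # shortcut formula: E[x^2] - (E[x])^2
std_dev = math.sqrt(variance)      # about 22.25 (thousand dollars)
```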

Grouped data (discrete)

In 2000, Laura asked a similar question:

Standard Deviation of Grouped Data

Interval (grouped) data:

  Interval (group)     Frequency
  ----------------    -----------
       37-46              19
       47-56              23
       57-66              27
       67-76              28

What is the standard deviation of the data?

I have another question. If I had an answer of 31.615 for the mean of a different data set, would I round it off to 32 or 31.62?

Doctor TWE answered in detail, with some small differences that make this worth going through:

I'll break this down by steps.

Step 1: Find the number of data points.
To find the number of data points, add up the values in the Frequency column of the table:

  Interval  Freq.
  --------  -----
    37-46     19
    47-56     23
    57-66     27
    67-76     28
             ---
              97

We’ll need this both for the mean and for the standard deviation.

Step 2: Find the midpoint of each interval range.
To find the midpoint, add the top and bottom of each interval range and divide by two. For example, the first interval range is 37 to 46, so the midpoint is:

     (37 + 46) / 2  =  83 / 2 = 41.5

Do this for each interval range. Add a column to your table for this (I'll put it between the Interval and Frequency columns):

  Interval  Midpt.  Freq.
  --------  ------  -----
    37-46    41.5     19
    47-56      :      23
    57-66      :      27
    67-76      :      28
                     ---
                      97

Notice one difference so far from the problem above: the class intervals this time are given as sets of discrete values, rather than as continuous sets, so the highest value of the first is 46, not “just before 47”. The data are taken to be integers. As a result, the midpoint of “37 through 46” is the average of the first and last values.

The first and last discrete values in a class are called “class limits”, as opposed to the “class boundaries” we had above. This will come back to haunt us below!

I’ll be leaving the rest of the work “as an exercise for the reader” this time, with the answer at the end.

Mean

Step 3: Find the estimated sum of the data.
To find the sum, multiply the midpoint of each interval range by the frequency of that interval range. For example, the midpoint of the first interval range is 41.5 and the frequency is 19, so the sum is:

  41.5 * 19 = 788.5

Do this for each interval range. Add another column to your table for this (I'll put it after the Frequency column), then find the sum of that column (I'll just call this S):

  Interval  Midpt.  Freq.   Sum
  --------  ------  -----  -----
    37-46    41.5     19   788.5
    47-56      :      23      :
    57-66      :      27      :
    67-76      :      28      :
                     ---   -----
                      97     S

This S is called the estimated sum, because it is based on the assumption that every value in a class is equal to the midpoint, which is almost certainly not true. The sum will be valid if the average of the values in each class is equal to the midpoint; that is probably not exactly true, but may well be a good approximation.

Step 4: Find the estimated mean (or "average") of the data.

Divide the sum of the data (S, found in step 3) by the number of data points (found in step 1). In our example,

  Mean = S / 97

Again, we don’t know the exact mean because we don’t have the exact data; but this is the best we can do.

Standard deviation

Step 5: Find the squares of the midpoints of each interval range.
For each interval range, find the square of the midpoint. Add another column to your table for this (I'll put it after the Sum column). For example, the midpoint of the first interval range is 41.5, so the square is:

  41.5^2 = 1722.25

Do this for each interval range:

  Interval  Midpt.  Freq.   Sum   Midpt^2
  --------  ------  -----  -----  -------
    37-46    41.5     19   788.5  1722.25
    47-56      :      23      :       :
    57-66      :      27      :       :
    67-76      :      28      :       :
                     ---   ----- 
                      97     S

This is just what we did before.

Step 6: Find the estimated sum-of-the-squares of the data.
To find the sum-of-the-squares, multiply the square of the midpoint of each interval range by the frequency of that interval range. For example, the square of the midpoint of the first interval range is 1722.25 and the frequency is 19, so the sum-of-the-squares is:

     1722.25 * 19 = 32722.75

Do this for each interval range. Add another column to your table for this (I'll put it after the Midpt^2 column), then find the sum of that column (I'll just call this S2):

  Interval  Midpt.  Freq.   Sum   Midpt^2  Sum-Sqrs
  --------  ------  -----  -----  -------  --------
    37-46    41.5     19   788.5  1722.25  32722.75
    47-56      :      23      :       :         :
    57-66      :      27      :       :         :
    67-76      :      28      :       :         :
                     ---   -----           --------
                      97     S                S2

It’s important to note that this is not the sum of the squares of the midpoints themselves, but of all the data — as if we had 19 41.5’s, and 23 51.5’s, and so on.

Step 7: Find the estimated mean square of the data.
Divide the sum-of-the-squares of the data (S2, found in step 6) by the number of data points (found in step 1). In our example,

  Mean square = S2 / 97

We have, in effect, added 97 squares, and then divided by the count, giving us an average square for all the data.

Step 8: Find the estimated variance and standard deviation of the data.
To find the variance, square the mean (from step 4), then subtract it from the mean square. Note that the mean square and the square of the mean are not the same!

  Var = (Mean square) - (Mean)^2

To find the standard deviation, take the square root of the variance.

  StDev = sqrt(Var)

Note that these values are estimates, because with grouped data, you don't have the exact figures to work with. Your means, squares, variance and standard deviation are all based on estimations of the actual data.

Let’s finish the work:

  Interval  Midpt.  Freq.   Sum    Midpt^2   Sum-Sqrs
  --------  ------  -----  ------  -------   --------
    37-46    41.5     19    788.5  1722.25   32722.75
    47-56    51.5     23   1184.5  2652.25   61001.75
    57-66    61.5     27   1660.5  3782.25   102120.8
    67-76    71.5     28   2002.0  5112.25   143143.0
                     ---   ------            --------
                      97   5635.5            338988.3

So the mean is \(\frac{5635.5}{97} = 58.098\approx 58.1\), the mean square is \(\frac{338988.3}{97} = 3494.72\), the variance is \(3494.72-58.098^2 \approx 119.35\) (keeping full precision in the intermediate values), and the standard deviation is \(\sqrt{119.35} = 10.9\).
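As a check on the finished table, here is a Python sketch (mine, not part of the answer); since the class limits are discrete, the midpoint of 37-46 is simply \((37+46)\div 2\):

```python
import math

limits = [(37, 46), (47, 56), (57, 66), (67, 76)]  # class limits (discrete data)
freqs = [19, 23, 27, 28]                           # frequency per class

mids = [(lo + hi) / 2 for lo, hi in limits]        # 41.5, 51.5, 61.5, 71.5
n = sum(freqs)                                     # 97 data points
mean = sum(f * m for f, m in zip(freqs, mids)) / n            # about 58.098
mean_square = sum(f * m * m for f, m in zip(freqs, mids)) / n # about 3494.72
std_dev = math.sqrt(mean_square - mean ** 2)                  # about 10.9
```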

Rounding

As for the final question about rounding:

That depends on the accuracy and precision of the original data. In some scientific fields, there are very specific rules for determining the number of significant figures to leave in an answer, and they can get quite complicated. As a general rule, your final answer should have the same precision (i.e. the same number of decimal places) as the LEAST precise data point. So, for example, if I had the data set:

  16.725, 31.0625, 24.5, 22.50, 19.75

I'd compute the mean as:

  (16.725 + 31.0625 + 24.5 + 22.50 + 19.75) / 5 = 22.9075

Then I'd round it to 22.9 (NOT 22.91) because my least precise data point (the 24.5) had only one decimal place in it.

We’ve discussed significant figures elsewhere. Note that this advice doesn’t really fit the present situation, where we weren’t given any actual data values; my rounding to the nearest tenth seems appropriate.

Boundary issues

In 2009, we got the following question pertaining to the boundaries used in distributions:

Class Intervals in Statistics

I can't feel comfortable with the issue of having a negative boundary when we have data which is made up of purely positive numbers.  The best way to explain would be with an example:

The number of breakdowns in a machine, with the data grouped as 0-4, 5-9, 10-14, 15-19, etc.

The midpoints of each interval would be taken from the midpoints of the lowest and highest boundary.  No problem normally: the midpoint of the 5-9 class is the midpoint of 4.5 and 9.5, i.e. 7.

But what about 0-4?  Surely the lower boundary must be zero, giving a midpoint of 2.25?  However, textbooks tend to say it should be -0.5, giving a midpoint of 2.

I believe that if the data is essentially positive, the boundaries can't go below zero.  Trivial it may seem, but I hate ambiguity.

This requires some background. Oliver has been given discrete data as in our second example, where the values are all exactly integers (0, 1, 2, 3, or 4 in the first class, for example). But for some purposes, we want to treat the data as if it were continuous; this is done for histograms, and for particular procedures such as a “normal approximation”. (We don’t know for sure what Oliver’s context is.)

In this process, called “continuity correction”, the boundary between classes 0-4 and 5-9 is taken to be the average of 4 and 5, halfway between them. So the boundaries of the class with limits 5 and 9 are 4.5 and 9.5; and the midpoint of that class is \((4.5 + 9.5)\div 2 = 14\div 2 = 7\). (You can also get the midpoint directly from the limits, as I’ve shown above: \((5 + 9)\div 2 = 14\div 2 = 7\).)
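A tiny sketch (my own) shows that it makes no difference whether you take the midpoint from the class limits or from the continuity-corrected boundaries:

```python
# One class of Oliver's data: limits 5 and 9; continuity correction
# widens it by half a unit on each side, to boundaries 4.5 and 9.5.
lower_limit, upper_limit = 5, 9
lower_boundary = lower_limit - 0.5
upper_boundary = upper_limit + 0.5

mid_from_limits = (lower_limit + upper_limit) / 2            # (5 + 9)/2
mid_from_boundaries = (lower_boundary + upper_boundary) / 2  # (4.5 + 9.5)/2
```

Both give 7, because widening an interval equally on both sides leaves its center unchanged.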

That makes sense … until you look at the lower boundary of the first class, which is taken as -0.5. How can we use negative numbers in a problem about non-negative numbers (machine breakdowns)?

I replied:

What's happening here is sort of a pretend "boundary" being used to convert a discrete variable (the number of breakdowns, which must be a whole number) into a continuous variable (location on the x-axis of the histogram).

You want columns on a histogram whose MIDPOINTS represent the actual values. If you didn't have classes, there would be columns for 0, 1, 2, 3, 4, and so on; if the midpoint of a column is at 0, and the width is 1, then it must extend from -0.5 to +0.5:

    +-+
    | |
    | +-+
    | | | +-+
    | | +-+ |
    | | | | |
    | | | | +-+
    | | | | | |
  ===+=+=+=+=+=+=...
     0 1 2 3 4 5

Recall that a bar graph has bars representing counts of discrete things, and the labels just name that thing, such as “1”. But in a histogram, a bar’s width represents an interval of values containing the data. A bar centered around the number 0, with width 1, will naturally extend 0.5 on either side of the 0.

With classes, you will have one bar representing the entire class:

    +---------+
    |         |
    |         |
    |         |
    |         |
    |         |
  ===+=+=+=+=+=+=...
     0 1 2 3 4 5

It should still cover the same interval on the axis, so it goes from -0.5 to 4.5; its midpoint is (-0.5 + 4.5)/2 = 2.

That's just a formality, and allows us to pretend that any value, not just whole numbers, is allowed.  Note that the midpoint is the same as what it is if you ignore all this and just take the actual discrete values: (0+1+2+3+4)/5 = 2; or if you just treat 0 and 4 as the endpoints (leaving a gap of 1 between bars, which is a no-no): (0+4)/2 = 2.

It’s also worth noting that there are 5 numbers in this class (0, 1, 2, 3, 4), and the width of the bar is just what it ought to be: \(4.5 - (-0.5) = 5\). Without the halves on each end, this would not be true.

So you'll never really get a count of -0.5, any more than you'll get a 4.5; these boundaries are equally fictitious!  And if you prefer, you never really have to mention -0.5 in your calculations.  But it allows us to have a histogram like

    +---------+
    |         |
    |         +---------+
    |         |         +--
    |         |         |
    |         |         |
  ===+=+=+=+=+=+=+=+=+=+=+=...
     0 1 2 3 4 5 6 7 8 9 10

that uniformly covers the axis, rather than

     +-------+
     |       |
     |       | +-------+
     |       | |       | +-
     |       | |       | |
     |       | |       | |
  ===+=+=+=+=+=+=+=+=+=+=+=...
     0 1 2 3 4 5 6 7 8 9 10

where there are gaps, and the endpoints teeter on the edge of their bars.

Oliver replied:

Thank you for your reply, I'm happier with this now, as you mention that it's just a formality to help us create histograms and also the midpoint of the actual discrete values is the same.  I mistakenly thought before that including the -0.5 gave a negative bias, but now it's clearer.  Much appreciated!

The negative bias would mean that everything is shifted over to the left; but in each bar, we have shifted both to the left and to the right, widening the bar. There is no bias.

Do you have to use the midpoint?

We’ll close with a question from 2016, when a teacher asked about a student’s unusual method for finding a standard deviation:

Standard Deviation, Non-Standard Definition

In the formula for standard deviation, we always use 'x' for the midpoint of the group. But can 'x' represent the upper boundary of the group?

This comes from a test question that asked my students to find the standard deviation of grouped data. I wrote out my own steps, with x representing the midpoint of each group, and got 10.49 kg. One of my students used 'x' to represent the upper boundary of each group -- and she got 10.49 kg, too.

We used the midpoint in our calculations above; but using the upper boundaries seems to give the same result. Is there something funny here?

I answered:

The student is wrong, of course, because you need to use the right definition. But we can see why, as long as she is consistent, it turns out not to matter, as far as the standard deviation is concerned.

First, note that any attempt to calculate statistics from grouped data is just an approximation -- not the real thing. We don't know the actual data, so we can't find the real mean or standard deviation. We are PRETENDING that all the data in a group (also called a class or a bin) are equal; and we are finding the mean and standard deviation of THOSE values, not the real data. But using the midpoint makes sense, because -- as the name suggests -- it is likely to be in the middle of whatever the actual values are in each class, if they are uniformly distributed over the interval, and therefore will be close to the actual value. So this is how we define the mean and standard deviation in this situation.

So the midpoint is appropriate in the definitions (of mean and of standard deviation), because it is likely to be a good approximation of the data.

But if we take the midpoint-based fake data and replace everything with upper boundaries, we are simply adding half the width of a class, \(w/2\), to every (fake) data value. The result is that the mean will also be \(w/2\) higher than the mean obtained from the midpoints; and the standard deviation will be exactly the same, because it is based only on the deviations from the mean, which are unchanged. This fact is something students should know (eventually, at least).

So the student’s work will result in the wrong mean, but the right standard deviation, because the latter is all about differences (deviations), which don’t change. The student has biased her work to the right, but that bias doesn’t affect the standard deviation, which measures only the spread, not the location, of the data. (This is easier to see in the other formula for standard deviation that we haven’t looked at here …)

I would expect this student to get the mean wrong but the standard deviation correct. If she got the mean right, I'd be surprised -- and I'd want to ask whether she did what she did intentionally, getting the mean using the midpoints but the s.d. using upper boundaries. 

Maybe she knew what she was doing!

I suspect not, though. We never heard back to be sure.

Next time, I’ll look at a different sort of boundary issue.
