Finding the Mode of Grouped Data

The mode of a list of data values is simply the most common value (or values … if any). When data is grouped (binned) as in a histogram, we normally talk only about the modal class (the class, or group, with the greatest frequency), because we don’t know the individual values. But some sources teach a formula for finding (actually just estimating) the mode. We’ve had a number of questions about that formula.

The formula

I had never heard of such a formula until 2007, when a question was asked about applying it in a special case. That answer wasn’t archived, but when we got another question about it a year later, it was time to publish what I had figured out. Here is the 2008 question, from Saptarshi:

Different Formulas for Calculating Mode

I am a M.B.A student. Our teacher tells a formula to find out mode, that is Z=L1+(F1-F0)/(2F1-F0-F2)*i

where: L1 = lower limit of modal class
       F1 = modal class frequency.
       F2 = just after the modal class frequency.
       F0 = just previous the modal class frequency.
        i = class interval.
        Z = the mode value.

But I saw in most of cases the highest frequency is the mode. They don't use that formula. (I saw that when searching about mode in google). So why we need that formula? Can you please explain me.

I think he is saying that whereas he was taught this formula for the mode, most sources he found online do as I have usually seen, identifying only the class with the greatest frequency as the mode (actually the modal class). So, why was he taught this formula, and what does it mean?

The formula, which I now find more easily around the Web than I could back then, takes several forms. His, in more readable format, is $$Z = L_1 + \frac{F_1 - F_0}{2F_1 - F_0 - F_2}\cdot i.$$ The form we had previously been asked about was a little different: $$Z = L_1 + \frac{d_1}{d_1 + d_2}\cdot i,$$ where \(d_1\) and \(d_2\) are the differences between the frequency of the modal class and those of its nearest neighbors.

I started this answer by stating what the formula does, and showing the two formulas to be equivalent:

This formula gives a linear interpolation to estimate the actual value of the mode from grouped data; otherwise, all you really know is the modal class (which is sufficient for many purposes).

Your formula can be written differently if we take

  d1 = F1 - F0  (difference between modal class and previous class)
  d2 = F1 - F2  (difference between modal class and next class)

Then d1 + d2 = (F1 - F0) + (F1 - F2) = 2F1 - F0 - F2, so the formula is

  Z = L1 + d1/(d1 + d2) * i
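To check that the two forms agree numerically, here is a minimal Python sketch. The function names and the sample frequencies (a modal class starting at 20, with frequency 12 and neighbors 7 and 10) are made up for illustration; they do not come from any of the questions.

```python
def mode_from_frequencies(L1, F0, F1, F2, i):
    """First form: Z = L1 + (F1 - F0)/(2*F1 - F0 - F2) * i."""
    return L1 + (F1 - F0) / (2 * F1 - F0 - F2) * i


def mode_from_differences(L1, d1, d2, i):
    """Second form: Z = L1 + d1/(d1 + d2) * i."""
    return L1 + d1 / (d1 + d2) * i


# Hypothetical values, chosen only to exercise both forms.
L1, F0, F1, F2, i = 20, 7, 12, 10, 10

z_first = mode_from_frequencies(L1, F0, F1, F2, i)
z_second = mode_from_differences(L1, F1 - F0, F1 - F2, i)

print(z_first, z_second)  # both forms give the same estimate, 20 + 50/7
```

Both calls reduce to the same fraction, 5/7 of the class width, which is all the algebra above says.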

I have never found an explanation of the formula in a mathematical source that explains its proper derivation and the conditions under which it is valid; there are several sites that explain it after-the-fact as I will do below, but most sources I find are at a basic level where they just state the formula and tell students to use it. Most, in fact, just state that it gives the mode, whereas, as stated above, it is really only a guess — an estimate of what the actual mode might be, based on the shape of the histogram. We don’t really know how the data are distributed within any of the classes, so it is impossible to know the actual mode; it may not even be in the modal class. On the other hand, the actual mode may just reflect that some random data points happen to be identical; a number based on the overall shape may really be more meaningful! So this is a valid concept, at least in some situations.

One source that gives this formula with a proper description is

Math Is Fun: Mean, Median and Mode from Grouped Frequencies

which says, under “Estimating the Mode from Grouped Data”,

We can easily find the modal group (the group with the highest frequency), which is 61 – 65.

We can say “the modal group is 61 – 65”.

But the actual Mode may not even be in that group! Or there may be more than one mode. Without the raw data we don’t really know.

But, we can estimate the Mode using the following formula:
Estimated Mode = \(L + \frac{f_m - f_{m-1}}{(f_m - f_{m-1}) + (f_m - f_{m+1})} \times w\)
where:

  • \(L\) is the lower class boundary of the modal group
  • \(f_{m-1}\) is the frequency of the group before the modal group
  • \(f_m\) is the frequency of the modal group
  • \(f_{m+1}\) is the frequency of the group after the modal group
  • \(w\) is the group width

I gave some references, and then quoted what I had said in answering the 2007 question:

The formula these sites give, with definitions of the variables, is (using the second site's version):

  When data are already grouped in a frequency
  distribution, we can assume that the mode is
  located in the class with the most items. In
  order to determine a single value for the
  mode from this modal class, we use

    mode = LBMo + [d1 /(d1+d2)] (Width)

  where

    LBMo = lower boundary of the modal class
    Width = width of the modal class interval
    d1 = frequency of the modal class minus
         the frequency of the class directly below it
    d2 = frequency of the modal class minus
         the frequency of the class directly above it

Note that d1 and d2 relate to the classes on the left and on the right in the histogram.  If there is no class on the left, then you can imagine a class with frequency zero.  Then the formula applies easily.

Note that this source rightly said the formula only gives “a single value for the mode”, not “the actual value of the mode”.

My last paragraph above dealt with the issue the earlier questioner had been asking about.

The purpose of this formula is to identify one value within the modal class that seems likely to be the peak of the curve if you smoothed out the histogram.  It does this by taking the value within the interval whose distance from the class on either side is proportional to how much less the frequency is on either side.   You can see this by rewriting the formula:

  mode - L1     d1
  --------- = -------
    Width     d1 + d2

That is, the distance from the lower bound (left end) of the modal class, as a fraction of the width of the modal class, is the ratio of the left difference to the sum of the differences.

In thinking about this relationship, I saw a graphical meaning to the formula (which I now see on various other sites; I’m sure I’m not the first to have seen it):

There is a simple geometrical way you could find this point.  Just draw lines from the top corners of the modal bar to the near corners of the neighboring bars, and the mode estimate lies at the intersection:

            +---------+
            |  \    / |d2
          d1|     X   |
            |   / :   +---------+
            | /   :   |         |
  +---------+     :   |         |
  |         |     :   |         |
  |         |     :   |         |
  |         |     :   |         |
  |         |     :   |         |
  +---------+-----:---+---------+
            L1   mode
            |<------->|
               width

This puts the estimated mode closer to the higher neighboring bar, which makes sense. (I’ll have more to say about that below.) If you’re not sure how this relates to the proportion I wrote, look for a pair of similar triangles …
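For anyone who wants the similar triangles spelled out, here is one way to do it, as a sketch. Call the crossing point \(X\), at horizontal position \(x\), and write \(w\) for the class width (my notation here, not part of the original answer). The triangle on the left has the vertical side of length \(d_1\) as its base and \(X\) as its apex; the triangle on the right has the vertical side of length \(d_2\) as its base and the same apex. Their bases are parallel and they share vertical angles at \(X\), so they are similar, and their heights (measured horizontally) are in the same ratio as their bases:

$$\frac{x - L_1}{(L_1 + w) - x} = \frac{d_1}{d_2}.$$

Cross-multiplying gives \(d_2(x - L_1) = d_1 w - d_1(x - L_1)\), so

$$x - L_1 = \frac{d_1}{d_1 + d_2}\cdot w, \qquad x = L_1 + \frac{d_1}{d_1 + d_2}\cdot w,$$

which is the formula.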

I closed with an example (again, quoted from my response a year earlier, and using that writer’s notation, where “85<91” meant the six numbers starting at 85, and less than 91):

For an example, take these classes:

  85<91          10
  91<97           8
  97<103          3
  103<109         8
  109<115         0
  115<121         7

The modal class is 85<91.

  LBMo = 85
  width = 6
  d1 = 10 - 0 = 10 (since the frequency on the left is 0)
  d2 = 10 - 8 = 2  (since the frequency on the right is 8)

  mode = LBMo + [d1 /(d1+d2)] (Width)
       = 85 + (10/12)(6)
       = 85 + 5
       = 90

This is 5 from the left and 1 from the right, a ratio of 5:1, while the differences in frequency are 10:2.
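The same calculation can be sketched in Python, including the rule that a missing neighbor counts as a class with frequency zero. The function name `estimate_mode` and its argument layout are my own choices for illustration, and the sketch assumes equal class widths and a single modal class.

```python
def estimate_mode(lower_bounds, freqs, width):
    """Estimate the mode of grouped data by linear interpolation.

    Assumes equal-width classes and one class of greatest frequency.
    A missing neighbor (first or last class) is treated as frequency 0.
    """
    m = max(range(len(freqs)), key=lambda k: freqs[k])  # index of modal class
    left = freqs[m - 1] if m > 0 else 0
    right = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = freqs[m] - left
    d2 = freqs[m] - right
    return lower_bounds[m] + d1 / (d1 + d2) * width


# Classes 85<91, 91<97, ..., 115<121 from the example above.
lower_bounds = [85, 91, 97, 103, 109, 115]
freqs = [10, 8, 3, 8, 0, 7]

mode = estimate_mode(lower_bounds, freqs, width=6)
print(mode)  # expect 90, up to floating-point rounding
```

Here the modal class is the first one, so `left` falls back to 0 and d1 is the full frequency, 10.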

The “mode” depends on the classes

The next question about this formula was in 2015, from Gaurav:

Mode's Fickle Formula?

The formula for mode is not telling me the actual mode. In fact, after grouping data, I have found many situations where the mode changes.

For example, given these data:

   1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4

The mode is 1. 

But after grouping data, as below, the mode becomes approximately 3.3:

   CLASS       FREQUENCY    
    1-3            5
    3-5            6

Why does the mode of data change like this?

It appears that Gaurav had not been taught that the formula gives only a guess at the mode, and can’t be expected to give the actual mode, since it doesn’t have access to the actual data. But the question provided a good opportunity to examine more closely what the formula actually does. I replied:

The formula you have presumably been given for the mode of grouped data does not necessarily give the actual mode. Rather, it gives you a guess that is considered reasonable under some conditions. 

When you group data, you lose information, so you should expect not to be able to recover detail using any formula. 

In other words, the mode didn't change; you just guessed the mode from insufficient data.

I don't actually know of any theoretical basis for the formula that would make it reasonable to expect it to be correct for some particular kind of data (e.g., approximately normal). But given the questions that we math doctors routinely see about this subject, it appears that it is commonly taught without explaining what the formula really is: an approximation, at best.

I gave a link to the answer above, to make sure we were talking about the same formula. Then I showed how the actual data provided (in the form of a “dot plot”) compare to the histogram:

Note that your data are not normally distributed, so it is not at all surprising that the formula would not work. Also, the actual data (*'s) and the grouped data (bars) look quite different:

        +---+
    +---+   |
    |*  |   |
    |*  |* *|
    |*  |* *|
    |* *|* *|
  --+---+---+--
     1 2 3 4

Looking at that, we see that the mode of the actual data is not even in the modal class; this is because the data are not smoothly distributed, so the grouping changes its character. (My guess is that the formula is considered valid, as I suggested, for normally distributed data; it would be at least reasonable for a smooth and symmetrical distribution.)

We should check his work with the formula. Using the formula in the first form I showed above, $$Z = L_1 + \frac{F_1 - F_0}{2F_1 - F_0 - F_2}\cdot i,$$ we have \(L_1 = 3, F_0 = 5, F_1 = 6, F_2 = 0, i = 2\), so $$Z = 3 + \frac{6 - 5}{2\cdot 6 - 5 - 0}\cdot 2 = 3 + \frac{1}{7}\cdot 2 \approx 3.29.$$ Here I took \(L_1\) to be 3, the lower class limit as stated in the first form I quoted above, rather than 2.5, the lower class boundary, as in most versions I have found, in order to get his answer. I think the latter is the proper definition of the variable; I hadn’t noticed this discrepancy until now.
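To see the two numbers side by side, here is a small Python sketch that finds the actual mode of Gaurav's list and then the formula's estimate from his two classes. I am assuming each class includes its lower limit but not its upper one (so 3 falls in the class 3-5), which is what reproduces his frequencies of 5 and 6, and I use the lower limit 3 for \(L_1\) as he evidently did. The variable names are mine.

```python
from collections import Counter

data = [1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 4]

# Actual mode: the most common value in the raw data.
actual_mode = Counter(data).most_common(1)[0][0]

# Grouped view: classes 1-3 and 3-5, lower limit included, upper excluded.
classes = [(1, 3), (3, 5)]
freqs = [sum(lo <= x < hi for x in data) for lo, hi in classes]

# Formula estimate, treating a missing neighbor as frequency 0.
m = freqs.index(max(freqs))
left = freqs[m - 1] if m > 0 else 0
right = freqs[m + 1] if m < len(freqs) - 1 else 0
d1, d2 = freqs[m] - left, freqs[m] - right
lower_limit, upper_limit = classes[m]
estimated_mode = lower_limit + d1 / (d1 + d2) * (upper_limit - lower_limit)

print(actual_mode)               # 1
print(freqs)                     # [5, 6]
print(round(estimated_mode, 2))  # 3.29
```

The raw data and the grouped data simply answer different questions: the first counts repeated values, the second only sees two bars.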

Gaurav asked another question:

Why would the mode of grouped data depend on the frequency of pre- and post-modal classes?

This is essentially asking for a deeper explanation of the formula. I replied:

The page I referred you to explains the formula as well as I can. The basic idea is that if you have data that looks like a normal distribution (one symmetrical hump), but group the data, the classes on either side would be asymmetrical if the actual mode is not centered in the modal class; so looking at the adjacent classes can help estimate where the mode would be within the class. 

Here are two examples:

       symmetrical             asymmetrical

            |                         |
          +-*-+                    +--*+
          *   *                    |*  |*
         *|   |*                   *   +-*-+
      +-*-+   +-*-+               *|   |  *|
      *   |   |   *            +*--+   |   +*--+
  +*--+   |   |   +--*+    +-*-+   |   |   |   *
  +---+---+---+---+---+    +---+---+---+---+---+

The symmetrical histogram should have its mode in the middle of the modal class. The histogram on the right -- with a higher bar on the right of the modal class -- should have its mode closer to the higher side. The formula does this in the simplest possible way.

I have made a histogram by binning the standard normal distribution in various ways, and found that the formula does give the mode quite accurately in that case. When I did the same for a triangular distribution, it was less accurate.
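A check of this kind can be sketched in Python. This is not the calculation I originally did, just one way to set it up: the bin probabilities of the standard normal distribution are computed exactly from its cumulative distribution function (via `math.erf`), the bins are shifted by an arbitrary offset so that the true mode, 0, is not centered in the modal class, and the formula is applied. The function names, the bin width of 1, and the offsets tried are all assumptions made for illustration.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def estimate_mode(edges, freqs):
    """Apply the interpolation formula to equal-width bins."""
    m = max(range(len(freqs)), key=lambda k: freqs[k])
    left = freqs[m - 1] if m > 0 else 0.0
    right = freqs[m + 1] if m < len(freqs) - 1 else 0.0
    d1, d2 = freqs[m] - left, freqs[m] - right
    return edges[m] + d1 / (d1 + d2) * (edges[m + 1] - edges[m])


def binned_normal_estimate(offset, width=1.0, half_range=6):
    """Bin the standard normal with edges at offset + k*width and estimate its mode."""
    edges = [offset + k * width for k in range(-half_range, half_range + 1)]
    freqs = [normal_cdf(b) - normal_cdf(a) for a, b in zip(edges, edges[1:])]
    return estimate_mode(edges, freqs)


# The true mode is 0; see how far off the estimate is for several bin offsets.
for offset in (0.0, 0.15, 0.3, 0.5, 0.8):
    print(offset, round(binned_normal_estimate(offset), 4))
```

With these settings the estimates stay within a few hundredths of 0, which is small compared with the bin width of 1.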

What if there are two modal classes?

Here is a question from 2016:

Breaking the Mode

How do you find the mode of this grouped data?

    data    freq
   ------------- 
   10-14      5
   15-19     12
   20-24     12
   25-29     10
   30-34      4
   
I know the mode formula:

      Mo = L + (d1/(d1 + d2))*width

I calculated its parts like this:

       L = 14.5

      d1 = 12 - 5
         = 7

      d2 = 12 - 10 
         = 2

   width = 24.5 - 14.5 
         = 10

But I'm confused about the last two. Should d2 = 12 - 12 = 0? Should the width be 5?

From there, I went on to determine

    mode = 14.5 + 7/(7 + 2)*10 
         = 14.5 + 7.8
         = 22.3

Is my work true?

The answer seems reasonable (it is at least within a modal class). But does the formula work when the “modal class” is double-wide?

First, we have to keep in mind that we don’t even know what it would mean for an answer to be correct, since we don’t know the actual data! But I answered:

The formula you are using does not really tell you "the mode"; it just makes a reasonable estimate of where the mode might be if the underlying distribution is, say, approximately normal. Since it is not exact in the first place, it probably doesn't matter much how you apply it in special cases. If you have been taught the formula without any further explanation, then you can't be expected to follow any particular rules for this case.

I have never found a source for this formula that explains its theoretical basis, or the conditions under which it should be used, or how it applies in unusual cases (which should be an inference from the theory, if there were one). I've explained what I can guess from the formula, and from what sources I do find, here:

  Different Formulas for Calculating Mode
  http://mathforum.org/library/drmath/view/72977.html 

This explanation for it assumes generally that each class has the same width, so it doesn't quite apply when the "modal class" has twice the width of the others, which is the way you are treating it.

I made a suggestion, to rework the classes so they all have the same width, which is that of the double modal class:

I would probably rework the data so that there are fewer (equal width) classes, and just one modal class:

    data    freq
    ------------
    5- 9      0     [added implied empty class]
   10-14      5
   15-19     12
   20-24     12
   25-29     10
   30-34      4

    data    freq
    ------------
    5-14      5     [combined classes in pairs]
   15-24     24
   25-34     14

The formula applies directly now:

      L = 14.5

     d1 = 24 - 5
        = 19

     d2 = 24 - 14
        = 10

  width = 24.5 - 14.5
        = 10

     Mo = L + (d1/(d1 + d2))*width
        = 14.5 + (19/(19 + 10))*10
        = 21.05

Again, that seems to fit a little better with the derivation, but I don't think it makes much difference, since there is really no "correct" mode anyway! Your answer is not necessarily a bad one.
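The reworking above can be sketched in Python: add the implied empty class, merge the classes in pairs, and apply the formula, then compare with the questioner's own double-wide calculation. The helper name `estimate_mode` and the list layout are mine; the numbers are the ones in the tables above.

```python
def estimate_mode(lower_boundary, f_prev, f_modal, f_next, width):
    """Mo = L + d1/(d1 + d2) * width, with d1 and d2 as defined above."""
    d1 = f_modal - f_prev
    d2 = f_modal - f_next
    return lower_boundary + d1 / (d1 + d2) * width


# Original classes 10-14 ... 30-34, with the implied empty class 5-9 in front.
freqs = [0, 5, 12, 12, 10, 4]

# Combine neighboring classes in pairs: 5-14, 15-24, 25-34.
merged = [freqs[k] + freqs[k + 1] for k in range(0, len(freqs), 2)]
print(merged)  # [5, 24, 14]

# Reworked classes: the modal class 15-24 has lower boundary 14.5 and width 10.
reworked = estimate_mode(14.5, merged[0], merged[1], merged[2], 10)

# The questioner's version: one double-wide modal class with frequency 12,
# compared with the original neighbors 5 and 10.
double_wide = estimate_mode(14.5, 5, 12, 10, 10)

print(round(reworked, 2), round(double_wide, 2))
```

The two estimates differ by a little over 1, which is small next to the class width, and neither can be called wrong without the raw data.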

If anyone reading this knows an original source for the formula that gives a solid foundation for it, rather than just an ad-hoc linear interpolation, I would love to know.

6 thoughts on “Finding the Mode of Grouped Data”

  1. Pingback: Finding the Median of Grouped Data – The Math Doctors

  2. I was wondering how to find the mode if the modal class is the very first or the very last class, because in these cases we do not have the frequencies of both its neighbours. I am in Class X and found the formula in my maths book in the ‘Statistics’ chapter.

    1. As I said in the post above, “If there is no class on the left, then you can imagine a class with frequency zero. Then the formula applies easily.” The same applies when there is no class on the right.

      This was, in fact, the context in which I first came across this formula; in the quoted answer, Different Formulas for Calculating Mode, I introduced my reference to an earlier discussion by saying, “I was asked about this formula a year ago, with specific reference to the case where the modal class is the first class. I had not seen the formula previously, but could see how it arose”.

      Many students seem to have the same question!

    2. In all of the following cases:
      1. The beginning class interval has the highest frequency
      2. The last class interval has the highest frequency
      3. Two or more classes have the same maximum frequency, i.e. bimodal or multimodal
      the correct way of finding the mode is by the grouping method. Here we create 6 columns of frequencies, with column 1 being the original frequencies. Column 2 is obtained by adding the frequencies in twos; column 3 by adding in twos, leaving out the first. Column 4 is obtained by adding in threes; column 5 by adding in threes, leaving out the first; column 6 by adding in threes, leaving out the first two. Now, starting from column 1, we take the maximum frequency in each column. Then we write down the values/classes that contributed to each of these maximum frequencies. The value/class that occurs the most times is the mode/modal class. For ungrouped data that value is the mode; for grouped data it is the modal class, to which we then apply the formula. Please refer to the mode section of S. C. Gupta and V. K. Kapoor for an example of this.

      1. Thanks for the reference. I don’t have access to Indian books, but have long been aware that many questions on this topic come from India, and have been interested in seeing how it is taught there. I was able to find this book, Fundamentals of Mathematical Statistics: a Modern Approach, available online, and read what is said in section 2.7. (The edition I found may not be identical with yours, however.)

        Your answer, unfortunately, is not directly relevant to the question, as it is a method for finding something they call a mode in the case of a discrete (ungrouped) frequency distribution, not for grouped data. And even in that case, it amounts to redefining “mode”, and appears to be merely a way to be able to identify some “most frequent” value in cases where there really is no mode. The fact that it is called “the method of grouping” can easily lead to confusion with grouped distributions.

        The authors then move on to continuous frequency distributions, giving the formula that is the focus of this post. I have long wanted to find a proper derivation of the formula, so when I saw the heading “Derivation of the mode formula (2-7)”, I was hopeful. But that turns out to be just an explanation of the formula, similar to my own, based on an unjustified assumption that the intersection of the two lines in the picture should be taken as the location of the mode. I am still hoping someone will refer me to an advanced treatment of the formula that will prove that it is the best approximation of the actual mode under suitable conditions, such as normality.

        I see no mention in the book of how to apply the formula in the cases that have been asked about (grouped data with no one modal class, or no neighboring class on one side), so that question, too, is still open. Applying the “method of grouping” to grouped data could, in fact, lead to nonsense, as the “modal class” you find might not be greater than those on either side, making the formula impossible to apply. The formula only makes sense when applied to a class that is actually greater than its neighbors.
