#### (An archive question of the week)

Last time we looked at a formula for approximating the **mode** of grouped data, which works well for normal distributions, though I have never seen an actual proof, or a statement of conditions under which it is appropriate. We have also received questions about a much better-known, and better-founded, formula to estimate the **median**. Here, it is possible to give a solid derivation, and to clearly state the assumptions on which it is based.

## The basics

Here is the initial question, from 2007:

Derivation of Linear Interpolation Median Formula

Median, m = L + [ (N/2 - F) / f ]C. How does this median formula come? My teacher did not show and proof how does this formula come. Therefore, I just substitute and blindly use the formula. Can you help me?

This formula is used to find the median in a group data with class interval. The median is the value of the data in the middle position of the set when the data is arranged in numerical order. The class where the middle position is located is called the median class and this is also the class where the median is located. This formula is used to find the median in a group data which is located in the median class.

Median, m = L + [ (N/2 - F) / f ]C

- L means lower boundary of the median class
- N means sum of frequencies
- F means cumulative frequency before the median class. Meaning that the class before the median class what is the frequency
- f means frequency of the median class
- C means the size of the median class

I have tried to use an ogive graph to understand, but I still did not get how did this formula come.

Daya recognized that the formula is related to the ogive (also called the Cumulative Distribution Function, or CDF), but wasn’t able to complete the derivation. The formula is, again, $$m = L + \left( \frac{\frac{N}{2} - F}{f}\right)C.$$ For a well-explained source, see

Math is Fun: Mean, Median and Mode from Grouped Frequencies,

which I referred to last time; this says, under *Estimating the Median from Grouped Data*,

Estimated Median = \(L + \frac{(n/2) - B}{G} \times w\)

where:

- L is the lower class boundary of the group containing the median
- n is the total number of values
- B is the cumulative frequency of the groups before the median group
- G is the frequency of the median group
- w is the group width

I answered with a statement of what the formula does, and a quick derivation:

This is a linear interpolation (on the ogive graph, as you suggested), which finds where the actual median WOULD be if you assume that the data are uniformly distributed within the median class.

One way to derive the formula is just to note that N/2 is the number of data values BELOW the median, so N/2 - F is the number of data values in the median class that are below the median. Therefore, (N/2 - F)/f is the fraction of values in the median class that are below the median. This times C is that fraction of the class width; adding L gives the value at that position in the class.

In terms of the ogive (cumulative distribution), let's first just plot the actual cumulative frequency before each class, something like

    N+                             *
     |                        *
     |                   *
     |
     |              * ---
     + . . . . . .     ^
     |                 |f
     |                 v
    F|         *      ---
     |    *
     *----+----+----+----+----+----+
               L
               |<-->|
                 C

We don't know where the actual data points are, but if they are uniformly distributed within each class, we could connect the points above with straight lines. Your formula gives the x coordinate corresponding to y=N/2. See if you can derive it this way.
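For readers who want to check their work, the interpolation step can be written out as follows, using only the symbols already defined (L, C, F, f, N) and assuming the ogive is a straight line across the median class. That segment joins the points \((L, F)\) and \((L + C, F + f)\), so on it

$$y = F + \frac{f}{C}\,(x - L).$$

Setting \(y = \frac{N}{2}\) and solving for \(x = m\) gives

$$\frac{N}{2} - F = \frac{f}{C}\,(m - L) \quad\Longrightarrow\quad m = L + \left( \frac{\frac{N}{2} - F}{f}\right)C,$$

which is the formula as stated.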

Our formula gives the x-coordinate of the point on the graph where y = N/2. Here is a better version of the graph:

## A specific example for clarification

In 2016, another student, Pramod, asked about the same formula, giving his own derivation that led to a slightly different formula:

Given this frequency distribution table:

    60-70      4
    70-80      5
    80-90      6
    90-100     7
    ------------------
    n = 22

I used the following rationale to calculate the median.

    Median data entry = (22 + 1)/2
                      = 11.5th entry from first
                      = 11.5 - 9
                      = 2.5th of 6 entries through 80-90

Now, since I don't know the 6 data entries of median class, I assumed that they were distributed equally through 80 to 90 (10 class width):

    81.667, 83.333, 85, 86.667, 88.333, 90

I used these in the formula

    Median = L + {((n + 1)/2) - c.f.} * (h/f)

Here,

- L = lower limit of median class
- h = class width
- c.f. = cumulative frequency up to the preceding class
- f = frequency of median class
- n = total data entries/summation of frequencies

I got

    Median = 2.5th data entry
           = (83.333 + 85)/2
           = 84.1667

But in almost every statistics book I have ever studied, the formula for calculating median from a continuous frequency distribution table is given as

    Median = L + {(n/2) - c.f.} * (h/f)

I know very well that the median calculated from such data is not exact, since we know only the range of data entries -- not the actual data entries, themselves. But still, doesn't it make more sense to use my formula? Doesn't it give a more precise approximation? If you agree, why is the latter formula used in almost every textbook?

This was an excellent attempt, and just missed two details. I responded by first referring to the answer above (to which this question was later attached):

I discussed this formula for Daya, above, but I didn't go into the details of the derivation to confirm that that formula could not be improved upon.

I have a small problem with your example: you didn't clearly state how to interpret your classes. Let's take a closer look at your data.

    class    freq
    -----    ----
    60-70      4
    70-80      5
    80-90      6
    90-100     7
             ----
         n =  22

Which class is 70 in? I will assume that 80-90 means 80 <= x < 90, as is commonly done for continuous data; if the values are integers, then the class could also be described as 80-89 (inclusive), but then our estimate would have to be rounded to an integer, so we would not get a similar formula.

When classes are described in terms of integer values, the lowest and highest values in a class are called the class limits. But in a formula such as this, we need to treat the data as continuous, so we use, not these class limits, but the class boundaries, which are real numbers halfway between classes. Here, the lower boundary of the median class would be 79.5, which is 0.5 below the lower limit, 80. (Note that the word boundary is used in both statements of the formula above.)

I didn’t take this distinction into account in my answer to Pramod; and his work suggests that he is in fact assuming continuous (real number) data.

The principal error in Pramod’s derivation was including the lower limit (or boundary) of the next class in the median class:

If the 6 values in the class 80 <= x < 90 are evenly spaced across these 10 units, then they are spaced 10/6 = 1 2/3 units apart. I would center them like this:

     5/6   __5/3__   __5/3__   __5/3__   __5/3__   __5/3__   5/6
    /   \ /       \ /       \ /       \ /       \ /       \ /   \
         *         *    |    *         *         *         *
    +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    80    81    82    83    84    85    86    87    88    89    90

Therefore, the 2.5th value is 83 1/3 -- that is, 80 + 2*5/3, not 80 + 2.5*5/3.

The standard formula gives

    Median = 80 + [(22/2) - 9] * (10/6)
           = 80 + 2*5/3
           = 83 1/3

This agrees with my answer.
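The two placements can be compared numerically. The following is only a sketch: the helper name `centered_values` is made up for illustration, and exact fractions are used so that 83 1/3 is not blurred by rounding. It puts each of the 6 values at the center of its own sub-interval (the placement in the answer), and also reproduces Pramod's placement at the right-hand ends.

```python
from fractions import Fraction

def centered_values(lower, width, f):
    """Place f values evenly in [lower, lower + width),
    each at the center of its own sub-interval of length width/f."""
    step = Fraction(width, f)
    return [lower + step * (2 * k + 1) / 2 for k in range(f)]

# Placement used in the answer: 80 5/6, 82 1/2, 84 1/6, ...
values = centered_values(80, 10, 6)

# Pramod's placement: each value at the right end of its sub-interval,
# so the last one lands on 90.
step = Fraction(10, 6)
pramod_values = [80 + step * (k + 1) for k in range(6)]

# The "2.5th value" in the class is halfway between the 2nd and 3rd values.
centered_median = (values[1] + values[2]) / 2
pramod_median = (pramod_values[1] + pramod_values[2]) / 2

# The standard formula with L = 80, N = 22, F = 9, f = 6, C = 10.
formula_median = 80 + (Fraction(22, 2) - 9) / 6 * 10

print(centered_median, pramod_median, formula_median)
```

With the centered placement, the positional median and the standard formula both come out to 250/3 (that is, 83 1/3), while Pramod's placement gives 505/6 (about 84.1667).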

If I had used the class boundary assuming integer values, the median would be $$m = L + \left( \frac{\frac{N}{2} - F}{f}\right)C = 79.5 + \left( \frac{\frac{22}{2} - 9}{6}\right)\cdot 10 = 82 \frac{5}{6}.$$ Everything in my line graph below would be shifted left by 1/2.

In the page above, the implication is that we would use the continuous CDF (for your example) like this:

    n=22|                   *
        +                  /
        |                 /
        |               /
        |              * ---
        |            /    ^
      11| . . . . . .     |f=6
        +          /      v
     F=9|         *      ---
        |        /
        |      /
        |    *
        |  /
       0*----+----+----+----+
       60   70   80   90   100
                  |<-->|
                   C=10

Linear interpolation puts the median 2/6 of the way from 80 to 90, giving 83 1/3 again.
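As a check on the arithmetic, here is a minimal sketch of the same interpolation applied to a whole frequency table. The function name `grouped_median` and its argument layout (a list of class boundaries, one longer than the list of frequencies) are assumptions made for this illustration, not part of the original answers.

```python
from fractions import Fraction

def grouped_median(boundaries, freqs):
    """Estimate the median of grouped data by linear interpolation
    on the piecewise-linear ogive.

    boundaries: class boundaries in increasing order (len(freqs) + 1 entries)
    freqs:      frequency of each class
    """
    n = sum(freqs)
    half = Fraction(n, 2)   # N/2
    cum = 0                 # F: cumulative frequency before the current class
    for lower, upper, f in zip(boundaries, boundaries[1:], freqs):
        if f > 0 and cum + f >= half:
            # m = L + ((N/2 - F) / f) * C, with C = upper - lower
            return lower + (half - cum) / f * (upper - lower)
        cum += f
    raise ValueError("no data")

# Pramod's table, read as 60 <= x < 70, 70 <= x < 80, and so on.
continuous = grouped_median([60, 70, 80, 90, 100], [4, 5, 6, 7])

# The same frequencies with integer-data boundaries 59.5, 69.5, ...
half_unit = Fraction(1, 2)
shifted = grouped_median([b - half_unit for b in (60, 70, 80, 90, 100)],
                         [4, 5, 6, 7])

print(continuous, shifted)
```

The first call returns 250/3 (83 1/3), and the shifted boundaries return 497/6 (82 5/6), matching the two values worked out above.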

Here is a more accurate graph:

The graph vindicates the formula.

The difference between my first approach and yours is that I was a little more careful to distribute the values uniformly within the entire interval; whereas your last value is right at the end of the interval (and, I think, really in the next interval!). The fact that this results in the same answer obtained for a piecewise-linear CDF is encouraging.

Note, though, that if we really had integer data, we couldn’t uniformly distribute 6 values across 10 units; that’s another sense in which the formula is only approximate. It necessarily assumes a continuous distribution, in addition to the piecewise-linear CDF.

**g:** Good. Except the class size here is not 10 but eleven.

60-70 has a width of eleven numbers. This makes your median to be 83.2.

**Dave Peterson:** That would be true *if* the classes had been given as

60-70: 4

71-81: 5

82-92: 6

93-103: 7

That way, the first class would contain the 11 numbers 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, and the next would start with 71.

But that is not what Pramod said. What he *did* say would include 70 in two classes, 60-70 and 70-80, *if* it meant what you are assuming, namely a discrete distribution in which the range is given inclusively.

That’s why I said, “Which class is 70 in? I will assume that 80-90 means 80 <= x < 90, as is commonly done for continuous data.”

In order for the classes to make sense, we have to interpret them in the continuous sense, with 60 and 70 being class *boundaries* (division points *between* intervals), not class *limits* as you are taking them (lowest and highest values in the class). And that implies that the class width is indeed 10, just as Pramod said. Pramod’s error was a little more subtle than that, as I explained in the post, namely including both boundaries in a class, as if they were class limits.

So what I said was correct.

**Audrey:** Hi Sir, if 60-70 = 4, where is four from?

Thanks

**Dave Peterson:** That line is part of the problem as given, namely a frequency distribution. It means that there were 4 times when some quantity was between 60 and 70 (which I interpreted as meaning 60 ≤ x < 70). It does not mean that 60 - 70 = 4, of course. It's a notation that is read as "60 to 70". For an introduction to the concept, see here.

**Vijay Gupta:** Hello Sir, Suppose the following is given:

    class    freq
    -----    ----
    60-70      4
    70-80      6
    80-90      7
    90-100     3

Now I know that the median here is in the class 70-80, and I also know that the median would be the 10th value. In this case, would the frequency of the previous class interval still be considered – which is 4? (even though we know that the median value is the last value in the second class interval – 70-80)!

I mean, can I consider the frequency of 6 instead?

Please advise.

Thank you,

Vijay Gupta

**Dave Peterson:** My general answer to questions that ask, “Can I do this instead?” is, “Try it and find out!”

As I read this, the intervals are probably meant as continuous, the first one being 60 ≤ x < 70; if so, then 80 is actually the *first* value in the class starting with 80, not the *last* in the class before that. But you may be meaning something different.

There are 10 values below 80 and 10 values 80 or greater, so I would call the median 80, just by inspection. (With 20 values, I would take the median to be the 10.5th value, that is, the average of the 10th and 11th, not the 10th; since we don’t have access to the individual values, we can’t do that.)

Since 80 is “on the edge” between two classes, it could make sense to take either class as the “median class” in the formula.

If we take 70-80 as the median class, the formula gives 70 + [(20/2 - 4)/6]*10 = 80.

If we take 80-90 as the median class, the formula gives 80 + [(20/2 - 10)/7]*10 = 80.

So it doesn’t seem to make a difference. And if you look at my discussion of the derivation of the formula, you can see why.
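In the spirit of “Try it and find out!”, the two choices can be checked directly. This is a small sketch in which the function name `median_formula` is invented for the illustration; its parameters are the L, N, F, f, and C of the formula.

```python
from fractions import Fraction

def median_formula(L, N, F, f, C):
    """m = L + ((N/2 - F) / f) * C, computed exactly."""
    return L + (Fraction(N, 2) - F) / f * C

# Treat 70-80 as the median class: F = 4 values lie below 70.
lower_choice = median_formula(L=70, N=20, F=4, f=6, C=10)

# Treat 80-90 as the median class: F = 4 + 6 = 10 values lie below 80.
upper_choice = median_formula(L=80, N=20, F=10, f=7, C=10)

print(lower_choice, upper_choice)
```

Both calls give 80. The two line segments of the ogive meet at the point (80, 10), and N/2 = 10 is exactly the height of that shared endpoint, so either segment yields the same x-coordinate.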

**Vijay Gupta:** Thank you Sir, it is much clear now. Most grateful.

**Shusil Khanal:** thankyou sir.


**Patrick Omungo:** What happens when the median classes begin from zero and the median class when ranked falls at zero. Do you assume zero is the median. (Variable is number of trainings attended)

**Dave Peterson:** I’m not sure exactly what you mean by “the median class when ranked falls at zero”. But suppose that the median class is from 0 to 2, say, so that its midpoint is 1, and that its frequency is 16 (out of 30 in the dataset). Then the class boundaries are -1/2 to 2 1/2, so that L = -1/2, N = 30, F = 0, f = 16, and C = 3. The formula gives m = L + [ (N/2 - F) / f ] * C = -1/2 + [ (30/2 - 0) / 16 ] * 3 = 2.3125 (that is, 2 5/16). This is, of course, only an estimate of the true median, based on the assumption that these 16 people have values evenly distributed from -1/2 through 2 1/2. Since the values are actually 0, 1, and 2, the actual median could in principle be 0, 1, or 2, depending on the distribution.

Note that if the first class is the median class, then f has to be at least N/2 so that this one class will contain at least half the data. You would definitely prefer to use the raw data and find out how many actually are zero, because the classes are far too wide. If more than half of your people attended no training sessions, then the median is indeed zero.
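To see how far the estimate can be from the truth here, the sketch below compares the formula's value with the actual median of three raw datasets. All three datasets are hypothetical, invented only for illustration: each has 16 values in the class 0 to 2 and 14 larger values (arbitrarily set to 3), so each matches the grouped description above.

```python
from fractions import Fraction
from statistics import median

# Formula estimate with L = -1/2, N = 30, F = 0, f = 16, C = 3.
estimate = Fraction(-1, 2) + (Fraction(30, 2) - 0) / 16 * 3

# Hypothetical raw data, each with 16 values in {0, 1, 2} and 14 values of 3.
mostly_zero = [0] * 16 + [3] * 14
mostly_one = [0] * 5 + [1] * 11 + [3] * 14
spread_out = [0] * 10 + [1] * 4 + [2] * 2 + [3] * 14

# With 30 values, the median is the average of the 15th and 16th.
actual = [median(data) for data in (mostly_zero, mostly_one, spread_out)]

print(float(estimate), actual)
```

The estimate is 2.3125 in every case, while the actual medians of these made-up datasets are 0, 1, and 2 respectively, which is why the raw data is preferable when the classes are this wide.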