#### (An archive question of the week)

Last time we looked at a formula for approximating the **mode** of grouped data, which works well for normal distributions, though I have never seen an actual proof, or a statement of conditions under which it is appropriate. We have also received questions about a much better-known, and well-founded, formula to estimate the **median**. Here, it is possible to give a solid derivation, and to clearly state the assumptions on which it is based.

## The basics

Here is the initial question, from 2007:

Derivation of Linear Interpolation Median Formula

Median, m = L + [ (N/2 - F) / f ]C.

How does this median formula come? My teacher did not show and proof how does this formula come. Therefore, I just substitute and blindly use the formula. Can you help me?

This formula is used to find the median in a group data with class interval. The median is the value of the data in the middle position of the set when the data is arranged in numerical order. The class where the middle position is located is called the median class and this is also the class where the median is located. This formula is used to find the median in a group data which is located in the median class.

Median, m = L + [ (N/2 - F) / f ]C

- L means lower boundary of the median class
- N means sum of frequencies
- F means cumulative frequency before the median class. Meaning that the class before the median class what is the frequency
- f means frequency of the median class
- C means the size of the median class

I have tried to use an ogive graph to understand, but I still did not get how did this formula come.

Daya recognized that the formula is related to the ogive (also called the Cumulative Distribution Function, or CDF), but wasn’t able to complete the derivation. The formula is, again, $$m = L + \left( \frac{\frac{N}{2} - F}{f}\right)C.$$ For a well-explained source, see

Math is Fun: Mean, Median and Mode from Grouped Frequencies,

which I referred to last time; this says, under *Estimating the Median from Grouped Data*,

Estimated Median = \(L + \frac{(n/2) - B}{G} \times w\)

where:

- **L** is the lower class boundary of the group containing the median
- **n** is the total number of values
- **B** is the cumulative frequency of the groups before the median group
- **G** is the frequency of the median group
- **w** is the group width
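As a minimal sketch, the formula can be transcribed directly into Python. The function name `grouped_median` and its parameter names are illustrative choices, not taken from either source, and the numbers in the usage line are hypothetical (classes 0-10, 10-20, 20-30, 30-40 with frequencies 3, 5, 8, 4, so that N = 20 and the median class is 20-30). Exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

def grouped_median(lower_boundary, total, cum_before, class_freq, width):
    """Estimate the median of grouped data as L + ((N/2 - F) / f) * C.

    lower_boundary: L, lower boundary of the median class
    total:          N, sum of all frequencies (an integer)
    cum_before:     F, cumulative frequency before the median class
    class_freq:     f, frequency of the median class
    width:          C, width of the median class
    """
    # Fraction of the median class lying below the median
    fraction_into_class = (Fraction(total, 2) - cum_before) / class_freq
    return lower_boundary + fraction_into_class * width

# Hypothetical data: N = 20, median class 20-30 with F = 8 and f = 8
estimate = grouped_median(lower_boundary=20, total=20, cum_before=8,
                          class_freq=8, width=10)
print(float(estimate))  # 22.5
```

Here N/2 - F = 2 of the 8 values in the median class fall below the median, so the estimate sits one quarter of the way across the class.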

I answered with a statement of what the formula does, and a quick derivation:

This is a linear interpolation (on the ogive graph, as you suggested), which finds where the actual median WOULD be if you assume that the data are uniformly distributed within the median class.

One way to derive the formula is just to note that N/2 is the number of data values BELOW the median, so N/2 - F is the number of data values in the median class that are below the median. Therefore, (N/2 - F)/f is the fraction of values in the median class that are below the median. This times C is that fraction of the class width; adding L gives the value at that position in the class.

In terms of the ogive (cumulative distribution), let's first just plot the actual cumulative frequency before each class, something like

[Sketch: cumulative frequency plotted as points at the class boundaries, rising to N. The median class starts at x = L with cumulative frequency F, has width C, and rises by f; a dotted horizontal line marks the height N/2, between F and F + f.]

We don't know where the actual data points are, but if they are uniformly distributed within each class, we could connect the points above with straight lines. Your formula gives the x coordinate corresponding to y=N/2. See if you can derive it this way.
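For readers who want to check their own attempt, here is one way the suggested derivation might be written out. It assumes, as the answer does, that the ogive is a straight segment across the median class, running from \((L, F)\) to \((L + C, F + f)\); the coordinates \(x\) and \(y\) are just names for the data value and the cumulative frequency. The slope of that segment is \(f/C\), so

$$y - F = \frac{f}{C}\,(x - L).$$

Setting \(y = \frac{N}{2}\) and solving for \(x\) gives

$$x = L + \left( \frac{\frac{N}{2} - F}{f}\right)C,$$

which is the formula as stated.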

Our formula gives the x-coordinate of the point on the graph where y = N/2. Here is a better version of the graph:

## A specific example for clarification

In 2016, another student, Pramod, asked about the same formula, giving his own derivation that led to a slightly different formula:

Given this frequency distribution table:

| Class | Frequency |
|---|---|
| 60-70 | 4 |
| 70-80 | 5 |
| 80-90 | 6 |
| 90-100 | 7 |
| Total | n = 22 |

I used the following rationale to calculate the median.

Median data entry = (22 + 1)/2 = 11.5th entry from first = 11.5 - 9 = 2.5th of 6 entries through 80-90

Now, since I don't know the 6 data entries of median class, I assumed that they were distributed equally through 80 to 90 (10 class width):

81.667, 83.333, 85, 86.667, 88.333, 90

I used these in the formula

Median = L + {((n + 1)/2) - c.f.} * (h/f)

Here,

- L = lower limit of median class
- h = class width
- c.f. = cumulative frequency up to the preceding class
- f = frequency of median class
- n = total data entries/summation of frequencies

I got

Median = 2.5th data entry = (83.333 + 85)/2 = 84.1667

But in almost every statistics book I have ever studied, the formula for calculating median from a continuous frequency distribution table is given as

Median = L + {(n/2) - c.f.} * (h/f)

I know very well that the median calculated from such data is not exact, since we know only the range of data entries -- not the actual data entries, themselves. But still, doesn't it make more sense to use my formula? Doesn't it give a more precise approximation? If you agree, why is the latter formula used in almost every textbook?

This was an excellent attempt, and just missed two details. I responded by first referring to the answer above (to which this question was later attached):

I discussed this formula for Daya, above, but I didn't go into the details of the derivation to confirm that that formula could not be improved upon.

I have a small problem with your example: you didn't clearly state how to interpret your classes. Let's take a closer look at your data.

| Class | Frequency |
|---|---|
| 60-70 | 4 |
| 70-80 | 5 |
| 80-90 | 6 |
| 90-100 | 7 |
| Total | n = 22 |

Which class is 70 in?

I will assume that 80-90 means 80 <= x < 90, as is commonly done for continuous data; if the values are integers, then the class could also be described as 80-89 (inclusive), but then our estimate would have to be rounded to an integer, so we would not get a similar formula.

When classes are described in terms of integer values, the lowest and highest values in a class are called the class limits. But in a formula such as this, we need to treat the data as continuous, so we use, not these class limits, but the class boundaries, which are real numbers halfway between classes. Here, the lower boundary of the median class would be 79.5, which is 0.5 below the lower limit, 80. (Note that the word boundary is used in both statements of the formula above.)

I didn’t take this distinction into account in my answer to Pramod; and his work suggests that he is in fact assuming continuous (real number) data.

The principal error in Pramod’s derivation was including the lower limit (or boundary) of the next class in the median class:

If the 6 values in the class 80 <= x < 90 are evenly spaced across these 10 units, then they are spaced 10/6 = 1 2/3 units apart. I would center them like this:

[Diagram: a number line from 80 to 90 with six values marked at intervals of 5/3, leaving a gap of 5/6 at each end; a bar between the second and third values marks the median.]

Therefore, the 2.5th value is 83 1/3 -- that is, 80 + 2*5/3, not 80 + 2.5*5/3.

The standard formula gives

Median = 80 + [(22/2) - 9] * (10/6) = 80 + 2*5/3 = 83 1/3

This agrees with my answer.
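The two placements can be compared numerically. The sketch below is only an illustration: the helper names `centered_positions`, `endpoint_positions`, and `median_in_class` are made up for this purpose, and it assumes an even total count whose two middle values both fall in the median class, as they do in Pramod's data (the 11th and 12th of 22 values are the 2nd and 3rd in the class 80-90).

```python
from fractions import Fraction

def centered_positions(lower, width, freq):
    """Spread freq values evenly, each at the center of its own sub-interval."""
    step = Fraction(width, freq)
    return [lower + (k - Fraction(1, 2)) * step for k in range(1, freq + 1)]

def endpoint_positions(lower, width, freq):
    """Pramod's placement: each value at the right end of its sub-interval."""
    step = Fraction(width, freq)
    return [lower + k * step for k in range(1, freq + 1)]

def median_in_class(positions, n, cum_before):
    """Median of n values (n even), assuming both middle values lie in this class."""
    rank = n // 2 - cum_before          # rank in the class of the lower middle value
    return (positions[rank - 1] + positions[rank]) / 2

centered = centered_positions(80, 10, 6)   # 80 5/6, 82 1/2, 84 1/6, ...
shifted = endpoint_positions(80, 10, 6)    # 81 2/3, 83 1/3, 85, ..., 90

print(median_in_class(centered, 22, 9))    # 250/3, i.e. 83 1/3
print(median_in_class(shifted, 22, 9))     # 505/6, i.e. about 84.1667
```

The centered placement reproduces the standard formula's 83 1/3, while the endpoint placement reproduces Pramod's 84.1667; the two differ by half a step, 5/6.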

If I had used the class boundary assuming integer values, the median would be $$m = L + \left( \frac{\frac{N}{2} - F}{f}\right)C = 79.5 + \left( \frac{\frac{22}{2} - 9}{6}\right)\cdot 10 = 82 \frac{5}{6}.$$ Everything in my line graph below would be shifted left by 1/2.
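To see the effect of the boundary choice side by side, here is a small sketch. It assumes integer classes written with inclusive limits (so that the class before 80-89 ends at 79); the helper names `lower_boundary` and `grouped_median` are hypothetical.

```python
from fractions import Fraction

def lower_boundary(prev_upper_limit, lower_limit):
    """Class boundary: halfway between the previous class's upper limit
    and this class's lower limit."""
    return Fraction(prev_upper_limit + lower_limit, 2)

def grouped_median(L, N, F, f, C):
    """Standard estimate L + ((N/2 - F) / f) * C."""
    return L + (Fraction(N, 2) - F) / f * C

L_integer = lower_boundary(79, 80)                  # 159/2 = 79.5
print(grouped_median(L_integer, 22, 9, 6, 10))      # 497/6, i.e. 82 5/6
print(grouped_median(80, 22, 9, 6, 10))             # 250/3, i.e. 83 1/3
```

The two estimates differ by exactly 1/2, the amount by which the boundary moved.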

In the page above, the implication is that we would use the continuous CDF (for your example) like this:

[Sketch: piecewise-linear ogive over the class boundaries 60, 70, 80, 90, 100, rising from 0 to n = 22. At x = 80 the cumulative frequency is F = 9; across the median class of width C = 10 it rises by f = 6; a dotted horizontal line marks the height 11.]

Linear interpolation puts the median 2/6 of the way from 80 to 90, giving 83 1/3 again.
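The same interpolation can be carried out on the whole table without picking the median class by hand. This is a rough sketch under the piecewise-linear assumption; the function name `median_from_ogive` is illustrative, and the classes are taken as continuous with boundaries 60, 70, 80, 90, 100.

```python
from fractions import Fraction
from itertools import accumulate

def median_from_ogive(boundaries, freqs):
    """Find x where a piecewise-linear ogive reaches half the total count.

    boundaries: class boundaries, one more entry than freqs
    freqs:      integer frequency of each class
    """
    cum = [0] + list(accumulate(freqs))   # cumulative frequency at each boundary
    target = Fraction(cum[-1], 2)         # N/2
    for i, f in enumerate(freqs):
        # First non-empty class whose segment of the ogive crosses N/2
        if f > 0 and cum[i] <= target <= cum[i + 1]:
            x0, x1 = boundaries[i], boundaries[i + 1]
            return x0 + (target - cum[i]) / f * (x1 - x0)
    raise ValueError("frequencies must contain at least one positive count")

print(median_from_ogive([60, 70, 80, 90, 100], [4, 5, 6, 7]))  # 250/3
```

The loop lands on the class 80-90, where the cumulative frequency rises from 9 to 15, and interpolates to 80 + (2/6)(10) = 83 1/3.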

Here is a more accurate graph:

The graph vindicates the formula.

The difference between my first approach and yours is that I was a little more careful to distribute the values uniformly within the entire interval; whereas your last value is right at the end of the interval (and, I think, really in the next interval!). The fact that this results in the same answer obtained for a piecewise-linear CDF is encouraging.

Note, though, that if we really had integer data, we couldn’t uniformly distribute 6 values across 10 units; that’s another sense in which the formula is only approximate. It necessarily assumes a continuous distribution, in addition to the piecewise-linear CDF.