Yes, this would fall under my fourth answer, and depends very much on the nature of the data you are given (which is not necessarily normal). Unfortunately, such modeling is beyond my level of expertise; so I would revert to my earlier suggestions unless precision was worth hiring a professional statistician!

]]>Could you use a normal distribution on the data in the other classes and model its skewness and kurtosis, then from this give an estimate for the upper limit?

I have run across a similar problem calculating the average wind speed in an area over the year, where the data is presented as the number of days that had wind speeds in certain classes. To further complicate this problem, there were varying class widths!

I'm not sure how to adjust and model a distribution curve from partial data, though. Sorry I can't help much further; hopefully someone else can!

]]>Reread my answer to Pramod (2016), who used the (N+1)/2 (simulated) data value as you want to do. With my correction to how he was distributing data values across the class, I showed that we get the same answer as the formula containing N/2. My conclusion was that his work, based on modeling the grouped data with evenly spaced data points, was just another (but harder) way to get the same result as the formula that I derived from a piecewise-linear continuous CDF.

Again, the definition of the median for a continuous distribution is the value such that the cumulative probability is 1/2; we just multiplied N by that. In the formula, we aren’t “finding the N/2th observation”; we are measuring 1/2 on a continuous scale from 0 to N. Pramod used the (N+1)/2th observation, and got the same result. So the two approaches are not in conflict, but equivalent.
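To see that agreement concretely, here is a minimal Python sketch. The data are partly assumed: the frequencies 4, 5, and 6 and the class 80 to 90 match the example in this thread, but the lowest boundary (60) and the last frequency (4) are made up so that N is defined. The function names are my own, and this checks only this one data set, not every case.

```python
from fractions import Fraction

# Hypothetical grouped data: (lower boundary, upper boundary, frequency).
# 4, 5 and 6 follow the example in this thread; the rest is assumed.
classes = [(60, 70, 4), (70, 80, 5), (80, 90, 6), (90, 100, 4)]


def median_by_formula(classes):
    """L + (N/2 - cf) / f * w, read off a piecewise-linear CDF."""
    n = sum(f for _, _, f in classes)
    cf = 0  # cumulative frequency below the current class
    for lo, hi, f in classes:
        if f > 0 and cf + f >= Fraction(n, 2):
            return lo + (Fraction(n, 2) - cf) / f * (hi - lo)
        cf += f


def simulated_values(classes):
    """Put each class's f points at the centers of f equal sub-intervals."""
    values = []
    for lo, hi, f in classes:
        if f == 0:
            continue  # an empty class contributes no points
        step = Fraction(hi - lo, f)
        values += [lo + step * (2 * k + 1) / 2 for k in range(f)]
    return values


def median_by_position(values):
    """The (N+1)/2-th value, averaging the two middle values when N is even."""
    n = len(values)
    return (values[(n - 1) // 2] + values[n // 2]) / 2


print(median_by_formula(classes))                      # 485/6
print(median_by_position(simulated_values(classes)))   # 485/6
```

Both come out to 485/6 = 80 + 5/6. With N = 19, the 10th simulated value is the first one in the class 80 to 90, which sits exactly where the formula measures N/2 = 9.5 on the continuous scale.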

I have to think about this more, but I suspect that what happens here may be related to the “continuity correction” (subtracting 1/2) that we use when we apply the normal distribution to estimate a binomial CDF.

]]>Hi, Aditya.

First, on the gaps: That’s a fair question; to me it seemed obvious, perhaps from experience spacing pictures evenly on a wall, but it does deserve explanation.

One way to think of it is that I want each point to be in the **middle** of an equal **region**, so in effect I divided the class into 6 equal parts and put the data points at the center of each. In the example, the first region would be from 80 to 80 + 5/3, and the middle of that is at 80 + 5/6.

To make it more visual, here is a simplified example in which I have two classes, each 12 units wide, and I want to put 3 points evenly spaced in each. If I made all **gaps** equal, including those at the end of a class, I would divide by 4 (since there are **4 gaps**) and space the points 3 apart in each class, at 3, 6, and 9 units from its lower end:

But instead, I divide by 3 (since there are **3 points**), and put each point in the middle of an interval with width 4, at 2, 6, and 10 units from the lower end of the class:

Which looks more evenly spaced? The latter, I think! The spacing to the end points of the classes is not meaningful; the spacing between data points is. At the boundaries of a class, that spacing is shared with the adjoining class, which is why we want half as much.
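The two schemes can also be compared numerically. This is only a sketch: the class boundaries 0, 12, and 24 are assumed for illustration (all that matters is that each class is 12 wide), and the function names are my own.

```python
def equal_gap_points(lo, hi, n):
    """Divide by n + 1: all gaps equal, including the gaps to the class ends."""
    gap = (hi - lo) / (n + 1)
    return [lo + gap * (k + 1) for k in range(n)]


def midpoint_points(lo, hi, n):
    """Divide by n: each point sits at the center of its own sub-interval."""
    width = (hi - lo) / n
    return [lo + width * (k + 0.5) for k in range(n)]


def gaps(points):
    """Distances between consecutive points."""
    return [b - a for a, b in zip(points, points[1:])]


# Two adjacent classes, each 12 wide, with 3 points in each.
# The boundaries 0, 12, 24 are assumed; only the widths are given.
bounds = [(0, 12), (12, 24)]

equal_gap = [p for lo, hi in bounds for p in equal_gap_points(lo, hi, 3)]
midpoint = [p for lo, hi in bounds for p in midpoint_points(lo, hi, 3)]

print(gaps(equal_gap))  # [3.0, 3.0, 6.0, 3.0, 3.0]
print(gaps(midpoint))   # [4.0, 4.0, 4.0, 4.0, 4.0]
```

Dividing by 4 leaves a double-width gap of 6 across the class boundary, while dividing by 3 keeps every gap between neighboring points at 4, including the one that straddles the boundary.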

Now, on the ogive: Actually, we are assuming that **no** data point is exactly 80 (just as in my pictures above I put no data points at the boundaries). The point (80, 9) represents the given fact that 9 data points are **less than** 80. (That is, 4 + 5 are in the previous classes.) This is what the CDF means. It does not imply that the 10th is at 80.

But in your thinking about this, keep in mind that nothing we do can be exact, because we don’t know the actual data points anyway. All we can use is the data we have. The CDF is just a convenient approximation to reality, in which the 80 is an arbitrary location, not necessarily relevant to the actual data.
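If it helps, here is a small Python sketch of how the ogive points can be built and read. Again the data are partly assumed: the boundaries 70, 80, 90 and the frequencies 4, 5, 6 come from our example, while the outer boundaries (60 and 100) and the last frequency (4) are made up. The helper `count_below` is my own name, and it needs Python 3.8 or later for the `initial` argument of `accumulate`.

```python
from itertools import accumulate

# Hypothetical class boundaries and frequencies; only 70, 80, 90 and
# the frequencies 4, 5, 6 come from the discussion, the rest is assumed.
boundaries = [60, 70, 80, 90, 100]
frequencies = [4, 5, 6, 4]

# Each ogive point (x, c) says: c data values are less than x.
ogive = list(zip(boundaries, accumulate(frequencies, initial=0)))
print(ogive)  # [(60, 0), (70, 4), (80, 9), (90, 15), (100, 19)]


def count_below(x, ogive):
    """Estimated number of values below x, by linear interpolation."""
    for (x0, c0), (x1, c1) in zip(ogive, ogive[1:]):
        if x0 <= x <= x1:
            return c0 + (c1 - c0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range of the classes")


print(count_below(80, ogive))  # 9.0
print(count_below(85, ogive))  # 12.0
```

The point (80, 9) is a count of values below 80, not the location of the 9th or 10th value; between boundaries the count is only an interpolated estimate.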

Sir, I am a 10th-grade student from India. I was not able to understand why, while uniformly distributing the data over the median class, you took the first and last gaps as 5/6 and all other gaps as 5/3. As we have to distribute the data uniformly, shouldn’t all gaps be equal?

Also, sir, while drawing the ogive graph above, according to my understanding the point (80, 9) represents that the 9th observation is assumed to be 80. But sir, shouldn’t it be corrected to (80, 10), because the 9th observation belongs to the previous class of 70-80, and the 10th observation belongs to 80-90? Please explain. Awaiting your kind reply. Thanks.

]]>You asked in Turkish, “How did you solve this?”. We’ll see whether Carlos has more to say. But as I said, it was probably a “non-solution” like those I discussed.

]]>