Slow and Fast Ways to Solve a Probability Problem

Last week’s discussion reminded me of another question, from July, about a probability problem that was solved in a hard (but educational) way and an easy way. This instance is more extreme, and, due to its length, requires extreme editing in order to fit here.

Will the fifth random number be new?

The question came from Geoff (in Australia):

I would like to pose a question to you about probability.

The easiest way I can pose my question is to invent an example of it and that’s what I’ve done here:

1) Imagine we have a machine that can randomly generate the numbers 0,1,2,3,4,5,6,7,8,9. (That is the numbers 0 to 9 only).
2) The machine only generates the numbers one at a time and it’s a perfectly random generation, that is, the probability of any one of those numbers being generated is the same, being 1 in 10 (1/10) for each generation.
3) In this example, we begin by turning the machine on and we will generate numbers one at a time (and as explained above the machine only randomly generates numbers from 0 to 9).
4) We let the machine generate the first 4 numbers. In this example I will just make up any 4 numbers, so let’s just say the first four numbers generated are: 3,8,5 and 0 in that order.

My question is: Is it reasonable to expect that there is a greater likelihood that the 5th number generated randomly by the machine will be a different number to the first 4 numbers?

In other words, since the first 4 numbers generated (in this example 3,8,5,0) have had their ‘turn’, isn’t there a better-than-even chance that the 5th number is more likely to be one of the numbers that hasn’t come up already? That is, either: 1,2,4,6,7 or 9 ?

My feeling is that as we continue along generating the 10 numbers, there is always a greater likelihood that the next number that comes up will be a number that hasn’t come up already. That is not to say that all of the numbers are going to come up in 10 ‘generations’ – there could be all kinds of combinations etc., but since each number has the same probability of coming up on each ‘generation’ doesn’t it therefore follow that if a number hasn’t already come up it’s more likely to come up next time than a number that has already come up ?

I know how to calculate the probability of the 5th number being different from the first 4, but as we go along it gets more complicated doesn’t it?

Is there a formula you can suggest to calculate what are probabilities, as we go along generating the 10 random numbers one after the other, that the next consecutive number will be a different number than the one(s) already generated ? And bearing in mind that there could be duplicates or more of the same number or numbers being generated.

As I initially read this, some of what he says seems to verge on the “Gambler’s Fallacy“: it is not, in general, true that a new number is more likely to come up than one that has been seen. That is true here, only because no more than four numbers have been seen yet! If we keep generating more numbers, the likelihood of a new number approaches zero.

Also, I understand his phrase “greater likelihood” to mean “more likely than not”, rather than comparing to some other event – that is, a “better than even chance”. He is not saying that the probability increases as more numbers are generated.

Here is a sample (from Excel) of 20 sets of 5 random numbers; red marks fifth numbers that are different from any of the first four:

The observed probability is 11/20, or 55%. If this is typical, then the answer to his question is yes: The fifth number is more likely than not to be “new”.

But note that the first four numbers frequently include duplicates, as he mentioned; so it would not be enough just to say that there are six “new” numbers out of 10 possibilities, making the probability 6/10 = 60%. It might be even greater, if fewer numbers have been used. (In fact, this quickly answers the specific question he asked: The probability of a new number is more than 50%; we’ll be working out the precise probability.)

I initially responded with our usual question:

Please show your work for what you say here:

I know how to calculate the probability of the 5th number being different from the first 4.

That will help me see how close you are to an answer to the main question, and what sort of help you need.

Five cases: the start of a long method

Geoff responded with a reasonable attempt based on breaking the situation down into cases:

In calculating what is the probability of the 5th generated being different from the first 4 numbers generated:

There are some different possible combination outcome scenarios for the first 4 numbers and I will work through each:

1) The first 4 numbers generated are all different (e.g. 3, 8, 5, 0), then there are 6 remaining numbers that are different (in this example: 1,2,4,6,7,9), and since any one of those 6 remaining numbers will fulfill my requirement “the 5th number being different from the first 4” – the probability of the 5th number generated being different in this case will be: 6/10 or 0.6 or 60%

2) 2 of the first 4 numbers generated are the same number (e.g. 3, 3, 8, 5 but any array of numbers as long as only 2 of them are the same), then there are 7 remaining numbers that are different, and since any one of those numbers will fulfill my requirement “the 5th number being different from the first 4” – the probability of the 5th number generated being different in this case will be: 7/10 or 0.7 or 70%

3) 3 of the first 4 numbers generated are the same (e.g. 3, 3, 7, 3), then there are 8 remaining numbers that are different, and since any one of those 8 remaining numbers will fulfill my requirement “the 5th number being different from the first 4” – the probability of the 5th number generated being different in this case will be: 8/10 or 0.8 or 80%

4) The first 4 numbers generated are all the same number (e.g. 2,2,2,2), then there are 9 remaining numbers that are different, and since any one of those 9 numbers will fulfill my requirement “the 5th number being different from the first 4” – the probability of the 5th number generated being different in this case will be: 9/10 or 0.9 or 90%

5) The first 4 numbers generated results in 2 pairs of numbers (e.g. 8,0,8,0), then there are 8 remaining numbers that are different, and since any one of those 8 numbers will fulfill my requirement “the 5th number being different from the first 4” – the probability of the 5th number generated being different in this case will be: 8/10 or 0.8 or 80%

So I’ve identified 5 possible combination outcome scenarios for the first 4 numbers generated, each with its own probability attached to the likelihood of the 5th number generated being different:

When the first 4 numbers are:

All different: probability of different 5th number = 6/10

2 of the numbers are the same: probability of different 5th number = 7/10

3 of the numbers are the same: probability of different 5th number = 8/10

All the numbers are the same: probability of different 5th number = 9/10

2 pairs of numbers: probability of different 5th number = 8/10

Actually, I now realise that I don’t entirely know how to calculate the probability that the 5th number will be different from the first 4 numbers generated because I’ve got 5 probabilities for 5 different possible scenarios that can all come up randomly for the first 4 numbers and I don’t know how to combine those 5 probabilities to get the final ‘all-encompassing’ probability.

So how can we combine these separate probabilities for different scenarios to end up with a single probability that explains:

“The probability that the 5th number generated will be different from the first 4 numbers generated”?

If you have a formula for this, please explain.

In each case, the probability of the fifth being different is just the number of unused numbers (that is, 10 minus the number of different numbers used), over the total number of possibilities, 10.

He has done very well as far as he went; but the work to combine these cases will take some time.

I answered:

Hi, Geoff.

Thanks for writing back. You are thinking in a valid direction.

It looked at first as if you could be falling for the Gambler’s Fallacy, which is the idea that outcomes that have occurred less often are more likely to occur in the future, to make up for it. But you emphasize that each number always has the same chance, so I don’t think that’s how you’re thinking.

An alternative (teaser)

Then I suggested an alternative approach, which he chose not to try (and I, in line with our usual plan, encouraged him to continue with his own approach):

There is a simple way to find a general formula for this. I find it easiest to imagine that all 5 digits have been selected randomly already, put in a row and hidden:

_ _ _ _ _

Your question amounts to uncovering the first four and asking for the probability that the last one is different from any of them. But suppose instead you uncover the last one, and ask, what is the probability that this number is not among the hidden numbers? That is, what is the probability that n random numbers will not include one given number? If you think very carefully, you can see that this is equivalent to your question, and it’s very easy to answer.

I’ll leave you to try solving that (and challenge me if you think it isn’t really right!).

We’ll come back and look at my suggestion at the end.

How to combine the cases: conditional probability

But first, I made this comment on his list of cases:

Now, let’s look at the work you’ve shown here, which can be completed to give a solution, which will agree with my way:

When the first 4 numbers are:

All different: probability of different 5th number = 6/10

2 of the numbers are the same: probability of different 5th number = 7/10

3 of the numbers are the same: probability of different 5th number = 8/10

All the numbers are the same: probability of different 5th number = 9/10

2 pairs of numbers: probability of different 5th number = 8/10

These are correct conditional probabilities; we can express them as, e.g.

P(5th different | first 4 all different) = 6/10

If you are unfamiliar with the notation, this means “the probability that the 5th is different, given that the first 4 are all different”.

For brevity, I like to name each case by giving an example, like this:

P(5th different | abcd) = 6/10

P(5th different | abcc) = 7/10

P(5th different | abbb) = 8/10

P(5th different | aaaa) = 9/10

P(5th different | aabb) = 8/10

(In this notation, I have to keep in mind that the letters in each case could be any order, so for example 1251 is an example of type abcc, if you chose a=2, b=5, c=1 and then rearranged. The two that are the same could be anywhere.)

As to combining these, I said,

What you need here is to calculate the probability of each case and multiply by the conditional probabilities you have found; in effect, you will be calculating a weighted average of your probabilities.

That is, the probability you want is the sum of the probabilities of 5 different ways it could happen:

P(5th different) = P(5th different and abcd) + P(5th different and abcc) + P(5th different and abbb) + P(5th different and aaaa) + P(5th different and aabb)

Do you see why?

But we can calculate each of those addends, using the fact that

P(A and B) = P(A) * P(B | A)

which is essentially the definition of conditional probability.

See if this gives you enough understanding to finish your work. The probabilities you have to calculate are a little tricky, especially the aabb case; I used permutations and combinations, and a little more in that case.

It’s a nice problem

For example, $$P(\text{5th different and abcd}) = P(\text{abcd}) P(\text{5th different}|\text{abcd}),$$ and we have found that $$P(\text{5th different}|\text{abcd})=\frac{6}{10}.$$ So we “just” need to find the probability of each case.

Case method: abcd and aaaa

His next reply contained a couple correct parts; I’m going to omit various false steps and show only what’s good:

Hello Dave,

Thanks again for your explanations above.

Here’s my calculation to obtain the probability of the 5th. randomly generated number (from 0 to 9 inclusive) being different from the first 4:

To calculate the probability of each case, divide the number of all possible combinations of each case by the total number of all possible combinations of the first 4 numbers generated.

The number of all possible combinations of the first 4 numbers generated = 10 x 10 x 10 x 10 = 10,000

1) Case abcd : all possible combinations = 10 x 9 x 8 x 7 = 5,040 now divide by 10,000 (all possible combinations of the first 4 numbers generated) 5,040 / 10,000 = 0.504 (probability of this case) multiply by 0.6 conditional probability for this case = 0.3024 weighted average

P(5th different and abcd) = 0.3024

This is correct. The numerator is $_{10}P_4$, the number of ways to permute 4 of the 10 distinct digits; so $$P(\text{abcd})=\frac{_{10}P_4}{10^4}=\frac{10\cdot9\cdot8\cdot7}{10\cdot10\cdot10\cdot10}=\frac{5040}{10000}=0.5040$$

$$P(\text{5th different and abcd})=P(\text{abcd})P(\text{5th different}|\text{abcd})=0.5040\cdot\frac{6}{10}=0.3024$$

4) Case aaaa : all possible combinations = 10 x 1 x 1 x 1 = 10

(10 / 10,000) x 0.9 (conditional probability) = 0.0009

P(5th different and aaaa) = 0.0009

This is correct. There are 10 ways to choose the one digit used, and no more to choose, so

$$P(\text{aaaa})=\frac{10}{10^4}=\frac{10}{10000}=0.001$$

$$P(\text{5th different and aaaa})=P(\text{aaaa})P(\text{5th different}|\text{aaaa})=0.001\cdot\frac{9}{10}=0.0009$$

Case method: aabb

After more attempts, he started to flag, and I gave a last hint:

I have been encouraging you to do as much as you can yourself, expecting that at some point we would reach a point where you would feel stuck. I’ll now tell you almost all that you need, but leave just a little that you may choose to work our yourself, or not. I very much appreciate your attitude, and have been impressed that you’ve gone this far!

I’ll do the hardest case for you, the last one; if you choose to use this example to work out the others, feel free, or just ask me to show the rest of the work.

We want to find the probability of “two pairs”, which I called aabb. How many ways are there to choose such a set of numbers? First, we need to choose two distinct numbers; that can be done in 10C2 = (10!)/(2!8!) = (10*9)/(2*1) = 45 ways. Then we have to choose where to put those two numbers. Call the smaller one “a”, and choose two places to put it. This can be done in 4C2 = (4!)/(2!2!) = (4*3)/(2*1) = 6 ways. In all, we have 45*6 = 270 ways to make aabb; dividing this by 10^4 = 10000, the probability is 0.0270; this will, in the end, be multiplied by 0.8.

So we have

$$P(\text{aabb})=\frac{_{10}C_2\cdot_4\!C_2}{10^4}=\frac{45\cdot6}{10000}=\frac{270}{10000}=0.0270$$

$$P(\text{5th different and aabb})=P(\text{aabb})P(\text{5th different}|\text{aabb})=0.0270\cdot\frac{8}{10}=0.0216$$

Case method: abcc and abbb

After yet more tries, he decided we’d reached the limits of his knowledge, and I had mercy:

I figured you probably weren’t up to speed on combinations and the like, so I probably should have given a reference:

Permutations and Combinations: An Introduction

And since I’ve dragged you along as far as you are willing to go (farther than I expected), it’s time to finish the work for you.

The two remaining cases are what I called abcc and abbb. Here I used both permutations and combinations.

For abcc, we first pick 3 of 10 digits to be a, b, c in order, which can be done in 10P3 = (10!)/(7!) = 10*9*8 = 720 ways; then we have to pick which two places will be the same, which can be done in 4C2 = (4!)/(2!2!) = (4*3)/(2*1) = 6 ways. This makes a total of 720*6 = 4320, so P(abcc) = 0.4320.

Similarly, for abbb, we first pick 2 of 10 digits to be a and b in order, which can be done in 10P2 = (10!)/(8!) = 10*9 = 90 ways; then we have to pick which three places will be the same, which can be done in 4C3 = (4!)/(3!1!) = (4*3*2)/(3*2*1) = 4 ways. This makes a total of 90*4 = 360, so P(abcc) = 0.0360.

This gives $$P(\text{abcc})=\frac{_{10}P_3\cdot_4\!C_2}{10^4}=\frac{720\cdot6}{10000}=\frac{4320}{10000}=0.4320$$
$$P(\text{abbb})=\frac{_{10}C_2\cdot_4\!C_3}{10^4}=\frac{90\cdot4}{10000}=\frac{360}{10000}=0.0360$$

Case method: Conclusion

So here is what we have:

abcd: 0.5040 * 0.6 = 0.3024

abcc: 0.4320 * 0.7 = 0.3024

abbb: 0.0360 * 0.8 = 0.0288

aaaa: 0.0010 * 0.9 = 0.0009

aabb: 0.0270 * 0.8 = 0.0216

sum = 0.6561

That’s the probability we’ve been after: 65.61%. We were right that it is more than 60%, but not by much.

The alternative: reversing perspective to save work

Here is my quick method, which I had hinted at:

We generate 4 digits, then a 5th, and want the probability that the latter is not one of the former. If we reverse our perspective, taking the 5th digit as known, then what is the probability that none of the first 4 have that value? The probability that one digit is not a given value is 9/10. Repeat that for all 4 of them, and the probability that all of them are different from the 5th is (9/10)^4 = 0.6561. (So your guess of 0.63 is pretty close!)

I had actually done this first; when I got the same answer the long way, I was sure it was right. That took several tries!

I added,

Don’t feel bad about “wasting time” with the slow method; as I’ve said, in this field, having two methods is good, and this way gave you more exposure to more ideas. That’s why I let you continue.

Extending the problem

Now, let’s experiment a little. What is the probability that the tenth number is new (different from the first nine)? By the same reasoning, the probability that none of the first 9 number is the value we are going to get for the tenth is $$\left(\frac{9}{10}\right)^9\approx0.3874\approx38.7\%.$$ This is higher than one might think (there is at least one unused number, so it will be at least 10%); but the important thing is that it is less than 50%.

Here is a sample of 20 sets of 10 random numbers; red marks tenth numbers that are different from any of the first nine:

The observed probability is 7/20, or 35%. This is close to the theoretical probability.

Now, how likely is it that the 20th random digit we make is one we haven’t seen yet? That will be $$\left(\frac{9}{10}\right)^{19}\approx0.1351\approx13.5\%.$$ Of course, it’s absolutely certain that we’ve had duplicates before then, since we can’t have more than 10 unique digits; this is just the chance that this one digit is not a duplicate.

The case method would have been essentially impossible for this bigger problem!