Category: R

Pin codes and the birthday problem

In an introductory statistics class, there are always tons of ways to link back up to the “real world” – its one of the great things about the topic! One of my favorite things to talk about in class is the concept of bank pin codes. How many are possible? Are they really randomly distributed? How many people need to get together before you can be sure two people have the same pin code? So many questions and all of them can be easily explored!

My movie idea: the pigeonhole principle

Our romantic lead is at a baseball game, rushing between innings to the ATM. On the way, he bumps into a young woman and they drop all the stuff they’re carrying. Apologetic, he helps her up while gathering both of their things and is off to get his cash. It’s only later that he realizes that he accidentally used her ATM card! Not only that – he has some strange charges on his bank statement too! They must have mixed up cards… but is it really possible they share the same code? If so… they are clearly meant to be together…

[Naturally the rest of the movie is her thinking that he stole the card and then eventually them falling in love. At some point, a random person will vaguely explain the pigeonhole principle and give the movie its name – no stealing my idea]

Could this really happen? Yes! In fact, as we will see later, you probably wouldn’t even need the pair to be at a baseball game to make it reasonably likely. Before we get to that though, let’s look at the basic math.

How many pin codes are possible?

We will work with pin codes are 4 digits long and can be any whole number 0 – 9. In general, the multiplication principle states that if you can break an action into steps, then the number of ways you can complete the action is found by multiplying the number of ways you can complete each step. For pin codes, we can think of selecting each digit as a step. There are 10 choices for each step so:

10 \times 10 \times 10 \times 10 = 10\,000

This shows there are 10,000 possible pin codes.

How many people do you need to get together to guarantee at least two share a pin code?

The pigeonhole principle is a really simple idea that can be used to prove all kinds of things, from the complex to the silly. This rule states that if you have n pigeonholes but n + 1 (or more) pigeons, then at least 2 pigeons are sharing a pigeonhole. Another way to think about it: if a classroom has 32 chairs, but the class has 33 students… well, someone is sitting on someone’s lap (and class just got weird).

For pin codes, this means that you only need 10,001 people together to guarantee that at least two share the same pin code. You can imagine that if you were handing out pin codes, you would run out at the 10,000th person. At that point, you would have to give them the same code as someone else.

There are definitely more than 10,000 people at a baseball game –So the movie idea works! I’m going to be a millionaire!  Ok back to math…

Simulations show you need way fewer than 10,000 people

Everything we have talked about so far is based on the idea that pin codes are randomly selected by people. That is, we are assuming that any pin code has the same chance of being used by a person as any other. For now, we will continue that assumption but as you can imagine, people definitely don’t behave this way.

The question

If we randomly assign pin codes, how many assignments (on average) will there be before there is a repeated code? We know by the pigeonhole principle that it is guaranteed after 10,000. But what is the typical number? For sure, through randomness, it often takes less than 10,000 right?

The simulation

For the sake of ease with writing the code, we will assign each pin code a whole number 1 – 10,000. So you can imagine the code 0000 is 1, the code 0001 is 2, and the code 9999 is 10,000. This way, assigning a pin code is really just assigning a random number from 1 to 10,000. In fact, we can now state the problem as:

“How many random selections from {1,2,…,10000} until there is a repeated value?”

Let’s look at the code!

pin_codes=1:10000          #this is the set 1 through 10,000
selected=rep(0,10000)      #placeholder for selected pins
count = 0                  
i = 1                      
repeat{                    #loop that keeps selecting pin codes
  pin = sample(pin_codes,1, replace=TRUE) #select the pin
  if (pin %in% selected){   #if already selected then stop
     break 
     }
selected[i]=pin             #put the selected pin into "selected"
count = count +1            #keep track of how many you selected
i = i+1}
count                       #type this to see how many were selected before a repeat

If you copy and paste this code into R, you might be surprised. My results the first three times were: 120, 227, 41.

This seems to suggest that through random selection, it only took assigning 120 pin codes before a repeat (in the first trial), 227 (in the second) and the in the last trial – it only took 41 times! This can’t be right?!

Loop it

Maybe through randomness, we just had some unusual trials. Running 500 or 1000 trials should show the overall trend. The code below is the same (almost) but I wrote a loop around it so that it would repeat the same experiment 500 times. If you try this on your computer note that it is a little slow (I didn’t consider efficiency at all when writing this).

pin_codes=1:10000
counts=rep(0,500)
for (i in 1:500){
  selected=rep(0,10000)
  count = 0
  j = 1
    repeat{
      pin = sample(pin_codes,1,replace=TRUE)
          if (pin %in% selected){
             break}
      selected[j]=pin
      count = count +1
      j = j+1}
counts[i]=count}

Type in “mean(counts)” and it will give us the mean number of times that pin codes were randomly assigned before a repeat. The result?

mean(counts)
[1] 124.486

This tells us that on average, it only takes about 124 assignments before you see a repeat [1]. This is waaaay less than 10,000. What is going on?!

The birthday problem

A famous probability question is known as “the birthday problem“. Suppose that birthdays are equally likely to occur on any given day of the year (and include leap years). This means that there are 366 possible birthdays. By the pigeonhole principle, that means you are guaranteed to have two people share the same birthday as soon as you get 367 people together. But, the probability is almost 100% at only 70 people (and 50% at 23 people). This is another very counter-intuitive result.

Pin codes and birthdays?

This pin code problem is really the same as the birthday problem, just a little bit bigger or more generalized. Just imagine that there are 10,000 possible birthdays – we are looking at the number of people you need to have two people with the same birthday under this much larger set. Realizing this and researching a bit, you find that this is something that has been studied and in fact, the expected number of selections needed would be:

1 + \displaystyle\sum\limits_{k=1}^{10,000}\,\dfrac{10,000!}{(10,000-k)!10,000^k}

Plugging that mess into wolfram alpha gives the result:

1 + \displaystyle\sum\limits_{k=1}^{10,000}\,\dfrac{10,000!}{(10,000-k)!10,000^k} \approx 1 + 124.932 = 125.932

Look at that – even over just 500 trials, our simulation was really close. It only takes about 125 to 126 selection (on average) before you see a repeated pin code (assuming they are randomly selected).

126 – thats it!

But pin codes aren’t random

This is more true than you realize. If you look at the datagenetics article, you will see that instead of making up 1 of 10,000 (or 0.01%), the pin 1234 appears to actually make up more than 10% of codes. Just for fun, I went ahead and changed the code to account for the probability of the top 20. I then assigned all the remaining pin codes the remaining probability evenly – though this isn’t perfect as uncommon pin codes are REALLY uncommon.

pr=rep((1-0.2683)/9980,9980)
pin_codes=1:10000
counts=rep(0,500)
for (i in 1:500){
selected=rep(0,10000)
count = 0
j = 1
repeat{
pin = sample(pin_codes,1,replace=TRUE, prob = c(0.10713,0.06016,0.01881,0.01197,0.00745,0.00616,0.00613,0.00526,0.00516,0.00512,0.00451,0.00419,0.00395,0.00391,0.00366,0.00304,0.00303,0.00293,0.00290,0.00285,pr))
if (pin %in% selected){
break}
selected[j]=pin
count = count +1
j = j+1}
counts[i]=count}

This runs a loop of selecting pins until there is a repeat and then repeats that process 500 times (using these new probabilities for the first 20 codes). Checking the mean after one run I have:

mean(counts)
[1] 12.8

Even crazier! Considering how pin codes are not truly random at all, it looks like you would really only need around 12 to 13 people to have a repeat. Remember – there are 10,000 possibilities in general! ([2] – comment below on small correction made here)

Summary

Here are the numbers all together:

  • By the pigeonhole principle – you are guaranteed that two people share a pin code if the group is larger than 10,000. BUT:
    • If pin codes are randomly distributed: ~126 people (on average) are needed before two share a pin code
    • Using just a little of the data on the true distribution: Only ~13 people [2] (again, on average) are needed before two share a pin code!

Including more of the data we have on the distribution would probably bring that number down even further. As you can see, it is always very interesting to compare the theory to the reality.

Notes

[1] my code counts how many were selected and then stops counting when a repeat is encountered. So, it is really off by 1 from the expected number that you would select to have a repeat. This is a minor technicality overall but worth noting when you see the expected value formula which adds 1 to the sum.

[2] My original code had one of the probabilities as 0.0516 instead of 0.00516 and the mean after several runs was generally around 12. Fixing this probability, the mean seems to be a bit closer to 13 with several runs resulting in means of 12.5 to 12.8. It seems the top codes are really dominating the selection. It would be interesting to code in the details about the less likely pin codes (since they have a very tiny probability of being selected) and seeing if this is actually lower or not.

 

 

 

 

 

Generate a data set with a given correlation coefficient

Recently, I found myself needing to create scatterplots that represented specific values for the correlation coefficient r. This was for a writing project, but it is something that has come up with teaching as well. Showing students scatterplots for many different values of r seems to really help them conceptually, especially when it comes to understanding that not every data set with the same correlation will look exactly the same. Unfortunately,  I have always been at the mercy of what examples I can find online or in textbooks. With this in mind, I set out to figure this problem out once and for all.

The problem: Given a desired correlation coefficient, generate a data set.

As it turns out, this is not that difficult of a problem! Using this overall solution, I wrote a simple function in R.

make_data_corr = function(corr, n){
x = rnorm(n,0,1)
y = rnorm(n,0,1)
a = corr/(1-corr^2)^0.5
z=a*x+y
the_data = data.frame(x,z)
return(the_data)
}

The inputs here are corr (the desired value for the correlation coefficient) and n (the desired number of paired data values). You will notice that I didn’t add any kind of validation or anything like that to this function, so if you put in a strange value for r or n, you are on your own. The resulting output is a data frame with your data set being x and z. Here is an example of it in action:

example=make_data_corr(0.85,35)
plot(example$x,example$z)

scatterplot-RAt smaller sample sizes, the correlation coefficient is CLOSE but not exact. Here, r = 0.92 but when I ran the function again with n = 350 I ended up with r = 0.83. For my purposes this is good enough, but it is a consideration for possible improvements (at this stage, I haven’t thought about how to approach this).

Eventually I may make this into a small webapp that anyone can use (including myself). Until then, if you find a use for this or find a way to make this better, certainly let me know. It is an interesting little problem to play with!