Generate a data set with a given correlation coefficient

Recently, I found myself needing to create scatterplots that represented specific values for the correlation coefficient r. This was for a writing project, but it is something that has come up with teaching as well. Showing students scatterplots for many different values of r seems to really help them conceptually, especially when it comes to understanding that not every data set with the same correlation will look exactly the same. Unfortunately, I have always been at the mercy of whatever examples I can find online or in textbooks. With this in mind, I set out to figure this problem out once and for all.

The problem: Given a desired correlation coefficient, generate a data set.

As it turns out, this is not that difficult a problem! Using this overall solution, I wrote a simple function in R.

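make_data_corr <- function(corr, n) {
  # x and y are independent standard normal samples of length n
  x <- rnorm(n, 0, 1)
  y <- rnorm(n, 0, 1)
  # weight on x chosen so that the population correlation of x and z equals corr
  a <- corr / sqrt(1 - corr^2)
  z <- a * x + y
  the_data <- data.frame(x, z)
  return(the_data)
}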
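For readers wondering where the coefficient a comes from, here is a short derivation. It assumes x and y are independent with mean 0 and variance 1, which is what the two rnorm calls draw from; the symbol \rho for the population correlation is introduced here only for illustration.

```latex
\begin{aligned}
z &= a x + y \\
\operatorname{Cov}(x, z) &= a \operatorname{Var}(x) + \operatorname{Cov}(x, y) = a \\
\operatorname{Var}(z) &= a^2 \operatorname{Var}(x) + \operatorname{Var}(y) = a^2 + 1 \\
\rho(x, z) &= \frac{\operatorname{Cov}(x, z)}{\sqrt{\operatorname{Var}(x)\operatorname{Var}(z)}} = \frac{a}{\sqrt{a^2 + 1}} \\
\rho(x, z) = r \;&\Longrightarrow\; a = \frac{r}{\sqrt{1 - r^2}}, \qquad |r| < 1
\end{aligned}
```

Note that this is a statement about the population correlation. The correlation computed from any one generated sample will scatter around r.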

The inputs here are corr (the desired value for the correlation coefficient) and n (the desired number of paired data values). You will notice that I didn’t add any kind of validation to this function, so if you put in a strange value for corr or n (for example, a corr of exactly 1 or -1, which makes the denominator zero), you are on your own. The resulting output is a data frame with two columns, x and z, which together form your data set. Here is an example of it in action:

example <- make_data_corr(0.85, 35)
plot(example$x, example$z)
cor(example$x, example$z)  # sample correlation; changes from run to run

[Figure: scatterplot-R]

At smaller sample sizes, the sample correlation coefficient is close to the target but not exact, because x and y are random draws. Here, r = 0.92, but when I ran the function again with n = 350 I ended up with r = 0.83. For my purposes this is good enough, but it is a consideration for possible improvements (at this stage, I haven’t thought about how to approach this).
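To get a feel for how much the sample correlation moves around at different sample sizes, a small simulation along these lines may help. It is only a sketch: the generator is repeated so the block runs on its own, and the helper name sample_r, the number of repetitions, and the seed are all arbitrary choices made for illustration.

```r
# Same generator as in the post, repeated here so the block is self-contained.
make_data_corr <- function(corr, n) {
  x <- rnorm(n)
  y <- rnorm(n)
  a <- corr / sqrt(1 - corr^2)
  data.frame(x = x, z = a * x + y)
}

# Hypothetical helper: sample correlations from `reps` independent runs.
sample_r <- function(corr, n, reps = 1000) {
  replicate(reps, {
    d <- make_data_corr(corr, n)
    cor(d$x, d$z)
  })
}

set.seed(42)  # arbitrary seed, only so the run is reproducible
r_small <- sample_r(0.85, 35)
r_large <- sample_r(0.85, 350)

# Spread of the sample correlation around the target for each sample size
round(c(n35 = sd(r_small), n350 = sd(r_large)), 3)

# Side-by-side histograms of the two sets of sample correlations
op <- par(mfrow = c(1, 2))
hist(r_small, main = "n = 35", xlab = "sample r")
hist(r_large, main = "n = 350", xlab = "sample r")
par(op)
```

The spread should come out noticeably wider for n = 35 than for n = 350, which is consistent with a single small sample landing some distance from the target.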

Eventually I may make this into a small webapp that anyone can use (including myself). Until then, if you find a use for this or find a way to make this better, certainly let me know. It is an interesting little problem to play with!
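One possible refinement, if the sample correlation needs to match the target exactly rather than approximately, is to remove the accidental correlation between x and the noise before combining them. The sketch below is an assumption about how that could be done, not something from the original function: it replaces y with the residuals from regressing y on x (which have zero sample correlation with x), standardizes both pieces, and then applies the same weight a. The function name make_data_exact_corr and the input checks are hypothetical additions.

```r
# Hypothetical variant: the sample correlation of x and z equals corr
# (up to floating-point rounding) for every run, not just on average.
make_data_exact_corr <- function(corr, n) {
  stopifnot(abs(corr) < 1, n >= 3)  # n >= 3 so the residuals are not all zero
  x <- rnorm(n)
  y <- rnorm(n)
  # Residuals of y regressed on x: sample mean 0, sample correlation 0 with x
  e <- residuals(lm(y ~ x))
  # Standardize so both pieces have sample variance exactly 1
  x_std <- (x - mean(x)) / sd(x)
  e_std <- e / sd(e)
  a <- corr / sqrt(1 - corr^2)
  z <- a * x_std + e_std
  data.frame(x = x, z = z)
}

set.seed(1)  # arbitrary seed, only so the run is reproducible
exact <- make_data_exact_corr(0.85, 35)
cor(exact$x, exact$z)  # 0.85, up to floating-point rounding
```

The trade-off is that z is no longer built from fully independent noise, so this version suits illustration (a plot that shows exactly r = 0.85) better than simulation studies where sampling variability is the point.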

2 responses on “Generate a data set with a given correlation coefficient”

  1. Dear Jeremi,
    this is quite a nice formula to generate data with a specific correlation coefficient. I have a question: I would like to create a dataset in which the social status of parents and the choice of further school for their children have a certain correlation.
    There are three values for social status (1=lower, 2=middle, 3=upper class) and three values for school type: type 1, type 2 and type 3.
    Would you know a solution for creating such a dataset in R?
    Kind regards,
    Guenter

    • Off the top of my head, I do not know any “easy” method. That doesn’t mean it is impossible though – but that a solution probably would take some work. I’ll leave this comment here so perhaps someone else can chime in! Interesting problem though for sure!
