Recently, I found myself needing to create scatterplots that represented specific values for the correlation coefficient r. This was for a writing project, but it is something that has come up with teaching as well. Showing students scatterplots for many different values of r seems to really help them conceptually, especially when it comes to understanding that not every data set with the same correlation will look exactly the same. Unfortunately, I have always been at the mercy of what examples I can find online or in textbooks. With this in mind, I set out to figure this problem out once and for all.
The problem: Given a desired correlation coefficient, generate a data set.
As it turns out, this is not a difficult problem! Using this overall solution, I wrote a simple function in R.
make_data_corr = function(corr, n){
  x = rnorm(n,0,1)
  y = rnorm(n,0,1)
  a = corr/(1-corr^2)^0.5
  z = a*x+y
  the_data = data.frame(x,z)
  return(the_data)
}
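For anyone wondering where the coefficient a comes from, here is a short derivation. It assumes only what the function already does: x and y are independent standard normal draws and z = a*x + y. The symbols \rho (the correlation of x and z in the underlying distribution) and r (the target value, corr in the code) are my notation, not something from the original solution.

```latex
\begin{align*}
\operatorname{Cov}(x, z) &= \operatorname{Cov}(x,\, a x + y) = a, \\
\operatorname{Var}(z) &= a^2 \operatorname{Var}(x) + \operatorname{Var}(y) = a^2 + 1, \\
\rho &= \frac{\operatorname{Cov}(x, z)}{\sqrt{\operatorname{Var}(x)\operatorname{Var}(z)}} = \frac{a}{\sqrt{a^2 + 1}}.
\end{align*}
Setting $\rho = r$ and solving for $a$:
\begin{align*}
r^2 (a^2 + 1) = a^2
\;\Longrightarrow\;
a^2 (1 - r^2) = r^2
\;\Longrightarrow\;
a = \frac{r}{\sqrt{1 - r^2}}, \qquad |r| < 1 .
\end{align*}
```

For example, r = 0.85 gives a of roughly 1.61. Note that this fixes the correlation of the distribution the data are drawn from, so any single sample will land near r rather than exactly on it.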
The inputs here are corr (the desired value for the correlation coefficient) and n (the desired number of paired data values). You will notice that I didn’t add any kind of validation to this function, so if you pass an invalid value for corr (the formula only works for values strictly between -1 and 1) or for n, you are on your own. The output is a data frame with two columns, x and z, which together make up your data set. Here is an example of it in action:
example = make_data_corr(0.85,35)
plot(example$x,example$z)
The function sets the correlation of the distribution the data are drawn from, so the correlation of any one sample is CLOSE to the target but not exact, and the gap tends to be larger at smaller sample sizes. Here, with n = 35, r = 0.92, but when I ran the function again with n = 350 I ended up with r = 0.83. For my purposes this is good enough, but it is a consideration for possible improvements (at this stage, I haven’t thought about how to approach this).
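One possible way to close that gap, offered as a sketch rather than a finished solution: remove from y the part that is linearly related to x in this particular sample, rescale both pieces, and then mix them. The function name make_data_exact_corr is hypothetical, and this is a different technique from the function above (it forces the sample correlation to match, at the cost of the points no longer being a plain random draw).

```r
# Sketch: data whose SAMPLE correlation equals corr (up to floating point).
# make_data_exact_corr is a hypothetical name, not part of the original post.
make_data_exact_corr <- function(corr, n) {
  stopifnot(is.numeric(corr), length(corr) == 1, abs(corr) <= 1, n >= 3)
  x <- rnorm(n, 0, 1)
  y <- rnorm(n, 0, 1)
  # Residuals of y on x have zero sample correlation with x
  e <- residuals(lm(y ~ x))
  # Standardize both pieces to sample mean 0 and sample sd 1
  xs <- as.numeric(scale(x))
  es <- as.numeric(scale(e))
  # Mix the two uncorrelated pieces so that cor(x, z) equals corr
  z <- corr * xs + sqrt(1 - corr^2) * es
  data.frame(x = x, z = z)
}

example_exact <- make_data_exact_corr(0.85, 35)
cor(example_exact$x, example_exact$z)
```

Because the residuals are uncorrelated with x within the sample, the mixing weights corr and sqrt(1 - corr^2) give z a sample variance of 1 and a sample correlation with x equal to corr, whatever n is.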
Eventually I may make this into a small webapp that anyone can use (including myself). Until then, if you find a use for this or find a way to make this better, certainly let me know. It is an interesting little problem to play with!
Dear Jeremi,
This is quite a nice formula for generating data with a specific correlation coefficient. I have a question: I would like to create a dataset in which the social status of parents and the type of further school they choose for their children have a certain correlation.
There are three values for social status (1 = lower, 2 = middle, 3 = upper class) and three values for school type (type 1, type 2 and type 3).
Would you know a solution for creating such a dataset in R?
Kind regards,
Guenter
Off the top of my head, I do not know any “easy” method. That doesn’t mean it is impossible, just that a solution would probably take some work. I’ll leave this comment here so perhaps someone else can chime in! Interesting problem though for sure!
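For anyone who wants to pick this up, a rough starting point might be to generate correlated continuous values and then cut each variable into three ordered groups. This is only a sketch under my own assumptions: the function name make_ordinal_data, the column names status and school, and the choice of equal-sized thirds are all hypothetical. Cutting into categories weakens the correlation, so the value measured on the 1-2-3 codes will generally be smaller in magnitude than the corr you pass in, and you would need to adjust the input by trial and error.

```r
# Sketch: two ordinal variables (values 1, 2, 3) built from correlated normals.
# All names here are hypothetical, chosen only for illustration.
make_ordinal_data <- function(corr, n) {
  stopifnot(is.numeric(corr), length(corr) == 1, abs(corr) < 1, n >= 3)
  x <- rnorm(n, 0, 1)
  # y has correlation corr with x in the underlying distribution
  y <- corr * x + sqrt(1 - corr^2) * rnorm(n, 0, 1)
  # Split a numeric vector into three roughly equal-sized ordered groups
  to_levels <- function(v) {
    cuts <- quantile(v, c(1/3, 2/3), names = FALSE)
    as.integer(cut(v, breaks = c(-Inf, cuts, Inf)))
  }
  data.frame(status = to_levels(x), school = to_levels(y))
}

survey <- make_ordinal_data(0.6, 300)
table(survey$status, survey$school)   # cross-tabulation of the two variables
cor(survey$status, survey$school)     # correlation on the 1-2-3 codes
```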