Generate a data set with a given correlation coefficient

Recently, I found myself needing to create scatterplots that represented specific values for the correlation coefficient r. This was for a writing project, but it is something that has come up with teaching as well. Showing students scatterplots for many different values of r seems to really help them conceptually, especially when it comes to understanding that not every data set with the same correlation will look exactly the same. Unfortunately,  I have always been at the mercy of what examples I can find online or in textbooks. With this in mind, I set out to figure this problem out once and for all.

The problem: Given a desired correlation coefficient, generate a data set.

As it turns out, this is not that difficult of a problem! Using this overall solution, I wrote a simple function in R.

make_data_corr = function(corr, n){
x = rnorm(n,0,1)
y = rnorm(n,0,1)
a = corr/(1-corr^2)^0.5
the_data = data.frame(x,z)

The inputs here are corr (the desired value for the correlation coefficient) and n (the desired number of paired data values). You will notice that I didn’t add any kind of validation or anything like that to this function, so if you put in a strange value for r or n, you are on your own. The resulting output is a data frame with your data set being x and z. Here is an example of it in action:


scatterplot-RAt smaller sample sizes, the correlation coefficient is CLOSE but not exact. Here, r = 0.92 but when I ran the function again with n = 350 I ended up with r = 0.83. For my purposes this is good enough, but it is a consideration for possible improvements (at this stage, I haven’t thought about how to approach this).

Eventually I may make this into a small webapp that anyone can use (including myself). Until then, if you find a use for this or find a way to make this better, certainly let me know. It is an interesting little problem to play with!

Leave a Reply