Coming up with realistic data for linear regression examples

Whether it is for writing or for teaching, I am always in need of new and useful data sets. Since the GAISE report was released, everyone who teaches stats in any form is hearing over and over the importance of using REAL LIFE DATA and I agree that this is a good general practice. But sometimes, you need a data set that illustrates a specific idea or students need a simpler context to start with before tackling the (often) more complicated real life applications. I love going through a long real life application in class, but when it comes time for a quiz or a test, I just need to know that students can apply the basic techniques and explain concepts as they apply to the situation.

Let’s say that I am writing a new exam item and need a simple linear regression data set. Students are going to use their TI-83 or TI-84 to get the correlation coefficient, the coefficient of determination, and the equation for the line, and finally interpret these values (and things like the slope or y-intercept) in context.

I know I am not alone in this, so I will show you how we can get a reasonable, but not exactly real, data set for them to work with. My favorite tool for this is R (you don’t need to be an expert programmer for this!) but I figure other similar tools will work just as well.

First the context: Suppose a company thinks there may be a linear relationship between the amount they spend on advertising each month (in thousands of dollars) and the total monthly sales (also in thousands of dollars). (**insert instructions to student about performing regression etc**)

In an exam question, I wouldn’t want too many data values as this increases the chance of calculator typos (so tough to grade! is it really a typo? did they know what they were doing?). So, I will come up with 8 realistic looking advertising amounts. It’s too easy to accidentally have a pattern in data I think of off the top of my head, so instead, I will use the rnorm function in R. This function lets me generate random values from a normal distribution.


This function works with the following inputs:

rnorm(how many data values, mean, standard deviation)

To make this work, I did need to decide on a reasonable mean amount of money spent on advertising each month and a reasonable standard deviation. As you can see, I told it to give me 8 random normal values from a distribution with a mean of 10.6 and a standard deviation of 3.7. But, oops, I should probably round these. Since I will be using this data later, I will do the rounding in R.
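For anyone following along without the screenshots, here is a minimal sketch of what those two steps might look like. The set.seed() call and its value are my own assumption, added only so the sketch gives the same numbers each time it is run; it will not reproduce the exact values used in this post.

```r
# Assumption: an arbitrary seed, used only to make this sketch repeatable.
set.seed(42)

# 8 random advertising amounts (thousands of dollars)
# drawn from a normal distribution with mean 10.6 and sd 3.7
advertising = rnorm(8, 10.6, 3.7)
advertising

# Round to 2 decimal places and store the result for later use
advertising = round(advertising, 2)
advertising
```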


There we go! Much better. Above, I used the round() function. I put in what I wanted to round as the first entry and how many decimals I wanted as the second entry. The next step is to get some good y-values (total sales) while keeping the linear relationship I would like. This requires another judgement call. I must decide what equation I should base these values on. I don’t know if there is one right answer here, but I often will do some googling to make it as realistic as possible.

For the sake of this example, I will just pick one here and say: y = 1.3x + 2.7 (where y is the sales for the month and x is the advertising spend, both in thousands). Since an exact fit would be very boring, I will add in some random error when calculating the sales values. For this one, I will go a little high with it by using a standard deviation of 3 (the mean should be zero).


For non-programmers, I will explain this code a bit.

The first line:

sales = c(0,0,0,0,0,0,0,0)

is where I initialize the variable sales. This code sets up sales as a set of 8 zeros. Each zero will then be replaced by the values I calculate in the for loop shown below. (sidenote: technically, in R, sales is a vector and not simply “a variable”, but this distinction isn’t important here)

for (i in 1:8) {
  sales[i] = 1.3*advertising[i] + 2.7 + rnorm(1, 0, 3)
}

With the for loop, I am telling R to take each entry of “advertising” (in R we start at an index of 1) and calculate a “sale” value using the equation I came up with along with a little error (adding the random normal value from rnorm). In the last line, you can see my resulting data.

[1] 15.32 20.62 22.56 10.74 6.85 17.48 18.13 12.18
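For readers a little more comfortable with R, the same calculation can also be sketched without a loop, since arithmetic in R works on a whole vector at once. This is an alternative, not the code I actually ran. The seed is arbitrary and the rounding to 2 decimals is my assumption, so the numbers it produces will differ from the ones above.

```r
# Assumption: an arbitrary seed, used only to make this sketch repeatable.
set.seed(1)

# The rounded advertising values used in this post (thousands of dollars)
advertising = c(12.29, 15.11, 14.44, 10.17, 3.56, 11.45, 11.10, 8.18)

# Apply y = 1.3x + 2.7 to every entry at once and add 8 random errors
# (mean 0, sd 3); rounding to 2 decimals is an assumption.
sales = round(1.3*advertising + 2.7 + rnorm(8, 0, 3), 2)
sales
```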

From here, it is worth looking to see how it all comes out when students work with this on their calculators. As an exam question, it should be pretty routine – a decent fit and not too crazy looking on the scatterplot.

[Screenshots: TI-84 linear regression output; TI-84 scatterplot]

Pretty good. Notice that my intercept is a little different than planned due to the error I added, but that is to be expected. Here is the final product:

Advertising Spend          Total Sales
(thousands of dollars)     (thousands of dollars)
12.29                      15.32
15.11                      20.62
14.44                      22.56
10.17                      10.74
3.56                       6.85
11.45                      17.48
11.10                      18.13
8.18                       12.18
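If you would rather double-check the finished data in R than on the calculator, something like the following sketch should give the same quantities students will be asked for. It uses the standard lm() and cor() functions; the names fit and r are just ones I picked for illustration.

```r
# Final data from the table above (both in thousands of dollars)
advertising = c(12.29, 15.11, 14.44, 10.17, 3.56, 11.45, 11.10, 8.18)
sales = c(15.32, 20.62, 22.56, 10.74, 6.85, 17.48, 18.13, 12.18)

# Least-squares line: sales as a function of advertising
fit = lm(sales ~ advertising)
coef(fit)   # intercept and slope

# Correlation coefficient and coefficient of determination
r = cor(advertising, sales)
r
r^2
```

The slope should land close to the 1.3 I planned, while the intercept can drift further from 2.7 because of the random error.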

Add in a story (what does the company make? what’s the company’s name? what is their motivation?) and you have a nice simple exam problem. This is also a great problem to talk about AFTER the exam. Can we expect that sales are always linearly related to spend? So if I spend more I will always sell more? These questions about extrapolation and the true application of a linear regression model are important in any statistics classroom and in applying these techniques in real life.