# Coming up with realistic data for linear regression examples

Whether it is for writing or for teaching, I am always in need of new and useful data sets. Since the GAISE report was released, everyone who teaches stats in any form is hearing over and over the importance of using REAL LIFE DATA and I agree that this is a good general practice. But sometimes, you need a data set that illustrates a specific idea or students need a simpler context to start with before tackling the (often) more complicated real life applications. I love going through a long real life application in class, but when it comes time for a quiz or a test, I just need to know that students can apply the basic techniques and explain concepts as they apply to the situation.

Let’s say that I am writing a new exam item and need a simple linear regression data set. Students are going to use their TI-83 or TI-84 to get the correlation coefficient, the coefficient of determination, and the equation of the line, and then interpret these values (along with things like the slope or y-intercept) in context.

I know I am not alone in this, so I will show you how we can get a reasonable, but not exactly real, data set for them to work with. My favorite tool for this is R (you don’t need to be an expert programmer for this!) but I figure other similar tools will work just as well.

First the context: Suppose a company thinks there may be a linear relationship between the amount they spend on advertising each month (in thousands of dollars) and the total monthly sales (also in thousands of dollars). (**insert instructions to student about performing regression etc**)

In an exam question, I wouldn’t want too many data values, as this increases the chance of calculator typos (so tough to grade! is it really a typo? did they know what they were doing?). So, I will come up with 8 realistic-looking advertising amounts. It’s too easy to accidentally build a pattern into data I think of off the top of my head, so instead, I will use the rnorm function in R. This function lets me generate random values from a normal distribution.

This function works with the following inputs:

rnorm(how many data values, mean, standard deviation)

To make this work, I did need to decide on a reasonable mean amount of money spent on advertising each month and a reasonable standard deviation. As you can see, I told it to give me 8 random normal values from a distribution with a mean of 10.6 and a standard deviation of 3.7. But, oops, I should probably round these. Since I will be using this data later, I will do the rounding in R.
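For anyone following along at the keyboard, the two steps just described might look something like the sketch below. The seed is my own assumption, added only so the sketch gives the same values each time; it is not the seed behind the numbers used later in this post, so your values will differ from mine.

```r
# Hypothetical seed (an assumption, not from the original example) for repeatability
set.seed(2024)

# 8 random values from a normal distribution with mean 10.6 and sd 3.7
advertising <- rnorm(8, mean = 10.6, sd = 3.7)
advertising

# Round to 2 decimal places so the values are easy to type into a calculator
advertising <- round(advertising, 2)
advertising
```

Storing the rounded values back into `advertising` means every later step works with exactly the numbers students will see.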

There we go! Much better. Above, I used the round() function. I put in what I wanted to round as the first entry and how many decimals I wanted as the second entry. The next step is to get some good y-values (total sales) while keeping the linear relationship I would like. This requires another judgement call. I must decide what equation I should base these values on. I don’t know if there is one right answer here, but I often will do some googling to make it as realistic as possible.

For the sake of this example, I will just pick one and say: y = 1.3x + 2.7, where y is the sales for the month and x is the advertising spend (both in thousands of dollars). Since an exact fit would be very boring, I will add in some random error when calculating the sales values. For this one, I will go a little high by using a standard deviation of 3 for the error (its mean should be zero, so that the noise scatters the points around the line without shifting it).

For non-programmers, I will explain this code a bit.

The first line:

```r
sales = c(0,0,0,0,0,0,0,0)
```

is where I initialize the variable sales. This code sets up sales as a set of 8 zeros. Each zero will then be replaced by the values I calculate in the for loop shown below. (Sidenote: technically, in R, sales is a vector and not simply “a variable”, but this distinction isn’t important here.)

```r
for (i in 1:8) {
  sales[i] = 1.3 * advertising[i] + 2.7 + rnorm(1, 0, 3)
}
```

With the for loop, I am telling R to take each entry of “advertising” (in R we start at an index of 1) and calculate a “sale” value using the equation I came up with along with a little error (adding the random normal value from rnorm). In the last line, you can see my resulting data.

```
> sales
[1] 15.32 20.62 22.56 10.74 6.85 17.48 18.13 12.18
```
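If the loop feels like more machinery than you want, the same calculation can be sketched in one vectorized step, since R applies arithmetic to a whole vector at once. This is only an alternative sketch: the seed and the names `noise` and `sales_vec` are my own assumptions, and the rounding to two decimals is assumed from the way the values above are displayed. Because the error is random, it will not reproduce the exact sales values shown above.

```r
# Advertising values from this example (thousands of dollars)
advertising <- c(12.29, 15.11, 14.44, 10.17, 3.56, 11.45, 11.1, 8.18)

# Hypothetical seed (an assumption) so the sketch is repeatable
set.seed(42)

# One random error per advertising value: mean 0, standard deviation 3
noise <- rnorm(length(advertising), mean = 0, sd = 3)

# Planned line y = 1.3x + 2.7 plus the error, rounded to 2 decimals
sales_vec <- round(1.3 * advertising + 2.7 + noise, 2)
sales_vec
```

Using `length(advertising)` instead of a hard-coded 8 means the sketch still works if you decide on a different number of data values.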

From here, it is worth looking to see how it all comes out when students work with this on their calculators. As an exam question, it should be pretty routine – a decent fit and not too crazy looking on the scatterplot.

Pretty good. Notice that my intercept is a little different than planned due to the error I added, but that is to be expected. Here is the final product:
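If you would rather check the fit in R than on a calculator, a minimal sketch using the final data might look like the following. The names `fit` and `r` are my own; `lm()` and `cor()` are standard R functions, and they should agree with the calculator's linear regression output up to rounding.

```r
# Final data from this example (both in thousands of dollars)
advertising <- c(12.29, 15.11, 14.44, 10.17, 3.56, 11.45, 11.1, 8.18)
sales <- c(15.32, 20.62, 22.56, 10.74, 6.85, 17.48, 18.13, 12.18)

# Least-squares line: sales as a linear function of advertising
fit <- lm(sales ~ advertising)
coef(fit)  # intercept and slope, to compare with the planned 2.7 and 1.3

# Correlation coefficient and coefficient of determination
r <- cor(advertising, sales)
r
r^2
```

Running `plot(advertising, sales)` followed by `abline(fit)` would draw the scatterplot with the fitted line, which is a quick way to confirm nothing looks too strange before the data goes on an exam.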

| Advertising Spend (thousands of dollars) | Total Sales (thousands of dollars) |
| --- | --- |
| 12.29 | 15.32 |
| 15.11 | 20.62 |
| 14.44 | 22.56 |
| 10.17 | 10.74 |
| 3.56 | 6.85 |
| 11.45 | 17.48 |
| 11.1 | 18.13 |
| 8.18 | 12.18 |

Add in a story (what does the company make? what’s the company’s name? what is their motivation?) and you have a nice simple exam problem. This is also a great problem to talk about AFTER the exam. Can we expect that sales are always linearly related to spend? So if I spend more I will always sell more? These questions about extrapolation and the true application of a linear regression model are important in any statistics classroom and in applying these techniques in real life.

# Understanding the common core: HSS.IC.B.6 use simulations to decide if differences between parameters are significant

Although I am a college professor, a great deal of my freelance writing involves working with the common core state standards. Most of this time, especially at the beginning, was spent trying to decipher exactly what skills the common core is after and how to best assess or address those skills.

A particularly tough-to-interpret group of standards is in the domain “making inferences and justifying conclusions”. These standards are focused on helping students develop a deep intuition for statistical thinking. For example, a question like “a coin landed on tails 65 times out of 100 – is this enough to make us question if it is fair?” would be a part of this domain. All these standards require some really deep thinking on the part of students.
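To give a feel for how a simulation can speak to the coin question, here is a minimal sketch in R. The seed, the number of repetitions, and the variable names are all my own assumptions. The idea is to let the computer flip a fair coin 100 times, repeat that many times, and see how often 65 or more tails turn up.

```r
# Hypothetical seed and repetition count (assumptions for this sketch)
set.seed(1)
n_sims <- 10000

# Number of tails in 100 flips of a fair coin, repeated n_sims times
tails <- rbinom(n_sims, size = 100, prob = 0.5)

# Share of simulated runs with at least 65 tails
prop_extreme <- mean(tails >= 65)
prop_extreme
```

If that share turns out to be very small, then 65 tails would be a rare result for a fair coin, and that is exactly the kind of reasoning this domain is after.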

## HSS.IC.B.6

This standard states that students should be able to:

Use data from a randomized experiment to compare two treatments; use simulations to decide if differences between parameters are significant.

Many online resources interpret this as meaning that students should be able to use tools such as a two-sample t-test to compare two populations. Personally, I think this completely misses the mark of this entire domain of standards. At this level, we are not expecting high school students to apply hypothesis testing or confidence interval calculations formally. Instead, we want them to start thinking about the meaning behind these procedures before they see them formally presented at the college level or in an AP stats course. These types of ideas will help students have a much better understanding of the p-value and the whole process of hypothesis testing once these are introduced.

## An Example

Let’s use a typical question that would be aligned to this standard as a discussion tool. The data for this question and the resulting histogram were all generated in R (see the bottom of the post for code).

Suppose that two researchers want to determine if high school students who are offered encouraging remarks complete a difficult task faster, on average, than those who aren’t. In order to test this, they select two random samples of 25 high school students each. The first group is asked to work on a difficult puzzle and is offered no feedback as they work. The second group is asked to do the same but is also given encouraging comments such as “you almost got it” or “that’s a good idea” as they work. For the first group (no encouragement), the mean time to complete the puzzle was 28.1 minutes with a standard deviation of 6.7 minutes. For the second group, the mean time was 27.2 minutes with a standard deviation of 5.5 minutes.

In order to test the significance of this result, the researchers used a computer to randomly assign individual times to each group and then compute the new mean difference between the first and second groups. They then repeated this process 1,000 times and plotted all of the resulting differences on the plot below.

The question here might then ask students to determine if the observed difference between the means is statistically significant, or explain whether or not this should lead researchers to believe that those with encouragement will complete the task faster. Both are deep, critical-thinking questions that go beyond applying a formula.

Using the graph, we would hope that they would see that the observed difference of 28.1 – 27.2 = 0.9 minutes is within a range of values that is frequently observed when the groups are assigned randomly (it is not a rare difference – it came up a lot in simulation). Therefore, the experiment’s results are not statistically significant as they could be due to chance alone. Through resampling, they are able to see how the samples might behave if the differences WERE due to chance (as they were in the simulation).
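The “how often did a difference this large come up” reading can also be put into numbers. The sketch below is a condensed version of the simulation under my own assumptions: the seed and the names `observed`, `shuffled_diffs`, and `share_as_large` are hypothetical, and because the data are freshly generated, the observed difference will not be exactly the 0.9 minutes from the question.

```r
# Hypothetical seed (an assumption) so the sketch is repeatable
set.seed(123)

# Simulated puzzle times for the two groups of 25 students
no_encourage <- rnorm(25, 28.6, 7.1)
encourage <- rnorm(25, 27.1, 6.4)

# Difference in means actually observed in this simulated experiment
observed <- mean(no_encourage) - mean(encourage)

# Shuffle all 50 times into two new groups of 25, 1,000 times over,
# recording the difference in means for each shuffle
group <- c(encourage, no_encourage)
shuffled_diffs <- replicate(1000, {
  randomized <- sample(group)
  mean(randomized[1:25]) - mean(randomized[26:50])
})

# Share of shuffles with a difference at least as large as the observed one
share_as_large <- mean(shuffled_diffs >= observed)
share_as_large
```

A large share means the observed difference is the kind of thing random assignment produces all the time, while a very small share would suggest the encouragement is doing something. Calling `hist(shuffled_diffs)` and then `abline(v = observed)` would mark the observed difference on the histogram.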

As you can see, this type of question indirectly has students think about a p-value and its implications without formally introducing these ideas. Certainly they could run a two-sample t-test or similar, but that would be robotic compared to the critical thinking that the common core writers were hoping students would develop. The ultimate goal is to have students use computers or even physical simulation (such as using a special deck of cards, or even flipping coins) to understand uncertainty and, as mentioned, develop an intuition for statistical thinking in general.

If you are finding yourself still trying to wrap your mind around this standard, you might find the following related articles interesting: Why Resampling is Better than Hypothesis Tests and Confidence Intervals, which comments on a similar high school standard in New Zealand, and Resampling Statistics, which is an overview of techniques from East Carolina University.

#### R Code Used for This Example

```r
# create the data for the two groups
# (the sample means and standard deviations
# were calculated from these groups)
no_encourage = rnorm(25, 28.6, 7.1)
encourage = rnorm(25, 27.1, 6.4)
# create the combined group
group = c(encourage, no_encourage)
# initialize difference vector
diff = numeric(1000)
# resample: shuffle the combined times, split them into two new groups
# of 25, and record the difference in means
for (i in 1:1000) {
  randomized = sample(group)
  new_no_encourage = randomized[1:25]
  new_encourage = randomized[26:50]
  diff[i] = mean(new_no_encourage) - mean(new_encourage)
}
```