5/26/2015

## Review: Comparing Variables

In previous chapters, we looked for relationships (associations) between variables by:

• Comparing categorical variables with contingency tables and stacked barplots
• Comparing numeric variables across groups with side-by-side boxplots
• Looking at how variables change over time with timeplots

In this chapter, we will:

• Look for relationships between two numeric variables

## The Data

Recall the Motor Trend Cars data from previous chapters:

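|                   | mpg  | cyl | disp | hp  | wt    | qsec  | vs | am     |
|-------------------|------|-----|------|-----|-------|-------|----|--------|
| Mazda RX4         | 21.0 | 6   | 160  | 110 | 2.620 | 16.46 | V  | auto   |
| Mazda RX4 Wag     | 21.0 | 6   | 160  | 110 | 2.875 | 17.02 | V  | auto   |
| Datsun 710        | 22.8 | 4   | 108  | 93  | 2.320 | 18.61 | S  | auto   |
| Hornet 4 Drive    | 21.4 | 6   | 258  | 110 | 3.215 | 19.44 | S  | manual |
| Hornet Sportabout | 18.7 | 8   | 360  | 175 | 3.440 | 17.02 | V  | manual |
| Valiant           | 18.1 | 6   | 225  | 105 | 3.460 | 20.22 | S  | manual |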

We might want to know:

• Is there a relationship between engine displacement (size) and horsepower?
• Is weight related to fuel efficiency?

## Overview

How do we find relationships between numeric (quantitative) variables?

• Visually: using scatterplots
• Numerically: using the correlation coefficient
• Usually, we do both
• In this course, we will only focus on linear relationships

## Scatterplots

How to make scatterplots:

• Define one variable as the $$X$$ variable, and one as $$Y$$
• Draw a point for each observation, using the values of the $$X$$ and $$Y$$ variables as coordinates
• Typically, the $$X$$ variable is on the horizontal axis and the $$Y$$ variable on the vertical axis

What we look for:

• Is there a trend or pattern?
• Are there any outliers or unusual points?
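The plotting steps in this section can be sketched in a few lines of Python. This is only a rough text plot, not how StatCrunch draws its graphs. It uses the six rows from the Motor Trend Cars table, with displacement as $$X$$ and horsepower as $$Y$$; the function name `text_scatter` and the grid size are assumptions made for illustration.

```python
# Six rows copied from the Motor Trend Cars table shown earlier.
disp = [160, 160, 108, 258, 360, 225]   # X: engine displacement
hp = [110, 110, 93, 110, 175, 105]      # Y: horsepower


def text_scatter(x, y, width=40, height=10):
    """Return a rough text scatterplot: X runs across, Y runs up.

    Hypothetical helper for illustration; assumes x and y each
    contain at least two different values.
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each observation's (X, Y) values to a grid column and row.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


plot = text_scatter(disp, hp)
print(plot)
```

The two Mazda rows have identical displacement and horsepower, so they land on the same spot. Overlapping points are easy to miss in any scatterplot.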

## Horsepower vs. Displacement

Is there a relationship?

• As engines get bigger, they tend to have more horsepower
• We call this a positive association

Are there any unusual points?

• There is a point well above the rest
• Notice that its engine size is right in the middle (about 300 cu. in.), but its horsepower is larger than that of any other car

## Fuel Efficiency vs. Weight

Is there a relationship?

• As cars get heavier, they tend to have lower fuel efficiency
• We call this a negative association

Are there any outliers?

• No points fall far away from the rest

## Types of Relationships

There are many types of trends that can come up when we make scatterplots. In this class, we will focus on the most common:

• Linear: The trend can be described fairly well by a straight line
• Non-linear: Any other type of trend

Directions of Relationships

• Positive: As one variable goes up, so does the other one
• Negative: As one variable goes up, the other goes down

Why lines?

• In statistics, we often try to find the simplest adequate method. Lines are simple.

## Roles of Variables

How do we decide which is $$X$$ and which is $$Y$$?

The $$X$$ Variable is:

• The explanatory or independent variable.
• We want to know if changes in this variable explain changes in $$Y$$

The $$Y$$ Variable is:

• The response or dependent variable
• We want to see if this variable responds when we change $$X$$

Which is which depends on what question we're asking.

## Variable Role Examples

Horsepower vs. Engine Displacement

• It makes sense that giving a car a bigger engine gives it more power.
• We can't just give a car more horsepower; horsepower responds to changes we make to the car.
• Horsepower should be $$Y$$, and Engine Displacement should be $$X$$.

Fuel Efficiency vs. Weight

• Making a car heavier should mean that it takes more fuel to move it.
• Fuel efficiency responds to changes in the properties of the car.
• Fuel Efficiency should be $$Y$$, and Weight should be $$X$$.

## Measuring the Strength

How do we measure how strong the relationship is?

• We use the correlation coefficient
• $$r = \frac{\sum z_y \times z_x}{n-1}$$
• StatCrunch will find this for us
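To see what the formula does, here is a minimal Python sketch that computes $$r$$ by hand for the six cars in the table above (displacement as $$X$$, horsepower as $$Y$$). The helper names `z_scores` and `correlation` are made up for this example, and the z-scores use the sample standard deviation (dividing by $$n-1$$).

```python
from statistics import mean, stdev

# Six rows copied from the Motor Trend Cars table shown earlier.
disp = [160, 160, 108, 258, 360, 225]   # X: engine displacement
hp = [110, 110, 93, 110, 175, 105]      # Y: horsepower


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """r = sum(z_x * z_y) / (n - 1), as in the formula above."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


r = correlation(disp, hp)
print(round(r, 2))
```

With only six cars, this value describes just these rows, not the full data set.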

What is the correlation coefficient?

• $$r$$ measures the strength of the linear relationship between two numeric variables
• It tells us how well a straight line explains the relationship

## Interpreting Correlation

• $$-1 \le r \le 1$$
• The magnitude of $$r$$ (its distance from 0) tells us the strength
• The sign of $$r$$ tells us the direction
• $$r = 1$$: the points make a perfect straight line with a positive slope
• $$r = -1$$: the points make a perfect straight line with a negative slope
• $$r = 0$$: there is no linear relationship at all

Notes:

• You can sometimes get high correlations even if the relationship isn't linear
• You should always look at a scatterplot along with a correlation coefficient to judge whether or not the coefficient is meaningful
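A small Python sketch of why the scatterplot matters, using made-up data (not the cars data). The first relationship is curved but always increasing, and $$r$$ is still close to 1. The second is a perfect U shape, and $$r$$ is 0 even though the variables are clearly related. The function name `correlation` is an assumption for this example.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Curved but always increasing: y = x^2 for x = 1, ..., 10.
x_up = list(range(1, 11))
r_curve = correlation(x_up, [v ** 2 for v in x_up])

# U-shaped: y = x^2 for x = -5, ..., 5.
x_sym = list(range(-5, 6))
r_u = correlation(x_sym, [v ** 2 for v in x_sym])

print(round(r_curve, 2), round(r_u, 2))
```

Neither number alone reveals the curve; only the plot does.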

## Using the Correlation Coefficient

So how do we use $$r$$?

• First, make a scatterplot
• There needs to be a linear association, or $$r$$ is meaningless
• Check the sign: is the relationship positive or negative?
• Check the value: how strong is the relationship?
• Are there outliers? The correlation is very sensitive to them.

Note:

• We often use the terms weak, moderate, and strong to describe the relationship, but these are up to interpretation.

## Outliers in Correlation

• $$r$$ is the strength of the overall linear relationship in the data
• If we have a point that falls far away from the pattern of the rest, it will usually decrease the strength of the relationship
• If we remove such an outlier, it will drive $$r$$ away from 0 and towards -1 or 1
• If we add such an outlier, it will drive $$r$$ towards 0

These relationships also hold if we alter a point's values (e.g., correct a typo in the data set)

• Moving a point towards the rest strengthens the correlation ($$r$$ moves away from 0)
• Moving it away from the rest weakens the correlation ($$r$$ moves towards 0)
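The effect of a single point can be checked with a short Python sketch. The numbers below are invented for illustration: eight points that lie close to a straight line, plus one extra point placed far above the pattern and then moved back towards it. The function name `correlation` is an assumption for this example.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: eight points close to the line y = 2x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2]
r_clean = correlation(x, y)

# Add one point far above the pattern (for example, a data-entry typo).
r_with_outlier = correlation(x + [4.5], y + [30.0])

# Correct the typo: move the same point back towards the rest.
r_moved = correlation(x + [4.5], y + [9.0])

print(round(r_clean, 2), round(r_with_outlier, 2), round(r_moved, 2))
```

Adding the stray point pulls $$r$$ towards 0, and moving it back restores a strong correlation.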

## More Properties of Correlation

• $$r$$ is unitless
• $$r$$ is not affected by changes of center or scale
• If we change units, the correlation will not change (e.g., $$\text{lbs} \to \text{kg}$$)
• The correlation of $$X$$ and $$Y$$ is the same as the correlation between $$Z_x$$ and $$Z_y$$ (their z-scores)
• The correlation stays the same if we flip $$X$$ and $$Y$$
• Correlation only applies to relationships between numeric variables. If there is an association involving categorical variables, it is not correlation.
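Several of these properties can be verified numerically. The Python sketch below uses the weight and fuel efficiency columns from the six rows of the cars table. Treating `wt` as thousands of pounds is an assumption for the unit conversion, and the helper names are made up for this example.

```python
from math import sqrt
from statistics import mean, stdev


def correlation(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def z_scores(values):
    """Standardize values using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Six rows copied from the Motor Trend Cars table shown earlier.
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460]   # assumed to be in 1000 lbs
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1]

r = correlation(wt, mpg)

wt_kg = [w * 1000 * 0.45359237 for w in wt]        # change of scale: lbs to kg
mpg_shifted = [m - 10 for m in mpg]                # change of center
r_units = correlation(wt_kg, mpg_shifted)

r_flipped = correlation(mpg, wt)                   # swap the roles of X and Y
r_z = correlation(z_scores(wt), z_scores(mpg))     # correlate the z-scores

print(round(r, 4), round(r_units, 4), round(r_flipped, 4), round(r_z, 4))
```

All four values agree up to floating-point rounding.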

## In StatCrunch

Scatterplots:

1. Graph $$\to$$ Scatter Plot
2. X Column $$\to$$ Select your explanatory $$(X)$$ variable
3. Y Column $$\to$$ Select your response $$(Y)$$ variable
4. Compute!

Correlation:

1. Stat $$\to$$ Summary Stats $$\to$$ Correlation
2. Select Column(s) $$\to$$ Hold Shift/Ctrl/Command to select multiple variables (note: if you select more than two variables, it will find all pair-wise correlations)
3. Compute!
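To illustrate what "all pair-wise correlations" means, here is a Python sketch (not StatCrunch's own procedure) that computes $$r$$ for every pair among four numeric columns, using the six rows of the cars table. The names `cars`, `correlation`, and `pairwise` are assumptions for this example.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Four numeric columns, six rows each, copied from the table shown earlier.
cars = {
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7, 18.1],
    "disp": [160, 160, 108, 258, 360, 225],
    "hp": [110, 110, 93, 110, 175, 105],
    "wt": [2.620, 2.875, 2.320, 3.215, 3.440, 3.460],
}


def correlation(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# One r for every unordered pair of columns: 4 columns give 6 pairs.
pairwise = {
    (a, b): correlation(cars[a], cars[b])
    for a, b in combinations(cars, 2)
}

for (a, b), r in pairwise.items():
    print(f"{a:>4} vs {b:<4} r = {r:+.2f}")
```

Because the correlation is the same when $$X$$ and $$Y$$ are flipped, each pair only needs to be computed once.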

## Correlation $$\ne$$ Causation

Most people are familiar with the phrase "correlation does not equal causation," but what does that really mean?

• Even if we find a correlation between two variables, it does not mean that one causes the other.
• Misleading correlations are especially common when two things both increase or decrease over time.
• Both may be caused by other, unknown variables.
• We call these unknown variables lurking variables or confounding variables.

For example:

• What if we looked at the correlation between national ice cream sales and the number of forest fires, recorded for each month of the year?
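A quick Python simulation can show how a lurking variable produces such a correlation. Every number below is invented for illustration and none of it is real sales or fire data. Both monthly series are built from a hypothetical temperature series plus small wiggles, and neither one is built from the other.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Pearson correlation coefficient r for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical average temperature for each month, January to December.
temp = [30, 34, 45, 56, 66, 75, 80, 78, 70, 58, 45, 34]

# Made-up month-to-month wiggles that have nothing to do with temperature.
wiggle_a = [5, -4, 2, -6, 3, -2, 4, -5, 1, 6, -3, -1]
wiggle_b = [-2, 3, -1, 2, -3, 1, 2, -1, 3, -2, 1, -3]

# Each series depends on temperature only, never on the other series.
ice_cream = [50 + 3 * t + a for t, a in zip(temp, wiggle_a)]
fires = [t / 2 + b for t, b in zip(temp, wiggle_b)]

r = correlation(ice_cream, fires)
print(round(r, 2))
```

In this made-up setup, ice cream sales and forest fires are strongly correlated even though neither causes the other. Temperature plays the role of the lurking variable.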