# Multiple Linear Regression

*Linear regression* describes the relationship between variables with a straight line, finding the best fit by minimizing the model's error.
**Simple linear regression** uses one independent variable and one dependent variable; **multiple linear regression** uses more than one independent variable with one dependent variable; and **multivariate linear regression** uses more than one dependent variable. I most often use multiple linear regression, so that's what I'll discuss here.

### 1. Load your packages

```
library(ggplot2)
library(dplyr)
library(broom)
library(ggpubr)
library(moderndive)
library(readr)
library(readxl)
library(skimr)
library(tidyverse)
```

### 2. Load the data

```
data = read_xls("PS206767-553247439.xls", sheet = 2)
data = data %>%
  select(1, 4, 10, 39) %>%                 # keep ID, gender, height, and dose
  mutate(height = as.numeric(`Height (cm)`),
         dose = as.numeric(`Therapeutic Dose of Warfarin`),
         gender = as.factor(Gender)) %>%
  select(1, 5:7) %>%                       # keep the ID plus the three new columns
  filter(gender != "NA") %>%               # drop the literal "NA" factor level
  drop_na()
```

`data` should have appeared in your `Global Environment` (in the upper right side of the RStudio screen).

After you load data, always check it out to make sure you called what you meant to…I like `skim` or `glimpse` for this.

`skim(data)`

| Name | data |
|---|---|
| Number of rows | 4443 |
| Number of columns | 4 |
| Column type frequency: | |
| character | 1 |
| factor | 1 |
| numeric | 2 |
| Group variables | None |

**Variable type: character**

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| PharmGKB Subject ID | 0 | 1 | 11 | 11 | 0 | 4443 | 0 |

**Variable type: factor**

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | mal: 2637, fem: 1806, NA: 0 |

**Variable type: numeric**

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| height | 0 | 1 | 168.06 | 10.84 | 124.97 | 160.02 | 167.89 | 176.02 | 202 | ▁▂▇▆▁ |
| dose | 0 | 1 | 31.23 | 17.06 | 2.10 | 20.00 | 28.00 | 39.97 | 315 | ▇▁▁▁▁ |

`glimpse(data)`

```
Rows: 4,443
Columns: 4
$ `PharmGKB Subject ID` <chr> "PA135312261", "PA135312262", "PA135312263", "P…
$ height <dbl> 193.040, 176.530, 162.560, 182.245, 167.640, 17…
$ dose <dbl> 49, 42, 53, 28, 42, 49, 42, 71, 18, 17, 97, 49,…
$ gender <fct> male, female, female, male, male, male, male, m…
```

*Check for correlation*

`data %>% get_correlation(dose ~ height, na.rm = T)`

```
# A tibble: 1 x 1
    cor
  <dbl>
1 0.315
```

Just to get an idea of the linear relationship…there is a correlation of about 0.32 between our outcome variable, dose, and the continuous predictor, height.
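If you want to see what `get_correlation()` is doing, it's a wrapper around base R's `cor()`. A quick sketch on the built-in `mtcars` data (a stand-in, since the warfarin file isn't bundled with this post):

```r
# Pearson correlation with base R's cor(); mtcars stands in for our data
r <- cor(mtcars$mpg, mtcars$wt)  # fuel economy vs. car weight
round(r, 3)                      # -0.868
```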

### 3. Check your assumptions

There are a number of assumptions that must hold in order to properly use multiple regression.

**Independence of observations**

The statistical assumption of independence of observations stipulates that each participant in the sample is counted only once. This is satisfied here because we're using one observation per person.

**Normality**

Use `hist()` to check the *dependent* variable for a normal distribution.

`hist(data$dose)`

This distribution is a lil' skewed, so we should log transform the data.

`data$outcome = log(data$dose)`

and re-plot with `hist(data$outcome)`…

This plot is much more bell-shaped, and we can move on.
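To see why the log transform helps, here's a minimal sketch with simulated right-skewed data (not the warfarin data; the lognormal parameters are arbitrary):

```r
set.seed(42)
x <- rlnorm(1000, meanlog = 3, sdlog = 0.5)  # simulated right-skewed values
hist(x)       # long right tail
hist(log(x))  # roughly bell-shaped after the log transform
```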

**Linearity** should be assessed using scatterplots of the outcome (dependent) variable against each of the included predictors (independent variables).

```
data %>%
  filter(gender != "NA") %>%
  ggplot(aes(outcome, height, group = gender, color = gender)) +
  geom_point() +
  geom_parallel_slopes(se = F) +
  theme_bw()
```

In this plot we’ve visualized both the numeric predictor variable, height, and the factor predictor, gender, by setting the aesthetics for each of these variables.

**Homoscedasticity**

The fourth assumption is equal variance among the residuals, which we will test after running the model by plotting them.

### 4. Run the model

Linear regression syntax is super straightforward in R. The model is given as `outcome ~ predictor + predictor` and you can call it with the `lm()` function. I've included the wrapper `summary()` so we can get the statistics back from the regression. Alternatively, use the `moderndive` package and `get_regression_table()` to get the same output.

```
fit = lm(data = data, outcome ~ gender + height)
summary(fit)
```

```
Call:
lm(formula = outcome ~ gender + height, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-1.95202 -0.28410  0.03265  0.32105  2.78849

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.7451039  0.1385046   -5.38 7.85e-08 ***
gendermale  -0.2720412  0.0190291  -14.30  < 2e-16 ***
height       0.0250611  0.0008626   29.05  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4912 on 4440 degrees of freedom
Multiple R-squared:  0.163,	Adjusted R-squared:  0.1626
F-statistic: 432.3 on 2 and 4440 DF,  p-value: < 2.2e-16
```

or use `moderndive`…

`get_regression_table(fit)`

```
# A tibble: 3 x 7
  term       estimate std_error statistic p_value lower_ci upper_ci
  <chr>         <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
1 intercept    -0.745     0.139     -5.38       0   -1.02    -0.474
2 gendermale   -0.272     0.019    -14.3        0   -0.309   -0.235
3 height        0.025     0.001     29.1        0    0.023    0.027
```
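Because the outcome is log(dose), the coefficients can be back-transformed with `exp()` to read them on the original dose scale. A quick arithmetic check using the estimates above:

```r
# Multiplicative effects on the original dose scale
exp(-0.272)  # ~0.76: males need roughly 24% lower doses, holding height fixed
exp(0.025)   # ~1.025: each additional cm of height ~ a 2.5% higher dose
```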

### 5. Plot Residuals

It's super important to come back and check homoscedasticity (equal variance) of the residuals. Do this by plotting the same `lm()` fit used previously. Residuals are calculated from model error, and you can think of them as unexplained variance.

```
par(mfrow = c(2,2)) ## show 4 plots at once
plot(fit) ## plot residuals
```

What you want here is, basically, horizontal lines centered around 0 in each of the plots, except the Q-Q plot, where the points should follow along the theoretical quantiles.
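If you'd like a numeric check to go along with the diagnostic plots, the Breusch-Pagan test from the `lmtest` package (not loaded above, so this is an extra assumption) formally tests for non-constant residual variance. A sketch on the built-in `mtcars`, since the warfarin file isn't bundled:

```r
library(lmtest)  # assumes lmtest is installed; not among the packages loaded above

fit_demo <- lm(mpg ~ wt + factor(am), data = mtcars)  # stand-in for `fit`
bptest(fit_demo)  # a small p-value would suggest heteroscedastic residuals
```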

```
data$predicted <- predict(fit) # Save the predicted values
data$residuals <- residuals(fit) # Save the residual values
# Quick look at the actual, predicted, and residual values
data %>% select(outcome, predicted, residuals) %>% head()
```

```
# A tibble: 6 x 3
outcome predicted residuals
<dbl> <dbl> <dbl>
1 3.89 3.82 0.0712
2 3.74 3.68 0.0587
3 3.97 3.33 0.641
4 3.33 3.55 -0.218
5 3.74 3.18 0.554
6 3.89 3.44 0.453
```

Here’s another, much prettier plot of residuals.

```
data %>%
  sample_frac(.5) %>%                      # plot a random half of the data
  gather(key = "gender", value = "x",
         -outcome, -predicted, -`PharmGKB Subject ID`,
         -residuals, -dose, -height) %>%   # get data into shape
  ggplot(aes(x = height, y = outcome)) +
  geom_segment(aes(xend = height, yend = predicted), alpha = .2) +
  geom_point(aes(color = residuals)) +
  scale_color_gradient2(low = "blue", mid = "white", high = "red") +
  guides(color = FALSE) +
  geom_point(aes(y = predicted), color = "gray", alpha = .5) +
  facet_grid(~ x, scales = "free_x") +     # split panels by gender
  theme_gray()
```

That’s really the basics of performing a linear regression in R.
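One last thing worth knowing: once you have a fitted model, `predict()` returns fitted values (and intervals) for new observations. A minimal sketch on the built-in `mtcars`, since the warfarin file isn't bundled; for the model above you'd instead pass a data frame with `gender` and `height` columns:

```r
fit_demo <- lm(mpg ~ wt + factor(am), data = mtcars)  # stand-in model
new_obs  <- data.frame(wt = 3.0, am = 1)              # hypothetical new car
predict(fit_demo, newdata = new_obs, interval = "confidence")
```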