Reading Data
Let's see what insights we can glean from a familiar dataset, mtcars. Recall that mtcars is a dataset built into R. We can access it as follows:
data <- datasets::mtcars
We can alternatively just type:
mtcars
But what if we want to work with a dataset that isn't built into R? This will likely be the case in any real-world data analysis project. The easiest way to read in data is from a CSV file.
You can download the mtcars dataset as a CSV file here.
Now save this file somewhere on your computer. We can read in the data as follows:
mtcars <- read.csv('path_to_file')
The required argument to the read.csv function is a filepath, which R interprets relative to your current working directory. To avoid unnecessary headaches, familiarize yourself with getwd(), which reports the working directory, and setwd(), which changes it.
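For instance, a typical session might look like the sketch below (the folder and filename here are hypothetical; substitute your own):

```r
# See where R currently resolves relative paths
getwd()

# Point R at the folder containing the file (hypothetical path)
setwd("~/Documents/stat101/data")

# A bare filename is now resolved relative to that folder
mtcars <- read.csv("mtcars.csv")
```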
Exploratory Data Analysis
It's worthwhile to familiarize yourself with a few quick functions that require little effort to run but say quite a bit about the data: str, summary, head, and tail.
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> tail(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2
Another key tool of exploratory data analysis (EDA) is plotting, which we touched upon last time.
Moreover, there are several functions that can help characterize data. We'll look at a few basic ones here, which give characteristics of 1-dimensional data:
> mean(mtcars$mpg)
[1] 20.09062
> min(mtcars$wt)
[1] 1.513
> median(mtcars$cyl)
[1] 6
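A few more one-dimensional summaries are worth keeping on hand; all of these are standard base R functions:

```r
sd(mtcars$mpg)        # standard deviation
var(mtcars$mpg)       # variance
range(mtcars$wt)      # smallest and largest values
quantile(mtcars$mpg, c(0.25, 0.5, 0.75))  # quartiles
```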
These are nice, but understanding relationships between variables is really what statistics is all about. Let's look at the pairwise correlations between variables.
> cor(mtcars)
mpg cyl disp hp drat wt
mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
qsec vs am gear carb
mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
TMI? Try a scatterplot matrix instead:
> plot(mtcars[,c("mpg", "wt", "hp")])
Performing a Linear Regression
Regressions are one of the best features of R. Few tools have an easier or more intuitive approach to regression.
Single-Variable Regression
> fit1 <- lm(mpg ~ wt, data = mtcars)
> summary(fit1)
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
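The summary printout is only one view of the model. The fitted lm object can also be queried directly with base R accessors; for example (the new weight value below is just an illustration):

```r
coef(fit1)      # intercept and slope as a named vector
confint(fit1)   # 95% confidence intervals for the coefficients

# Predicted mpg for a hypothetical car weighing 3,000 lbs (wt = 3)
predict(fit1, newdata = data.frame(wt = 3))
```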
Multiple Regression
> fit2 <- lm(mpg ~ wt + hp, data = mtcars)
> summary(fit2)
Call:
lm(formula = mpg ~ wt + hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.941 -1.600 -0.182 1.050 5.854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
wt -3.87783 0.63273 -6.129 1.12e-06 ***
hp -0.03177 0.00903 -3.519 0.00145 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
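Since fit1 is nested inside fit2, a natural follow-up question is whether hp adds explanatory power beyond wt. Base R's anova function performs the partial F-test comparing the two models:

```r
# Compare the nested models: does adding hp significantly improve the fit?
anova(fit1, fit2)
```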
Plotting Regression Results
Diagnostic Plot 1 - Fitted vs Actual
Perhaps the most important and intuitive of the diagnostic plots. How close were we?
> plot(mtcars$wt, mtcars$mpg)
> points(mtcars$wt, fit1$fitted.values, col = "red")
Diagnostic Plot 2 - Scatterplot of Residuals
Robert Stine considers this the most important diagnostic plot. It's also kind of nice: it lets you check for heteroskedasticity and for unexplained structure in the residuals.
> plot(fit1$residuals)
> abline(h = 0, col = "blue")

An extension of this is a histogram of the residuals (more effective when you have more data):
> hist(fit1$residuals, breaks = 10)
> abline(v = 0, col = "blue")
Not shown are other important diagnostic plots like residuals on lagged residuals, etc. Hopefully, the code above should help you figure out how to do this in R, but of course you can always just email us and ask for help.
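One shortcut worth knowing: R builds a standard set of diagnostic plots into every lm object. Calling plot on the fitted model produces residuals vs. fitted, a normal Q-Q plot, a scale-location plot, and residuals vs. leverage:

```r
par(mfrow = c(2, 2))  # arrange the four plots in a 2x2 grid
plot(fit1)            # the built-in lm diagnostic plots
par(mfrow = c(1, 1))  # restore the default single-plot layout
```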
Random Sampling
R is great at analyzing data -- but it's also great at generating data. We're about to look into some of the tools R offers, starting with the most basic:
Sampling without replacement
Say we have marbles numbered 1 through 10. We want to randomly choose 5 of them out of a bowl. We can simulate this process as follows:
> vect <- 1:10
> sample(vect, 5)
[1] 1 10 2 4 7
> sample(vect, 5)
[1] 2 6 3 5 9
> sample(vect, 5)
[1] 5 4 3 9 10
What if we want to draw 11 marbles? Obviously it won't work, since there are only 10 values to draw from!
> sample(vect, 11)
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
Sampling with replacement
What if, after each time we draw a marble, we put it back into the bowl? Then it is possible to draw the same marble multiple times. This is sampling with replacement, and it is quite easy in R:
> sample(vect, 11, replace = TRUE)
[1] 7 5 2 9 3 4 9 7 4 10 1
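One practical note: random draws differ on every run, so your numbers won't match the ones above. If you need reproducible results (for a homework write-up, say), fix the random seed first with set.seed:

```r
set.seed(42)  # any integer works; the same seed gives the same draws
sample(vect, 11, replace = TRUE)

set.seed(42)  # resetting the seed replays the identical sequence
sample(vect, 11, replace = TRUE)
```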
Continuous Distributions
Above, we sampled from a discrete uniform distribution. But what if we want to sample from a distribution with continuous values? It would be impossible to list every possible real number between 0 and 1, for example. Fortunately, R comes built-in with several probability distributions from which you can sample.
Say we want 50 observations from a uniform distribution between 1 and 10,000. Enter "runif" -- i.e. r (random) + unif (uniform distribution)
> samples <- runif(50, 1, 10000)
> head(samples)
[1] 5930.184 4875.951 6468.918 4097.659 8969.386 7580.914
We can also sample from more interesting distributions, like the normal distribution.
> normal_samples <- rnorm(5, 0, 1)
> normal_samples
[1] 0.2137312 -0.9722811 -1.0733392 0.2904004 -1.2296574
Here we specify that we want 5 samples from a normal distribution with mean 0 and standard deviation 1.
To understand the syntax behind any function call, including rnorm, simply type
> ?rnorm
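As the help page explains, rnorm is one of a family of four functions R provides for each distribution: the r prefix draws random samples, d gives the density, p the cumulative probability, and q the quantile function. For the standard normal:

```r
dnorm(0)      # density at 0, about 0.399
pnorm(1.96)   # P(Z <= 1.96), about 0.975
qnorm(0.975)  # 97.5th percentile, about 1.96
rnorm(3)      # three random draws
```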
Monte Carlo Simulation Example
Example 1
What is the distribution of means of a random sample of 10 values from a discrete uniform distribution of integers between 1 and 1,000?
> vect <- 1:1000
> ntrials <- 10000
> results <- numeric(ntrials)
> for (i in 1:ntrials) {
+ results[i] <- mean(sample(vect, 10))
+ }
> mean(results)
[1] 499.6885
> sd(results)
[1] 91.2873
> hist(results)
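As a sanity check, these simulated values line up with theory. A discrete uniform on 1 through N has mean (N + 1) / 2 and variance (N^2 - 1) / 12, and the standard deviation of a mean of n independent draws shrinks by a factor of sqrt(n). (We ignore the tiny finite-population correction from drawing 10 of 1,000 without replacement.)

```r
N <- 1000
n <- 10
(N + 1) / 2                     # theoretical mean: 500.5
sqrt((N^2 - 1) / 12) / sqrt(n)  # theoretical sd of the sample mean, about 91.3
```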
Example 2
We roll a 6-sided die three times. What might we expect the difference between the smallest and largest rolls to be?
> ntrials <- 10000
> results <- numeric(ntrials)
> for (i in 1:ntrials) {
+ rolls <- sample(c(1:6), 3, replace = TRUE)
+ results[i] <- max(rolls) - min(rolls)
+ }
> mean(results)
[1] 2.9236
> sd(results)
[1] 1.342737
> hist(results)
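For a problem this small we don't actually need simulation: there are only 6^3 = 216 equally likely ordered outcomes, so we can enumerate them all and compute the exact expected range, which the simulation above approximates:

```r
# All 216 ordered outcomes of three die rolls
rolls <- expand.grid(r1 = 1:6, r2 = 1:6, r3 = 1:6)

# Range (max minus min) of each outcome
ranges <- apply(rolls, 1, function(x) max(x) - min(x))

mean(ranges)  # exact expected range: 35/12, about 2.917
```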