The Wharton School | Groups


Getting Started with R

Some inspirational words of wisdom to get us started:
A journey of a thousand miles begins with a single step - Laozi

Why Bother Learning R?

  • R is powerful -- it blows Excel completely out of the water in the calculations it can do.
  • R is replicable -- R is like a recipe, Excel is like a fully served meal on your plate. It's very difficult to retrace the steps that someone took to get the spreadsheet you see.
    • And if they changed values by hand rather than through formulas? Forget it.
  • R is a single-minded statistics machine
    • "Do one thing and do it well" - Doug McIlroy
    • Ever try to do a regression in R? Literally shorter than a tweet
    • Ever try to do a regression in Python? You have to import packages first, and it usually takes noticeably more code
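
To make the "shorter than a tweet" claim concrete, here is a minimal sketch. It uses mtcars, an example dataset that ships with R; the dataset and the mpg and wt columns are chosen only for illustration and are not part of the workshop data.

```r
# Fit a linear regression of fuel economy (mpg) on car weight (wt).
# mtcars is a built-in example dataset, used here only for illustration.
fit <- lm(mpg ~ wt, data = mtcars)

# Show the estimated intercept and slope.
print(coef(fit))
```

The whole model fit is the single lm() line; summary(fit) would print the usual regression table.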

Installing

  • The R Language (the actual programming language) can be found at:
    • Windows: https://cloud.r-project.org/bin/windows/base/R-3.5.1-win.exe
    • Mac: https://cloud.r-project.org/bin/macosx/R-3.5.1.pkg
  • RStudio -- a free Integrated Development Environment (IDE) -- can be found at:
    • https://www.rstudio.com/products/rstudio/download/#download

Baby Steps

Once you've got both programs installed, try running RStudio. In keeping with tradition, you can try your very first program by typing in:
print("Hello, World!")

 

Data Types

Data types exist to keep programmers like us from making serious mistakes. For example, the following piece of code works perfectly:

> 2 + 2
[1] 4

But the following breaks:

> 2 + "2"
Error in 2 + "2" : non-numeric argument to binary operator

That's because:

> typeof(2)
[1] "double"

But:

> typeof("2")
[1] "character"
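
If a number arrives as text, it can be converted before doing arithmetic. A small sketch using base R's as.numeric(); the variable names text_two and converted are made up for illustration.

```r
# A number stored as text has type "character".
text_two <- "2"

# as.numeric() converts it to a real number, so arithmetic works again.
converted <- as.numeric(text_two)
print(typeof(converted))   # "double"
print(2 + converted)       # 4
```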

Storing Values - Variables


Being able to perform simple computations like 2 + 2 is nice, but what if we want to store the result and do something with it later?  That's where variables come into play.

To assign a value to a variable, R uses the left arrow operator (<-) or the equals sign (=).

> var1 <- "hello"
> print(var1)
[1] "hello"


> var1 = "hello"
> print(var1)
[1] "hello"


Both forms are perfectly functional, but the left arrow operator is generally preferred.
 

A Series of Things - Vectors

Code can become extremely long and tedious if we need to assign an individual variable for each value we want to store.

Vectors provide a crucial tool for storing and manipulating data in R.  A vector contains elements of the same data type (e.g. logical, integer, or character).  Vectors are formed using the c() function, which stands for "combine" (often read as "concatenate").

> vector1 <- c(1, 2, 3, 4) #OK!
> vector2 <- c("U", "S", "S") #OK!
> vector3 <- c(1, "A", TRUE) #BAD!
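
Note that the line marked BAD does not raise an error. R quietly converts every element to a common type, which is easy to miss. A quick check, reusing the vector3 definition from above:

```r
# Mixing types in c() does not fail; everything is coerced to character.
vector3 <- c(1, "A", TRUE)
print(typeof(vector3))  # "character"
print(vector3)          # "1" "A" "TRUE"
```

The number 1 and the logical TRUE are now text, so arithmetic on them would no longer work.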


Vectors are an incredibly powerful data structure.  In addition to storing data efficiently, they are built for very fast and convenient operations.

> vector1 <- c(1, 2, 3, 4)
> vector2 <- vector1 * 2 #multiplies every element at once
> print(vector2)
[1] 2 4 6 8


It is easy to manipulate or extract subsets of vectors too!  See what happens if you run the lines below:

> v1 <- c('a', 'b', 'c', 'd')
> print(v1[1])
> print(v1[3:4])
> v1[4] <- 'e'
> print(v1[3:4])
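
Two details are worth knowing when subsetting: R counts positions from 1, not 0, and a negative index drops elements instead of selecting them. A short sketch with a made-up vector (letters4 and the other names are only for illustration):

```r
letters4 <- c("a", "b", "c", "d")

# Positions start at 1, so [1] is the first element.
first <- letters4[1]

# A negative index removes that position and keeps the rest.
without_first <- letters4[-1]

# A logical vector keeps the positions marked TRUE.
picked <- letters4[c(TRUE, FALSE, TRUE, FALSE)]

print(first)          # "a"
print(without_first)  # "b" "c" "d"
print(picked)         # "a" "c"
```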

 

Example: Payday

> week <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")
> year <- rep(week, 52)
> length(year)
[1] 364
> tail(year)
[1] "mon" "tue" "wed" "thu" "fri" "sat"
> year <- c(year, "sun")
> tail(year)
> year[15]
> year[30]
> year[c(15, 30)]
> seq(15, 365, 15) #start at 15: R has no position 0, so an index of 0 is silently ignored
> paydays <- year[seq(15, 365, 15)]
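
As a possible next step, the paydays can be tallied by day of the week. This sketch rebuilds the same year vector (a 365-day year starting on a Sunday, as above) and uses table() to count; the names payday_counts and weekend_paydays are made up for illustration.

```r
week <- c("sun", "mon", "tue", "wed", "thu", "fri", "sat")
year <- c(rep(week, 52), "sun")          # 365 days, starting on a Sunday
paydays <- year[seq(15, 365, by = 15)]   # every 15th day of the year

# Count how many paydays land on each day of the week.
payday_counts <- table(paydays)
print(payday_counts)

# How many paydays fall on a Saturday or Sunday?
weekend_paydays <- sum(paydays %in% c("sat", "sun"))
print(weekend_paydays)  # 7
```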

 

The Holy Grail - Dataframes

90% of the data you work with will be in the form of a dataframe, and the other 10% you will want to change into a dataframe.

Why? Because they're incredibly useful, and most of the data you work with comes in standard relational database format.

Edgar Codd proposed the relational model for databases back in 1969 (it was published in 1970), and things have tended to follow his model ever since:
  • Each row is a record or observation
  • Each column is a unique attribute
  • A given attribute for a given record should be atomic. R lets you fudge this, but promise me you won't
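
The three rules above can be seen in a small, made-up dataframe. This is only a sketch; the staff data and its column names are hypothetical.

```r
# Hypothetical records: one row per person, one column per attribute,
# and every cell holds a single (atomic) value.
staff <- data.frame(
  name   = c("Ana", "Ben", "Cy"),
  age    = c(34, 28, 45),
  remote = c(TRUE, FALSE, TRUE)
)

str(staff)                       # column names and types
print(nrow(staff))               # number of records: 3
print(staff$age)                 # one column, returned as a vector
print(staff[staff$age > 30, ])   # only the rows where age is above 30
```

Each column is itself a vector, which is why everything from the vectors section carries over.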

Seeing is Believing

corndata: https://www.dropbox.com/s/1a9fb3jt40imtdv/corndata.csv?dl=1
parking tickets: https://www.dropbox.com/s/remmpqrjjgq7ie1/verysmallparking.csv?dl=1

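In RStudio, use the "Import Dataset" > "From Text (readr)" option. Name the imported dataframes corn and parking, since the code below refers to them by those names.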
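
The same import can also be done in code. Below is a rough sketch using base R's read.csv() on a tiny made-up file, so that it runs offline; the file contents and the state column are assumptions, and only a value column is relied on later in this tutorial.

```r
# Write a tiny made-up CSV to a temporary file so the sketch runs offline.
path <- tempfile(fileext = ".csv")
writeLines(c("state,value", "A,10", "B,40", "C,20"), path)

# read.csv() takes a file path (or a URL) and returns a dataframe.
corn_demo <- read.csv(path)
print(corn_demo)
print(sum(corn_demo$value))  # 70
```

Passing one of the URLs above in place of path should work the same way, given a network connection.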
 

Lorenz Curves in Corn

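A Lorenz curve plots the cumulative share of a total (here, corn) against the cumulative share of observations, sorted from smallest to largest. To make this recipe, you will need: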

> s_corn <- corn[order(corn$value),]
> total_corn <- sum(corn$value)
> cumsum(s_corn$value) #This is the cumulative sum of total corn produced
> nrow(s_corn)


Try it yourself first. In case you get stuck, the answer is right below this line.
plot((1:nrow(corn))/nrow(corn), cumsum(s_corn$value)/sum(corn$value))
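
To see what the plot is doing, here is the same calculation on four made-up numbers, small enough to check by hand. The vector and variable names are hypothetical; the dashed diagonal is the line of perfect equality that a Lorenz curve is usually compared against.

```r
value <- c(10, 40, 20, 30)                       # made-up production figures
sorted <- sort(value)                            # smallest to largest

pop_share <- seq_along(sorted) / length(sorted)  # share of observations so far
cum_share <- cumsum(sorted) / sum(sorted)        # share of the total so far
print(cum_share)                                 # 0.1 0.3 0.6 1.0

plot(pop_share, cum_share, type = "b",
     xlab = "Share of observations", ylab = "Share of total")
abline(0, 1, lty = 2)                            # line of perfect equality
```

Here the smallest half of the observations accounts for only 30% of the total, so the curve sags below the diagonal.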


Parking Tickets on the Weekend

I have a theory that there are more parking tickets issued on the weekend. Cops are angry because they could be at home sippin brewskies, people aren't at home or work, etc. Let's see if the data backs me up.

You'll notice that the data is disaggregated, which is useful for everyone except your boss. Seriously. Once data is aggregated or rolled up, you lose a lot of the small, important components that made up the big number. For example, I spent all summer taking tables of information that companies disclose and trying to disaggregate their reported numbers into their component parts.

Serious business advice: always store your individual-level, disaggregated data. You can roll it up later if need be.

But on the other hand, a good chunk of report-building is being able to roll up the disaggregated data into something useful, and that's what we're going to do now.

Some of the most common aggregation functions are:
  • COUNT
  • SUM
  • AVERAGE
  • MIN
  • MAX
Of course there's also SD and QUARTILES and other useless stuff, but the big 5 are the most common.
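
In R, the big 5 map onto base functions. A minimal sketch on made-up numbers (the fines and zone vectors are hypothetical, not taken from the parking data):

```r
# Hypothetical fines (in dollars), each tagged with the zone it was issued in.
fines <- c(25, 40, 25, 60, 40)
zone  <- c("north", "north", "south", "south", "south")

print(length(fines))  # COUNT   -> 5
print(sum(fines))     # SUM     -> 190
print(mean(fines))    # AVERAGE -> 38
print(min(fines))     # MIN     -> 25
print(max(fines))     # MAX     -> 60

# The same functions can be applied per group with tapply().
per_zone <- tapply(fines, zone, sum)
print(per_zone)       # north 65, south 125
```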

Right now we're looking at COUNT. Specifically, COUNTING the number of tickets each day.

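> plot(table(parking$date)) #quick look: number of tickets per date
> parkagg <- data.frame(table(parking$date)) #one row per date, with the count in Freq
> names(parkagg)[1] <- "date"
> parkagg$date <- as.Date(parkagg$date)
> parkagg$day <- weekdays(parkagg$date)
> colors <- c("grey", "grey", "grey", "grey", "red", "red", "grey") #one color per day of a 7-day cycle, red for the weekend
> colors <- rep(colors, 9)[1:62] #repeat the cycle and trim it to 62 bars
> barplot(parkagg$Freq, col=colors)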
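
The hand-built color vector above depends on knowing which weekday the data start on. Here is a sketch of an alternative that derives the colors from the dates themselves, using made-up dates rather than the real parking data. It relies on base R's as.POSIXlt(), whose wday field numbers the days from 0 (Sunday) to 6 (Saturday); all variable names are hypothetical.

```r
# Made-up ticket dates; in the tutorial these would come from parking$date.
dates <- as.Date(c("2018-09-01", "2018-09-01", "2018-09-02",
                   "2018-09-03", "2018-09-03", "2018-09-03",
                   "2018-09-04"))

# COUNT the tickets per date, as in the transcript above.
counts <- data.frame(table(dates))
names(counts)[1] <- "date"
counts$date <- as.Date(as.character(counts$date))

# Flag Saturdays (6) and Sundays (0) and color those bars red.
wday <- as.POSIXlt(counts$date)$wday
is_weekend <- wday %in% c(0, 6)
bar_colors <- ifelse(is_weekend, "red", "grey")
barplot(counts$Freq, names.arg = as.character(counts$date), col = bar_colors)

# AVERAGE tickets per day, weekend (TRUE) versus weekday (FALSE).
avg_per_day <- tapply(counts$Freq, is_weekend, mean)
print(avg_per_day)
```

Comparing the two averages gives a number to go with the picture, which makes it easier to tell whether the weekend theory holds.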


I was about as wrong as you can possibly be. Thanks for coming and I hope you come again.