Reproducing a spurious correlation

[share]

This is quick and dirty code that tries to reproduce the analysis and graph from the post: “Per capita consumption of mozzarella cheese (US) correlates with Civil engineering doctorates awarded (US)”

Get the raw data
 1 # get the data from the article
 2 library(XML)
 3 url <- "http://www.tylervigen.com/view_correlation?id=3890"
 4 tables <- readHTMLTable(url)
 5 # we need table 3, minus a couple of extra columns
 6 tab3 <- tables[[3]][1:2,-c(1,12)]
 7 years <- as.numeric(colnames(tab3))
 8 # per capita consumption of mozzarella
 9 cheese <- as.numeric(t(tab3)[,1])
10 # civil engineering doctorates
11 engdoc <-as.numeric(t(tab3)[,2])

Now we will reproduce the plot

Making the plot
1 # code stolen+adapted from http://robjhyndman.com/hyndsight/r-graph-with-two-y-axes/
2 par(mar=c(5,4,4,5)+.1)
3 plot(years, cheese, type="b", lty="dashed", col="blue")
4 par(new=TRUE)
5 plot(years, engdoc, type="b", col="red",xaxt="n",yaxt="n",xlab="",ylab="")
6 axis(4)
7 mtext("engdoc",side=4,line=3)
8 legend("topleft",col=c("blue","red"),lty=1,legend=c("cheese","engdoc"))

Per capita consumption of mozzarella cheese (US) correlates with Civil engineering doctorates awarded (US)

And calculate the correlation (which happens to be high)

# calculate the correlation
cor(cheese, engdoc)
## [1] 0.9586

This bit was made to answer a question from a friend (Hi César!), and originally was published as a gist: https://gist.github.com/jmcastagnetto/59c26c2624e621a13582