Much of this blog will be focused on NHL data, as I mentioned in the opening post. There are two main sources of data about NHL hockey: Hockey-Reference.com and the R package nhlscrapr. In this post we’ll focus on getting score data from Hockey-Reference, from the start of the NHL to the present day.
A typical season’s results page on Hockey-Reference has this URL format (http://www.hockey-reference.com/leagues/NHL_2016_games.html), and looks like this:
Copy-pasting this to import it would be tedious, particularly with 100 years of NHL hockey approaching. That’s too much data to handle manually, so we’ll write some scraping code to get it for us.
For this, we’ll use some of R’s built-in tools, as well as the XML library. This library lets us parse the HTML pages automatically, and we’ll save the results to .csv files for later.
We can read the page (with the XML library loaded) with the readHTMLTable function. This finds all the table chunks, of which there are typically two on these pages, and stores them. An alternative would be to read the whole HTML file with htmlParse and then extract the tables, but the one-step approach is easier.
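Loading the library and pointing readHTMLTable at the 2015-16 page looks something like this (a minimal sketch; if the plain-http URL stops resolving, downloading the page first with download.file() and parsing the local copy works the same way):

library(XML)

# Read every HTML table on the season's results page
url <- "http://www.hockey-reference.com/leagues/NHL_2016_games.html"
tables <- readHTMLTable(url, stringsAsFactors = FALSE)

# There are typically two tables: the regular season and the playoffs
names(tables)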
We have two tables here, namely tables['games'] and tables['games_playoff']. These contain the data we wanted:
We can save this to a .csv file very simply with write.csv(tables['games'], file = '2016games.csv'). But that only gets us one season.
We’ll set it up in a for-loop, to be able to get a few more seasons at once, say from 2010 to 2015. I’ve split the regular season and playoffs apart before saving, for easy reference. I’ve also added Sys.sleep(20), as it’s polite not to slam the server with requests and to let other traffic through.
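The loop might look something like this (a sketch; the file names, built from the season’s starting and ending years, are just a convention):

for (i in 2010:2015) {
  url <- paste0("http://www.hockey-reference.com/leagues/NHL_", i, "_games.html")
  tables <- readHTMLTable(url, stringsAsFactors = FALSE)

  # Split the regular season and playoffs apart before saving
  write.csv(tables[['games']], file = paste0("./", i - 1, i, ".csv"))
  write.csv(tables[['games_playoff']], file = paste0("./", i - 1, i, "Playoffs.csv"))

  # Be polite: pause between requests
  Sys.sleep(20)
}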
A few things need to be taken care of here. What if we aren’t yet into the playoffs of the most recent season available (currently, the 2017 season)? That portion would create an empty file, so let’s guard against it:
if (!is.null(playoff)) {write.csv(playoff, file = paste0("./", i - 1, i, "Playoffs.csv"))}
As well, what if we try to collect the season data for 2005? Give it a shot here. There was no hockey that year, due to the lockout, so we need to skip 2005 in the for-loop:
if (i == 2005) {next}
Now, what if the website is down, or something else goes wrong while parsing the data? We should wrap our readHTMLTable call in a tryCatch, and our parsing and saving in a null check.
I’ll also add some messages to the code, since we can be waiting quite a while between starting the loop and getting all of the data out. By using the message("Message", "\r", appendLF = FALSE) format, each message appears on the same line, overwritten by the next one; the final ‘waiting’ message returns a new line afterwards.
Putting this all together:
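Something along these lines (a sketch with the tryCatch, null checks, skipped season, and messages described above):

for (i in 2010:2017) {
  # Skip the cancelled 2004-05 season
  if (i == 2005) {next}

  message("Getting season ending in ", i, "\r", appendLF = FALSE)
  url <- paste0("http://www.hockey-reference.com/leagues/NHL_", i, "_games.html")

  # If the site is down or parsing fails, get NULL back instead of an error
  tables <- tryCatch(readHTMLTable(url, stringsAsFactors = FALSE),
                     error = function(e) NULL)

  if (!is.null(tables)) {
    if (!is.null(tables[['games']])) {
      write.csv(tables[['games']], file = paste0("./", i - 1, i, ".csv"))
    }
    # The playoff table doesn't exist until the playoffs start
    if (!is.null(tables[['games_playoff']])) {
      write.csv(tables[['games_playoff']], file = paste0("./", i - 1, i, "Playoffs.csv"))
    }
  }

  message("Waiting", "\r", appendLF = FALSE)
  Sys.sleep(20)
}
message("Done")  # finish with a normal message so the prompt returns to a fresh line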
We’ve almost got our scraper ready, but it would be nice to wrap it all up in a simple function call. Robust functions have input checks, and this one will need a few:
a) We want start to be less than end,
b) We don’t want to start before the 1917-1918 season,
c) We don’t want to start in the future (past 2016-2017),
d) We can’t end past the future (again, 2016-2017),
e) We can’t start and/or end in 2005 (no season).
Wrapping it all up:
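Here’s a sketch of how that function could look, with the checks a) through e) up front (the default arguments and exact error messages are just one way to do it):

getAndSaveNHLGames <- function(start = 1918, end = 2017) {
  # Input checks a) through e)
  if (start >= end) stop("start should be less than end")
  if (start < 1918) stop("there are no NHL seasons before 1917-1918")
  if (start > 2017) stop("start can't be in the future")
  if (end > 2017) stop("end can't be in the future")
  if (start == 2005 | end == 2005) stop("there was no 2005 season (lockout)")

  for (i in start:end) {
    if (i == 2005) {next}  # also skip 2005 when it falls inside the range

    message("Getting season ending in ", i, "\r", appendLF = FALSE)
    url <- paste0("http://www.hockey-reference.com/leagues/NHL_", i, "_games.html")

    tables <- tryCatch(readHTMLTable(url, stringsAsFactors = FALSE),
                       error = function(e) NULL)

    if (!is.null(tables)) {
      if (!is.null(tables[['games']])) {
        write.csv(tables[['games']], file = paste0("./", i - 1, i, ".csv"))
      }
      if (!is.null(tables[['games_playoff']])) {
        write.csv(tables[['games_playoff']], file = paste0("./", i - 1, i, "Playoffs.csv"))
      }
    }

    message("Waiting", "\r", appendLF = FALSE)
    Sys.sleep(20)
  }
  message("Done")
}

Calling getAndSaveNHLGames(1918, 2017) then fetches every NHL season to date, one page at a time.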
Whoa, I hear you saying you might want to see the WHA scores too, since the two leagues merged in the late ’70s? Sure!
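The WHA ran from the 1972-73 season through 1978-79, and assuming Hockey-Reference serves its results pages with the same URL pattern (e.g. .../leagues/WHA_1975_games.html, which is worth double-checking), the function is nearly identical:

getAndSaveWHAGames <- function(start = 1973, end = 1979) {
  if (start >= end) stop("start should be less than end")
  if (start < 1973 | end > 1979) stop("the WHA only ran from 1972-73 to 1978-79")

  for (i in start:end) {
    message("Getting WHA season ending in ", i, "\r", appendLF = FALSE)
    # Assumed URL pattern, mirroring the NHL pages
    url <- paste0("http://www.hockey-reference.com/leagues/WHA_", i, "_games.html")

    tables <- tryCatch(readHTMLTable(url, stringsAsFactors = FALSE),
                       error = function(e) NULL)

    if (!is.null(tables)) {
      if (!is.null(tables[['games']])) {
        write.csv(tables[['games']], file = paste0("./WHA", i - 1, i, ".csv"))
      }
      if (!is.null(tables[['games_playoff']])) {
        write.csv(tables[['games_playoff']], file = paste0("./WHA", i - 1, i, "Playoffs.csv"))
      }
    }

    message("Waiting", "\r", appendLF = FALSE)
    Sys.sleep(20)
  }
  message("Done")
}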
So now, after you call getAndSaveNHLGames() and getAndSaveWHAGames() and wait a while, you’ve got a full set of score data for NHL and WHA games. Cool! I’ll look at cleaning some of this data up for easier use in a later post.