As interesting as it is to predict how teams will do on a team-by-team basis from their past performance, it would be great to get finer granularity and dig into what happened in each game. This data is available online from the NHL website. Manually downloading it all would be horrendous (there are 1230 games each year, plus playoffs). Fortunately, a package on CRAN, nhlscrapr, exists to help with this.
(un?)Fortunately, the authors of nhlscrapr got jobs with NHL teams and gave up working on the project. Others have written update patches, such as Jack Davis at the Musings on Statistics et.al. blog. You can download the patch from his site and apply it as directed. I’ve made a few extra adjustments to get 2015-2016 to process properly, and these can be found on my GitHub.
With the library and patch loaded, we can start getting all of the data from the NHL site.
We start by defining the database of all the games available:
We use the extra.seasons parameter because the original code was written to cover up to 2014-2015. By calling in some extra.seasons, we can get the game data for later seasons.
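As a rough sketch of that call: the snippet below works out how many seasons past 2014-2015 we need and, only if nhlscrapr happens to be installed, asks it for the game table. The helper variables `target_season` and `extra` are ours, and treating `extra.seasons` as a count of seasons after 2014-2015 is an assumption based on the description above, so check it against your patched copy.

```r
# Target season in the "YYYYYYYY" form used by the output file name.
target_season <- "20152016"

# Assumption: the unpatched game table ends with the season finishing in 2015.
last_covered_end_year <- 2015L
extra <- as.integer(substr(target_season, 5, 8)) - last_covered_end_year

# Guarded so the snippet is harmless when nhlscrapr is not installed.
if (requireNamespace("nhlscrapr", quietly = TRUE)) {
  all_games <- nhlscrapr::full.game.database(extra.seasons = extra)
  print(table(all_games$season))  # number of games listed per season
}
```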
Because the data is quite large, and downloading takes a long time (even with a short delay between games), it’s common to work on one season at a time.
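Picking out one season is just a subset of the game table. Here is a minimal sketch on a made-up table; the column names `season` and `gcode` and all of the values are placeholders for illustration, not real output.

```r
# Toy stand-in for the game table (hypothetical values).
games <- data.frame(
  season = c("20142015", "20142015", "20152016", "20152016", "20152016"),
  gcode  = c("20001", "20002", "20001", "20002", "20003"),
  stringsAsFactors = FALSE
)

# Keep only the games from the season we want to download and process.
season_games <- subset(games, season == "20152016")
nrow(season_games)
```

The same subset can then be handed to whatever download and processing step you run, so each season lands in its own file.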
Once this has finished running (about 2 hours), there are two files: nhlscrapr-20152016.RData and nhlscrapr-core.RData. The first stores the play-by-play data, and the second is a legend for player identifications.
We’ll load the data into the environment and play with them.
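One thing to watch for with .RData files is that `load()` restores objects under the names they were saved with, which you may not know in advance. A small sketch of one way to handle that, using a throwaway file in a temp directory rather than the real data (the object `toy_events` and its columns are invented for the example):

```r
# Save a toy object so the example is self-contained.
toy_path <- file.path(tempdir(), "toy-season.RData")
toy_events <- data.frame(
  seconds = c(12, 47),
  etype   = c("FAC", "SHOT"),
  stringsAsFactors = FALSE
)
save(toy_events, file = toy_path)

# Load into a separate environment; load() returns the restored object names.
loaded_env <- new.env()
loaded_names <- load(toy_path, envir = loaded_env)
loaded_names

# Fetch the first restored object by name and inspect it.
str(loaded_env[[loaded_names[1]]])
```

Loading into its own environment keeps the restored objects from silently overwriting anything already in your workspace.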
Let’s look at the data to see what we have:
Each player in the roster is listed with an index number (index 1 is a ‘no player’ cheat index). The number of appearances each player has made at each position is given too (pC, pL, pR, pD, pG).
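Those position counts make it easy to guess a player's usual position by taking the column with the most appearances. A minimal sketch on a made-up roster (the pC to pG column names come from the data described above; the counts, the `index` column name, and the `primary` column are ours):

```r
# Toy roster: row 1 mimics the 'no player' entry with no appearances.
roster <- data.frame(
  index = 1:4,
  pC = c(0, 60, 0, 0),
  pL = c(0, 5, 0, 0),
  pR = c(0, 0, 0, 0),
  pD = c(0, 0, 70, 0),
  pG = c(0, 0, 0, 40)
)

pos_cols <- c("pC", "pL", "pR", "pD", "pG")
counts <- as.matrix(roster[pos_cols])

# Most frequent position per row, with the leading "p" stripped.
roster$primary <- sub("^p", "", pos_cols[max.col(counts, ties.method = "first")])

# Rows with no appearances at all get NA rather than a misleading "C".
roster$primary[rowSums(counts) == 0] <- NA
roster
```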
The index makes an appearance in the play_by_play data as well.
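To turn those indices back into names, we can look them up in the roster. The sketch below uses invented data throughout; the column names `firstlast`, `etype`, and `ev.player.1` are assumptions for illustration and may not match the real tables exactly.

```r
# Toy roster; index 1 is the 'no player' placeholder.
roster <- data.frame(
  index     = 1:4,
  firstlast = c("", "PLAYER A", "PLAYER B", "GOALIE C"),
  stringsAsFactors = FALSE
)

# Toy events that refer to players by roster index.
events <- data.frame(
  etype       = c("SHOT", "GOAL", "FAC"),
  ev.player.1 = c(2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Match each event's player index against the roster to get a name.
events$player_name <- roster$firstlast[match(events$ev.player.1, roster$index)]

# Treat the placeholder index as missing instead of an empty name.
events$player_name[events$ev.player.1 == 1L] <- NA
events
```

Using `match()` rather than indexing by row position means the lookup still works if the roster is ever reordered or filtered.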
There’s lots of info here: when the event happened, what kind of event it was, who was on the ice (by player index), which team ‘won’ the event, which players it happened to, the distance to the net, the location on the ice, the score at that time, who was in net, how many players were on the ice…
Soon we’ll go through some real data analysis as we try to get an idea of all we can pull from this.