Having downloaded data from Hockey-Reference.com in the last post, we’ll now want to prepare it for analysis. This will involve combining all of the files into one dataset and doing some cleaning. Depending on our planned usage, we may wish to alter team names to provide continuity for teams that moved (think Quebec Nordiques to Colorado Avalanche), or to separate franchises that have shared a name at different times (think of the Winnipeg Jets, or the Ottawa Senators).
We’ll start by combining all the data into one data frame. I’m using data from both the NHL and WHA, but fortunately the files arrive quite uniform, so we can simply stack the data frames we get from read.csv on top of one another.
Assuming the data is in the ./_data/ folder (that’s where mine is), a short piece of code will do it all. While there are lots of ways to do this, one that doesn’t use any additional libraries is here:
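A minimal base-R sketch of such a function, matching the description in the next paragraph. The function name `read_hockey_data` and the `.csv` file pattern are placeholders for illustration, and the sketch assumes every file has the same columns in the same order.

```r
# Hypothetical helper: read every CSV in `path` and stack the results.
# Assumes all files share identical column names and order.
read_hockey_data <- function(path = "./_data/") {
  # Full paths of every .csv file in the folder
  files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  # One data.frame per file, collected in a list
  d <- lapply(files, read.csv)
  # Equivalent to rbind(d[[1]], d[[2]], ...): one combined data.frame
  do.call(rbind, d)
}
```

Calling `read_hockey_data()` with no arguments would read from ./_data/; passing another folder reads from there instead.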
In this function, the files are discovered by the list.files call on the provided or default path. The files are read sequentially by an lapply call on read.csv, producing a list of data.frames. The do.call function is a built-in that calls rbind with every element of the provided list d as its arguments, resulting in one data.frame.
Looking at the head of the data, we see there are some columns that need better names, and a deeper look suggests that a few can likely be dropped. The types are mixed up too: Date should hold dates instead of factors, and some integer columns are stored as strings.
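The type fixes can be sketched on a toy data frame. The column names and values below are invented for illustration and are not the real Hockey-Reference columns; the date format is also an assumption.

```r
# Toy frame with made-up columns, standing in for the combined data.
games <- data.frame(
  Date = factor(c("1995-10-06", "1995-10-07")),
  G    = c("3", "5"),
  stringsAsFactors = FALSE
)

str(games)  # inspect the column types before converting

# Going through as.character() is the safe route for factors:
# as.integer() applied directly to a factor returns its level codes.
games$Date <- as.Date(as.character(games$Date), format = "%Y-%m-%d")
games$G    <- as.integer(games$G)
```

Running `str(games)` again afterwards is a quick way to confirm the columns now have the intended types.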
A few times in the past, teams have moved, merged, changed names, or otherwise changed identities in the way the score was kept. Searching through the unique names and some deep Wikipedia diving helped me create the following ‘team key’:
Team Movement (alphabetical by first appearance)
[1] Alberta Oilers --> Edmonton Oilers
[2] Mighty Ducks of Anaheim --> Anaheim Ducks
*[3] Winnipeg Jets (1972-1996) --> Phoenix Coyotes --> Arizona Coyotes
[4] Atlanta Flames --> Calgary Flames
[5] Atlanta Thrashers --> Winnipeg Jets
[6] Ottawa Nationals --> Toronto Toros --> Birmingham Bulls
[7] Boston Bruins
[8] Quebec Athletic Club/Bulldogs --> Hamilton Tigers --> New York Americans --> Brooklyn Americans
[9] Buffalo Sabres
[10] Philadelphia Blazers --> Vancouver Blazers --> Calgary Cowboys
[11] Oakland Seals --> California Golden Seals --> Cleveland Barons (merged with Minnesota North Stars in 1978)
[12] New England Whalers --> Hartford Whalers --> Carolina Hurricanes
[13] Chicago Black Hawks --> Chicago Blackhawks
[14] Chicago Cougars
[15] Cincinnati Stingers
*[16] Cleveland Crusaders --> Minnesota Fighting Saints (1976-1977)
[17] Quebec Nordiques --> Colorado Avalanche
[18] Kansas City Scouts --> Colorado Rockies --> New Jersey Devils
[19] Columbus Blue Jackets
*[20] Minnesota North Stars (merged with Cleveland Barons in 1978) --> Dallas Stars
[21] Denver Spurs/Ottawa Civics
[22] Detroit Cougars --> Detroit Falcons --> Detroit Red Wings
[23] Houston Aeros
[24] Indianapolis Racers
[25] Los Angeles Kings
[26] Los Angeles Sharks --> Michigan Stags/Baltimore Blades
*[27] Minnesota Fighting Saints (1972-1976)
[28] Minnesota Wild
[29] Montreal Canadiens
[30] Montreal Maroons
[31] Montreal Wanderers
[32] Nashville Predators
[33] New York Raiders --> New York Golden Blades/New Jersey Knights --> San Diego Mariners
[34] New York Islanders
[35] New York Rangers
*[36] Ottawa Senators (historical 1883-1934) --> St. Louis Eagles
[37] Ottawa Senators
[38] Philadelphia Flyers
[39] Pittsburgh Pirates --> Philadelphia Quakers
[40] Phoenix Roadrunners
[41] Pittsburgh Penguins
[42] San Jose Sharks
[43] St. Louis Blues
[44] Tampa Bay Lightning
[45] Toronto Arenas --> Toronto St. Patricks --> Toronto Maple Leafs
[46] Vancouver Canucks
[47] Washington Capitals
If you look carefully, you’ll find a few international teams making appearances in the league games:
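One way to surface such names is to compare every team name in the data against the names already covered by the team key. This is only a sketch: the `Visitor` and `Home` column names, the `known_teams` vector, and the team names in the toy data are placeholders.

```r
# `known_teams` stands in for the names listed in the team key.
known_teams <- c("Boston Bruins", "Houston Aeros")

# Toy games with one invented name that is not in the key.
games <- data.frame(
  Visitor = c("Boston Bruins", "Example Touring Team"),
  Home    = c("Houston Aeros", "Boston Bruins"),
  stringsAsFactors = FALSE
)

# Every name appearing in either column, minus the ones accounted for.
all_names <- sort(unique(c(games$Visitor, games$Home)))
unmatched <- setdiff(all_names, known_teams)
```

Whatever is left in `unmatched` is a candidate for a closer look, and possibly for filtering out.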
And, looking carefully, you’ll find games that were cancelled or have not yet been played:
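If those games show up as rows with missing scores (an assumption about how they appear in the data), they can be picked out with `is.na`. The column names here are placeholders.

```r
# Toy frame: the middle game has no score recorded.
games <- data.frame(
  Date         = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")),
  VisitorGoals = c(2L, NA, 4L),
  HomeGoals    = c(3L, NA, 4L)
)

# TRUE for any row where either score is missing.
unplayed <- is.na(games$VisitorGoals) | is.na(games$HomeGoals)

games[unplayed, ]             # inspect the affected rows
played <- games[!unplayed, ]  # keep only games with a result
```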
There are some usages where explicit Winner or Loser columns are ideal, or a boolean ‘Tie’ flag. For both of these I’m thinking of the EloRating package, which I’ll talk about later.
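A sketch of deriving those columns from the scores, using invented column names and single-letter team names. For tied games the Winner and Loser assignment is arbitrary, which is why the ‘Tie’ flag is kept alongside them.

```r
# Toy results; columns and teams are placeholders.
games <- data.frame(
  Visitor      = c("A", "B", "C"),
  Home         = c("B", "C", "A"),
  VisitorGoals = c(2L, 4L, 1L),
  HomeGoals    = c(3L, 4L, 0L),
  stringsAsFactors = FALSE
)

# TRUE where the scores are level.
games$Tie <- games$VisitorGoals == games$HomeGoals

# For ties home_won is FALSE, so the visitor lands in Winner by convention only.
home_won <- games$HomeGoals > games$VisitorGoals
games$Winner <- ifelse(home_won, games$Home, games$Visitor)
games$Loser  <- ifelse(home_won, games$Visitor, games$Home)
```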
Putting all of our requirements into one function leaves us with this. I’ve chosen to place each team substitution in a vector as a pair, then iterate over the data frame to make the substitution. There are a few manual switches with date filters, made to avoid collisions between the old and new versions of teams.
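The substitution idea can be sketched as follows. This is not the full cleaning function: the helper names, the column names, and the cutoff date are placeholders, the pairs are held in a list rather than one flat vector, and only two pairs from the team key are shown.

```r
# Each pair is c(old name, replacement name).
team_pairs <- list(
  c("Quebec Nordiques", "Colorado Avalanche"),
  c("Atlanta Flames", "Calgary Flames")
)

# Apply every pair to both team columns.
rename_teams <- function(games, pairs) {
  for (p in pairs) {
    games$Home[games$Home == p[1]] <- p[2]
    games$Visitor[games$Visitor == p[1]] <- p[2]
  }
  games
}

# Date-filtered switch: relabel a reused name only for games before `cutoff`.
relabel_before <- function(games, name, cutoff, new_name) {
  old <- games$Date < cutoff
  games$Home[old & games$Home == name] <- new_name
  games$Visitor[old & games$Visitor == name] <- new_name
  games
}

# Toy data; team columns must be character, not factor.
games <- data.frame(
  Date    = as.Date(c("1990-01-01", "2015-01-01")),
  Visitor = c("Quebec Nordiques", "Winnipeg Jets"),
  Home    = c("Winnipeg Jets", "Calgary Flames"),
  stringsAsFactors = FALSE
)

# Placeholder cutoff: any date between the two Jets eras would do.
games <- relabel_before(games, "Winnipeg Jets", as.Date("2000-01-01"),
                        "Winnipeg Jets (1972-1996)")
games <- rename_teams(games, team_pairs)
```

Running the date-filtered switches before the general substitutions keeps a reused name from being swept up by a pair meant for the other franchise.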
Using this function, we see the new output:
We’ll stash this cleaned data frame back in the ./_data/ folder, to make it easier to use in the future.
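One option for the stash is an RDS file, which keeps column types such as Date intact so they don’t need converting again on load. The helper name and file name below are placeholders, and RDS is only one possible format.

```r
# Hypothetical helper: write a data.frame to <dir>/<name> as RDS.
save_clean <- function(df, dir = "./_data", name = "hockey_clean.rds") {
  # Create the folder if it does not exist yet
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, name)
  saveRDS(df, path)
  invisible(path)  # return the path without printing it
}
```

Reading the data back later is then a single call to `readRDS` on the saved path.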