Cleaning Hockey-Reference Data

Having downloaded data from Hockey-Reference.com in the last post, we’ll now want to prepare it for analysis. This will involve combining all of the files into one dataset, and doing some cleaning. Depending on our planned usage, we may wish to alter team names to provide continuity for moved teams (think Quebec Nordiques to Colorado Avalanche), or to isolate teams that have existed a few times (think about the Winnipeg Jets, or the Ottawa Senators).

We’ll start with combining all the data to one frame. I’m using data from both the NHL and WHA, but fortunately, the data arrives quite uniform. We can actually just merge the data frames we get from read.csv together.

Assuming the data is in the ./_data/ folder (that’s where mine is), a short set of code will do it all. While there’s lots of ways to do this, one that doesn’t use any additional libraries is here:

fileMerge <- function(pathToFiles = ".") {
    f <- list.files(path = pathToFiles, full.names = TRUE, pattern = "*.csv")
    d <- lapply(f, function(x) read.csv(file = x))
    unifiedData <- do.call(rbind, d)
    return(unifiedData)
}

allHockeyData <- fileMerge("./_data/")

In this function, the files are discovered by the list.files call to the provided or default path. The files are read sequentially by lapply call on read.csv, producing a list of data.frames The do.call function is a built-in that iterates through the provided list d, and applies rbind across each item, resulting in one data.frame.

head(allHockeyData)
##   X       Date            Visitor  G               Home G.1 X.1 Att.  LOG
## 1 1 1917-12-19     Toronto Arenas  9 Montreal Wanderers  10     <NA> <NA>
## 2 2 1917-12-19 Montreal Canadiens  7    Ottawa Senators   4     <NA> <NA>
## 3 3 1917-12-22 Montreal Canadiens 11 Montreal Wanderers   2     <NA> <NA>
## 4 4 1917-12-22    Ottawa Senators  4     Toronto Arenas  11     <NA> <NA>
## 5 5 1917-12-26    Ottawa Senators  6 Montreal Wanderers   3     <NA> <NA>
## 6 6 1917-12-26 Montreal Canadiens  5     Toronto Arenas   7     <NA> <NA>
##   Notes
## 1      
## 2      
## 3      
## 4      
## 5      
## 6

Looking at the head of the data, we see there’s some columns that need better names, and a deeper look at the data says that there are a few that likely can be dropped. There’s a mix up in types too, Date should be dates, instead of factors, and some integer lines are as strings.

A few times in the past teams have moved, merged, changed names, or otherwise changed identities in the way the score was kept. Searching through the unique names and deep wikipedia diving have helped me create the following ‘team key’:

Teams Movement (Alphabetical by first appearance)
 
  [1] Alberta Oilers --> Edmonton Oilers 
  [2] Mighty Ducks of Anaheim --> Anaheim Ducks 
* [3] Winnipeg Jets (1972-1996) --> Phoenix Coyotes --> Arizona Coyotes 
  [4] Atlanta Flames --> Calgary Flames 
  [5] Atlanta Thrashers --> Winnipeg Jets 
  [6] Toronto Toros --> Ottawa Nationals --> Birmingham Bulls
  [7] Boston Bruins 
  [8] Quebec Athletic Club/Bulldogs --> Hamilton Tigers --> New York Americans --> Brooklyn Americans
  [9] Buffalo Sabres
 [10] Philadelphia Blazers --> Vancouver Blazers --> Calgary Cowboys
 [11] Oakland Seals --> California Golden Seals --> Cleveland Barons (merged with Minnesota North Stars in 1978)
 [12] New England Whalers --> Hartford Whalers --> Carolina Hurricanes
 [13] Chicago Black Hawks --> Chicago Blackhawks 
 [14] Chicago Cougars
 [15] Cincinnati Stingers
*[16] Cleveland Crusaders --> Minnesota Fighting Saints (1976-1977)
 [17] Quebec Nordiques --> Colorado Avalanche 
 [18] Kansas City Scouts --> Colorado Rockies --> New Jersey Devils 
 [19] Columbus Blue Jackets 
*[20] Minnesota North Stars (merged with Cleveland Barons in 1978) --> Dallas Stars 
 [21] Denver Spurs/Ottawa Civics               
 [22] Detroit Cougars --> Detroit Falcons --> Detroit Red Wings 
 [23] Houston Aeros                            
 [24] Indianapolis Racers
 [25] Los Angeles Kings 
 [26] Los Angeles Sharks --> Michigan Stags/Baltimore Blades
*[27] Minnesota Fighting Saints (1972-1976)                
 [28] Minnesota Wild 
 [29] Montreal Canadiens                       
 [30] Montreal Maroons                         
 [31] Montreal Wanderers                       
 [32] Nashville Predators 
 [33] New York Raiders --> New York Golden Blades/New Jersey Knights --> San Diego Mariners
 [34] New York Islanders                       
 [35] New York Rangers                         
*[36] Ottawa Senators (historical 1883-1934) --> St. Louis Eagles
 [37] Ottawa Senators 
 [38] Philadelphia Flyers 
 [39] Pittsburgh Pirates --> Philadelphia Quakers 
 [40] Phoenix Roadrunners                      
 [41] Pittsburgh Penguins 
 [42] San Jose Sharks 
 [43] St. Louis Blues 
 [44] Tampa Bay Lightning 
 [45] Toronto Arenas --> Toronto St. Patricks --> Toronto Maple Leafs 
 [46] Vancouver Canucks 
 [47] Washington Capitals 

If you look carefully, you’ll find a few international teams making appearances in the league games:

tail(unique(allHockeyData$Visitor), 3)
## [1] Czechoslovakia   Soviet All-Stars Finland         
## 80 Levels: Montreal Canadiens Montreal Wanderers ... Finland

And, looking carfully, you’ll find games cancelled or not yet played:

head(allHockeyData[is.na(allHockeyData$G.1), ])
##         X       Date               Visitor  G                Home G.1  X.1
## 53856 648 2014-01-07   Carolina Hurricanes NA      Buffalo Sabres  NA     
## 53962 754 2014-01-21   Carolina Hurricanes NA Philadelphia Flyers  NA     
## 53978 770 2014-01-24       Ottawa Senators NA Carolina Hurricanes  NA     
## 54182 974 2014-03-10 Columbus Blue Jackets NA        Dallas Stars  NA     
## 57176   1 2016-10-12       St. Louis Blues NA  Chicago Blackhawks  NA <NA>
## 57177   2 2016-10-12        Calgary Flames NA     Edmonton Oilers  NA <NA>
##       Att.  LOG
## 53856 <NA> <NA>
## 53962 <NA> <NA>
## 53978 <NA> <NA>
## 54182 <NA> <NA>
## 57176 <NA> <NA>
## 57177 <NA> <NA>
##                                                              Notes
## 53856                   Postponed due to blizzard until 2014-02-25
## 53962                  Postponed due to snowstorm until 2014-01-22
## 53978 Rescheduled to 2014-01-25 due to rescheduled 2014-01-22 game
## 54182          Game postponed until 2014-04-09 (Medical Emergency)
## 57176                                                         <NA>
## 57177                                                         <NA>

There are some usages where explicit Winner or Loser columns are ideal, or a boolean ‘Tie’ flag. For both of these I’m thinking of the EloRating package, which I’ll talk about later.

Putting all of our requirements into one function leaves us with this. I’ve chosen to place each team substitution in a vector as a pair, then iterate over the data frame to make the substitution. There’s a few manual switches with date filters made to avoid collisions between the old and new versions of teams.

cleanHockeyData <- function(hockeyData, cleanTeams = TRUE, identifyTies = TRUE, 
    listWinnersLosers = TRUE, removeInternational = TRUE) {
    teamReplace <- list(list("Alberta Oilers", "Edmonton Oilers"), list("Mighty Ducks of Anaheim", 
        "Anaheim Ducks"), list("Winnipeg Jets (historical)", "Arizona Coyotes"), 
        list("Phoenix Coyotes", "Arizona Coyotes"), list("Atlanta Flames", "Calgary Flames"), 
        list("Atlanta Thrashers", "Winnipeg Jets"), list("Toronto Toros", "Birmingham Bulls"), 
        list("Ottawa Nationals", "Birmingham Bulls"), list("Quebec Athletic Club/Bulldogs", 
            "Brooklyn Americans"), list("Hamilton Tigers", "Brooklyn Americans"), 
        list("New York Americans", "Brooklyn Americans"), list("Philadelphia Blazers", 
            "Calgary Cowboys"), list("Vancouver Blazers", "Calgary Cowboys"), 
        list("Oakland Seals", "Cleveland Barons"), list("California Golden Seals", 
            "Cleveland Barons"), list("New England Whalers", "Carolina Hurricanes"), 
        list("Hartford Whalers", "Carolina Hurricanes"), list("Chicago Black Hawks", 
            "Chicago Blackhawks"), list("Quebec Nordiques", "Colorado Avalanche"), 
        list("Kansas City Scouts", "New Jersey Devils"), list("Colorado Rockies", 
            "New Jersey Devils"), list("Minnesota North Stars", "Dallas Stars"), 
        list("Detroit Cougars", "Detroit Red Wings"), list("Detroit Falcons", 
            "Detroit Red Wings"), list("Los Angeles Sharks", "Michigan Stags/Baltimore Blades"), 
        list("New York Raiders", "San Diego Mariners"), list("New York Golden Blades/New Jersey Knights", 
            "San Diego Mariners"), list("Pittsburgh Pirates", "Philadelphia Quakers"), 
        list("Toronto Arenas", "Toronto Maple Leafs"), list("Toronto St. Patricks", 
            "Toronto Maple Leafs"), list("Ottawa Senators (historical)", "St. Louis Eagles"))
    
    # ReType frame
    message("retype frame")
    hockeyData <- subset(hockeyData, select = -LOG)
    hockeyData <- subset(hockeyData, select = -X)
    hockeyData$Date <- as.Date(hockeyData$Date)
    hockeyData$Att. <- as.integer(hockeyData$Att.)
    names(hockeyData)[names(hockeyData) == "G"] <- "VisitorGoals"
    names(hockeyData)[names(hockeyData) == "G.1"] <- "HomeGoals"
    names(hockeyData)[names(hockeyData) == "X.1"] <- "OTStatus"
    
    if (identifyTies) {
        hockeyData$Tie <- FALSE
        hockeyData[allHockeyData$X.1 %in% c("2OT", "3OT", "4OT", "5OT", "6OT", 
            "OT", "SO"), ]$Tie <- TRUE
    }
    
    # Remove games against international teams
    if (removeInternational) {
        message("dropping international games")
        hockeyData <- hockeyData[hockeyData$Visitor != "Soviet All-Stars", ]
        hockeyData <- hockeyData[hockeyData$Visitor != "Czechoslovakia", ]
        hockeyData <- hockeyData[hockeyData$Visitor != "Finland", ]
    }
    
    message("dropping unplayed games - future or past cancelled")
    hockeyData <- hockeyData[!is.na(hockeyData$VisitorGoals), ]
    
    if (cleanTeams) {
        # Special Casing out the various teams with repeat existances
        message("special cases")
        levels(hockeyData$Home) <- c(levels(hockeyData$Home), "Winnipeg Jets (historical)")
        levels(hockeyData$Visitor) <- c(levels(hockeyData$Visitor), "Winnipeg Jets (historical)")
        levels(hockeyData$Home) <- c(levels(hockeyData$Home), "Ottawa Senators (historical)")
        levels(hockeyData$Visitor) <- c(levels(hockeyData$Visitor), "Ottawa Senators (historical)")
        
        hockeyData[hockeyData$Visitor == "Winnipeg Jets" & hockeyData$Date < 
            as.Date("1997-01-01", format = "%Y-%m-%d"), ]$Visitor <- "Winnipeg Jets (historical)"
        hockeyData[hockeyData$Home == "Winnipeg Jets" & hockeyData$Date < as.Date("1997-01-01", 
            format = "%Y-%m-%d"), ]$Home <- "Winnipeg Jets (historical)"
        hockeyData[hockeyData$Visitor == "Ottawa Senators" & hockeyData$Date < 
            as.Date("1935-01-01", format = "%Y-%m-%d"), ]$Visitor <- "Ottawa Senators (historical)"
        hockeyData[hockeyData$Home == "Ottawa Senators" & hockeyData$Date < 
            as.Date("1935-01-01", format = "%Y-%m-%d"), ]$Home <- "Ottawa Senators (historical)"
        
        message("reguar replacements")
        for (t in teamReplace) {
            hockeyData[hockeyData$Visitor == t[1], ]$Visitor <- t[2]
            hockeyData[hockeyData$Home == t[1], ]$Home <- t[2]
        }
        
        hockeyData[hockeyData$Date > as.Date("1976-09-01", format = "%Y-%m-%d") & 
            hockeyData$Visitor == "Minnesota Fighting Saints", ]$Visitor <- "Cleveland Crusaders"
        hockeyData[hockeyData$Date > as.Date("1976-09-01", format = "%Y-%m-%d") & 
            hockeyData$Home == "Minnesota Fighting Saints", ]$Home <- "Cleveland Crusaders"
        
        hockeyData$Visitor <- droplevels(hockeyData$Visitor)
        hockeyData$Home <- droplevels(hockeyData$Home)
    }
    
    if (listWinnersLosers) {
        message("find winners and losers")
        hockeyData$Winner <- as.factor(apply(hockeyData, 1, function(x) ifelse(x[3] > 
            x[5], x[2], x[4])))
        # <= works because for ELO a winner & loser don't matter in tie.
        hockeyData$Loser <- as.factor(apply(hockeyData, 1, function(x) ifelse(x[3] <= 
            x[5], x[2], x[4])))
    }
    
    return(hockeyData)
}

Using this function, we see the new output:

cleanData <- cleanHockeyData(allHockeyData)
## retype frame
## Warning in cleanHockeyData(allHockeyData): NAs introduced by coercion
## dropping international games
## dropping unplayed games - future or past cancelled
## special cases
## reguar replacements
## find winners and losers
head(cleanData)
##         Date             Visitor VisitorGoals                Home
## 1 1917-12-19 Toronto Maple Leafs            9  Montreal Wanderers
## 2 1917-12-19  Montreal Canadiens            7    St. Louis Eagles
## 3 1917-12-22  Montreal Canadiens           11  Montreal Wanderers
## 4 1917-12-22    St. Louis Eagles            4 Toronto Maple Leafs
## 5 1917-12-26    St. Louis Eagles            6  Montreal Wanderers
## 6 1917-12-26  Montreal Canadiens            5 Toronto Maple Leafs
##   HomeGoals OTStatus Att. Notes   Tie              Winner
## 1        10            NA       FALSE  Montreal Wanderers
## 2         4            NA       FALSE  Montreal Canadiens
## 3         2            NA       FALSE  Montreal Canadiens
## 4        11            NA       FALSE Toronto Maple Leafs
## 5         3            NA       FALSE    St. Louis Eagles
## 6         7            NA       FALSE Toronto Maple Leafs
##                 Loser
## 1 Toronto Maple Leafs
## 2    St. Louis Eagles
## 3  Montreal Wanderers
## 4    St. Louis Eagles
## 5  Montreal Wanderers
## 6  Montreal Canadiens

We’ll stash this cleaned data frame back in the ./_data/ folder, to make it easier to use in the future.

saveRDS(cleanData, "./_data/hockeyData.Rds")

We’ll dig into this data at a later point.

Written on July 28, 2016