Better Season Results Visualizations

Intro

No one likes to hear that their team is playing a back-to-back, in any sport. Hockey is no different.

Data

We’ll use some of the old scripts for loading and processing the hockey data. We’ll look at the last eleven complete seasons of data (2006-2007 to 2016-2017).

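# Load data. readHockeyData() comes from the old loading and processing scripts.
nhl10 <- readHockeyData(data_dir = "./_data/", nhl_year_list = c(2007:2017), playoffs = FALSE)
# Keep NHL games only, then keep the seven columns used below.
nhl10 <- nhl10[nhl10$League == "NHL", ]
nhl10 <- nhl10[, c(1, 2, 3, 4, 5, 6, 11)]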

A nasty set of ifs takes each game and checks the previous three days for a game involving the home or away team. If one is found, that team’s rest is set to the number of days off in between, from 0 (a back-to-back) to 2. Otherwise the rest stays at 3, which stands for three or more days.

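# Finding back-to-backs, rested teams, etc.
# The default of 3 stands for three or more days of rest.
nhl10$HomeRest <- 3
nhl10$VisitorRest <- 3

for (g in seq_len(nrow(nhl10))) {
  d <- nhl10[g, 1]  # game date
  h <- nhl10[g, 2]  # home team
  v <- nhl10[g, 4]  # visiting team

  # Home team: did it play (at home or away) 1, 2, or 3 days earlier?
  if (h %in% nhl10[nhl10[, 1] == d - 1, 2] || h %in% nhl10[nhl10[, 1] == d - 1, 4]) {
    nhl10[g, "HomeRest"] <- 0
  } else if (h %in% nhl10[nhl10[, 1] == d - 2, 2] || h %in% nhl10[nhl10[, 1] == d - 2, 4]) {
    nhl10[g, "HomeRest"] <- 1
  } else if (h %in% nhl10[nhl10[, 1] == d - 3, 2] || h %in% nhl10[nhl10[, 1] == d - 3, 4]) {
    nhl10[g, "HomeRest"] <- 2
  }
  # Visiting team: same check.
  if (v %in% nhl10[nhl10[, 1] == d - 1, 2] || v %in% nhl10[nhl10[, 1] == d - 1, 4]) {
    nhl10[g, "VisitorRest"] <- 0
  } else if (v %in% nhl10[nhl10[, 1] == d - 2, 2] || v %in% nhl10[nhl10[, 1] == d - 2, 4]) {
    nhl10[g, "VisitorRest"] <- 1
  } else if (v %in% nhl10[nhl10[, 1] == d - 3, 2] || v %in% nhl10[nhl10[, 1] == d - 3, 4]) {
    nhl10[g, "VisitorRest"] <- 2
  }
}

# Positive values mean the home team is the more rested side.
nhl10$RestDifferent <- nhl10$HomeRest - nhl10$VisitorRest
# Binary result: any home win (including overtime or shootout) rounds to 1.
nhl10$BinResult <- round(nhl10$Result)
# Three-way result: anything strictly between a loss and a win becomes 0.5.
nhl10$TriResult <- ifelse((nhl10$Result < 0.9 & nhl10$Result > 0.1), 0.5, nhl10$Result)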
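
The same rest calculation can be sketched without the chain of ifs by looking up each team’s most recent earlier game. This is only an illustration on a made-up four-game schedule: the `games` data frame, its column names, and the `rest_days()` helper are assumptions, not part of the original scripts.

```r
# Hypothetical toy schedule; column names are assumed for illustration.
games <- data.frame(
  Date    = as.Date(c("2017-01-01", "2017-01-02", "2017-01-04", "2017-01-08")),
  Home    = c("OTT", "TOR", "OTT", "MTL"),
  Visitor = c("TOR", "OTT", "MTL", "OTT"),
  stringsAsFactors = FALSE
)

# Days of rest before a game for one team, capped at `cap` (3 = three or more).
rest_days <- function(team, date, games, cap = 3) {
  played <- games$Date[games$Home == team | games$Visitor == team]
  prior  <- played[played < date]
  if (length(prior) == 0) return(cap)           # no earlier game: fully rested
  min(as.numeric(date - max(prior)) - 1, cap)   # consecutive days give 0 rest
}

idx <- seq_len(nrow(games))
games$HomeRest <- vapply(
  idx, function(i) rest_days(games$Home[i], games$Date[i], games), numeric(1)
)
games$VisitorRest <- vapply(
  idx, function(i) rest_days(games$Visitor[i], games$Date[i], games), numeric(1)
)

games
```

As in the loop above, a team with no game in the previous three days ends up in the “3 or more” bucket.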

Analysis

We can see how the league typically schedules teams with a few quick ggplot graphics.

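library(ggplot2)

# With binwidth = 1, each bar's density equals the proportion of games.
# after_stat(density) replaces the deprecated ..density.. notation.
ggplot(data = nhl10) +
  geom_histogram(aes(x = HomeRest, y = after_stat(density)), fill = 'blue', alpha = 0.3, binwidth = 1, center = 0) +
  geom_histogram(aes(x = VisitorRest, y = after_stat(density)), fill = 'red', alpha = 0.3, binwidth = 1, center = 0) +
  ggtitle("Proportion of Days of Rest for Home (Blue) and Away (Red) Teams") +
  xlab("Days of Rest") +
  ylab("Proportion of Games") +
  theme_bw()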

[Figure: proportion of games by days of rest for home (blue) and away (red) teams]

Clearly, the league is less inclined to schedule the second game of a back-to-back on the road than at home. Correspondingly, home teams play fewer of their games on one, two, or three or more days of rest than visitors do.
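
As a rough numeric check on the histogram, the same proportions can be recomputed from the table of game counts reported later in this post. This is a sketch only: the matrix is typed in by hand from that table, and the names `counts` and `shares` are introduced here for illustration.

```r
# Game counts by rest (rows: home team, columns: visitor), copied by hand
# from the counts table later in this post.
counts <- matrix(
  c(629, 1584,  578, 301,
    506, 4119, 1065, 458,
    178, 1032,  798, 188,
     79,  385,  161, 455),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("Home.0", "Home.1", "Home.2", "Home.3+"),
                  c("Visitor.0", "Visitor.1", "Visitor.2", "Visitor.3+"))
)

# Share of all games played at each rest level, for home and visiting teams.
shares <- rbind(
  Home    = unname(rowSums(counts)) / sum(counts),
  Visitor = unname(colSums(counts)) / sum(counts)
)
colnames(shares) <- c("0", "1", "2", "3+")

round(shares, 3)
```

Under these counts, home teams are on zero days of rest in about 25% of games, against about 11% for visitors.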

Let’s also look at the winning percentages for teams in each situation. To start, we’ll note that the overall win percentage for the home team over this period has been 54.65%. The home win percentage for each combination of rest days is shown in the table below.

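library(pander)

# Home win percentage for each combination of home and visitor rest.
# Rows and columns 1 to 4 correspond to 0, 1, 2, and 3+ days of rest.
m <- matrix(0.5, nrow = 4, ncol = 4)
rownames(m) <- c("Home.0", "Home.1", "Home.2", "Home.3+")
colnames(m) <- c("Visitor.0", "Visitor.1", "Visitor.2", "Visitor.3+")
for (a in 1:4) {
  for (b in 1:4) {
    r <- mean(nhl10[nhl10$HomeRest == (a - 1) & nhl10$VisitorRest == (b - 1), "BinResult"])
    m[a, b] <- r
  }
}
pander(m)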

           Visitor.0   Visitor.1   Visitor.2   Visitor.3+
--------- ----------- ----------- ----------- ------------
Home.0     0.5453      0.5941      0.5865      0.5116
Home.1     0.5435      0.5416      0.5521      0.5262
Home.2     0.4663      0.5281      0.5451      0.484
Home.3+    0.4304      0.5662      0.559       0.5099
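
The nested loop above can also be written as a single grouped mean with `tapply()`. A minimal sketch on a made-up six-game data frame (the `toy` data is hypothetical; only the column names follow the post):

```r
# Hypothetical games: rest days for each side and a binary home result.
toy <- data.frame(
  HomeRest    = c(0, 0, 1, 1, 1, 3),
  VisitorRest = c(1, 1, 0, 1, 1, 0),
  BinResult   = c(1, 0, 1, 1, 0, 0)
)

# Mean of BinResult within each (home rest, visitor rest) combination.
# Combinations with no games come back as NA.
win_pct <- tapply(
  toy$BinResult,
  list(Home = toy$HomeRest, Visitor = toy$VisitorRest),
  mean
)

win_pct
```

One difference from the loop is that only rest levels present in the data appear as rows and columns, so converting the rest columns to factors with fixed levels would be needed to guarantee a full 4 x 4 table.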

So, it looks like there are some situations to avoid, such as being at home and very rested against a visiting team on a back-to-back. A home win percentage of 43.04% is very low. That’s an interesting finding, and it seems to hold some water, as the situation has come up 79 times over this period. In fact, none of these cases are infrequent; below is a table of the counts of each case.

# Number of games for each combination of home and visitor rest.
m <- matrix(0.5, nrow = 4, ncol = 4)
rownames(m) <- c("Home.0", "Home.1", "Home.2", "Home.3+")
colnames(m) <- c("Visitor.0", "Visitor.1", "Visitor.2", "Visitor.3+")
for (a in 1:4) {
  for (b in 1:4) {
    r <- length(nhl10[nhl10$HomeRest == (a - 1) & nhl10$VisitorRest == (b - 1), "BinResult"])
    m[a, b] <- r
  }
}
pander(m)

           Visitor.0   Visitor.1   Visitor.2   Visitor.3+
--------- ----------- ----------- ----------- ------------
Home.0     629         1584        578         301
Home.1     506         4119        1065        458
Home.2     178         1032        798         188
Home.3+    79          385         161         455
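
One way to gauge how much weight the 43.04% cell can bear is an exact binomial test against the overall home win rate. This is a sketch rather than part of the original analysis: the win count of 34 is inferred from 0.4304 × 79, and using 54.65% as the baseline is an assumption.

```r
wins    <- 34      # assumption: round(0.4304 * 79)
n_games <- 79      # Home.3+ versus Visitor.0 games, from the counts table
home_rate <- 0.5465  # overall home win rate quoted in the post

# Exact two-sided binomial test of the cell's win rate against the baseline.
test <- binom.test(wins, n_games, p = home_rate)

test$estimate   # observed home win rate in this cell
test$conf.int   # 95% confidence interval for that rate
test$p.value    # two-sided p-value against the baseline
```

With only 79 games the confidence interval is wide, and with sixteen cells in the table some extreme values are to be expected by chance, so this cell is best read with caution.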

The above results give overtime and shootout results the same weight as regulation wins. If we instead code each result as one of {0, 0.5, 1}, representing a home loss, an overtime or shootout game (either team winning), and a home win, respectively, then we’ll get the following win chances:

# Mean three-way result (0, 0.5, 1) for each combination of rest days.
m <- matrix(0.5, nrow = 4, ncol = 4)
rownames(m) <- c("Home.0", "Home.1", "Home.2", "Home.3+")
colnames(m) <- c("Visitor.0", "Visitor.1", "Visitor.2", "Visitor.3+")
for (a in 1:4) {
  for (b in 1:4) {
    r <- mean(nhl10[nhl10$HomeRest == (a - 1) & nhl10$VisitorRest == (b - 1), "TriResult"])
    m[a, b] <- r
  }
}
pander(m)

           Visitor.0   Visitor.1   Visitor.2   Visitor.3+
--------- ----------- ----------- ----------- ------------
Home.0     0.5644      0.5821      0.5779      0.5133
Home.1     0.5296      0.5426      0.5521      0.5295
Home.2     0.4831      0.5349      0.5476      0.4681
Home.3+    0.4241      0.5403      0.5342      0.5099

The same effect is shown whether we account for overtime games or not.
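
To make the two encodings concrete, here is a small sketch of how `BinResult` and `TriResult` are derived. The `result` values are hypothetical stand-ins, since the post does not list the values the loading scripts actually produce; the only assumption is that regulation games are coded 0 or 1 and overtime or shootout games fall strictly in between.

```r
# Hypothetical Result values: 0 and 1 are regulation games, the rest are
# overtime or shootout games weighted toward the loser or the winner.
result <- c(0, 0.25, 0.4, 0.6, 0.75, 1)

# Binary encoding: any home win counts as 1, any home loss as 0.
bin_result <- round(result)

# Three-way encoding: every overtime or shootout game becomes 0.5.
tri_result <- ifelse(result < 0.9 & result > 0.1, 0.5, result)

data.frame(result, bin_result, tri_result)
```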

Finally, we’ll look at expected goal differential, to see if there are any insights into goal production by rest difference.

nhl10$GoalDiff <- nhl10$HomeGoals - nhl10$VisitorGoals
# Treat overtime and shootout games as a zero goal differential.
nhl10$GoalDiff[nhl10$OTStatus != ''] <- 0
ggplot(data = nhl10, aes(x = RestDifferent, y = GoalDiff)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  ggtitle("Goal Differential by Rest Difference (+ favours home team)") +
  xlab("Rest Difference (Home minus Visitor, Days)") +
  ylab("Goal Differential") +
  theme_bw()

[Figure: goal differential against rest difference, with a linear fit]

While the line of best fit technically has a negative slope, the R^2 value for the fit is only 0.001458, meaning rest difference explains about 0.15% of the variance in goal differential, which is functionally useless. Thus, nothing about the goal difference can be concluded from the amount of rest of the teams.
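
For reference, the slope and R^2 behind a line like this can be read off a fitted `lm()` object. The sketch below uses simulated data in which goal differential is generated independently of rest difference, so it shows the mechanics only; the `sim` data frame and its parameters are made up and do not reproduce the post’s numbers.

```r
set.seed(42)
n <- 2000

# Simulated games: rest difference and goal differential drawn independently.
sim <- data.frame(
  RestDifferent = sample(-3:3, n, replace = TRUE),
  GoalDiff      = round(rnorm(n, mean = 0.2, sd = 2))
)

# Same model that geom_smooth(method = 'lm') fits for the plot.
fit <- lm(GoalDiff ~ RestDifferent, data = sim)

slope <- unname(coef(fit)["RestDifferent"])  # change in goal diff per rest day
r2    <- summary(fit)$r.squared              # share of variance explained

c(slope = slope, r.squared = r2)
```

An R^2 close to zero, as in the post, means the fitted line explains almost none of the game-to-game variation, whatever the sign of the slope.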

Conclusions

Everyone hates the thought of their favorite team playing back-to-back games. But there’s no reason to fear. In fact, the data suggest that longer rests are more detrimental to a team’s performance.

Riddler 2017-08-04: Hot Potato

Hot Potato

From FiveThirtyEight’s riddler this week:

TensorFlow and R for NLP

Those of you who are interested in machine learning will likely have heard of Google’s TensorFlow. While R is not officially supported, RStudio has developed a wrapper to be able to use TensorFlow in R. More information, and a few tutorials, are available on the website, but I’ll add to that list with some Natural Language Processing (NLP) examples, since they seem to not be overly abundant online.

TSP in R Part 2

Last time we created a distance matrix and a time matrix for use in TSP problems. We’re using a set of locations in the Ottawa, Ontario, Canada area, but any list of locations with addresses would work. Now we’ll work through getting optimized routes to visit each address once. We’ll optimize by distance, but we also generated a ‘time matrix’ and could run the TSP solver that way.

TSP in R Part 1

I’ve been playing around recently with some Travelling Salesperson Problems (TSP), and by extension some Vehicle Routing Problems (VRP). For example, when people come to visit Ottawa, are they visiting their list of sites in the most efficient order, that is, spending the least time or distance in their cars travelling between places? If their trip takes more than one day, does that change the order in which they see things?

Building Concorde for OS X

I’ve been playing around with Travelling Salesperson Problems (TSP) recently. A package for R, TSP, contains most basic solvers, but it doesn’t contain one of the best, Concorde.

Concorde is available as prebuilt binaries for many platforms, but not for OS X. For Macs, it has to be downloaded and built from source. It can be tough to build on OS X, but after much digging I’ve found some instructions. It took the Wayback Machine to dig them out of some archives, so I’m reposting them here for posterity.

NHL Stadium Locations

As part of the Data Science Specialization offered by Johns Hopkins through Coursera, I have a project to create an ‘interactive map’ in an R Markdown post. I figured I should map where the NHL stadiums are.

Scoring ELO Over a Season

On Twitter, there are many excellent hockey analytics folks. One in particular, @IneffectiveMath, is worth following. Amongst other things, such as great visualizations of data, he’s running a contest this year that is rating models on their season-long predictions, using the following scoring scheme:

New R Package: HockeyScrapR

Based on work that I did earlier, with collecting scores information and player data from Hockey-Reference.com, I’ve developed a small R package to help with this scraping…

HockeyScrapR

Optimizing Elo Parameters for Game Predictions

In the past few weeks, I’ve been optimizing parameters for Elo-based prediction of NHL games. The code is complex and won’t be put here. Check the source code in the repo.

I’ve put those results together in a combo plot, built using ggplot and gridExtra to arrange things better. Check the source code for this post for the details on that.

Scraping Player Data

Hockey-Reference.com is a wonderful tool, with hoards of data to be played with. We’ve used their great site for scraping score data (see this post), but there is a full stats breakdown of every player who has ever played in or been drafted to the NHL on their site as well.

In this post, we’ll see how to write a scraper to collect that data for future use.

The Puzzle Of The Lonesome King

From http://fivethirtyeight.com/features/the-puzzle-of-the-lonesome-king/.

A coronation probability puzzle from Charles Steinhardt:

The childless King of Solitaria lives alone in his castle. Overly lonely, the king one day offers one lucky subject the chance to be prince or princess for a day. The loyal subjects leap at the opportunity, having heard tales of the opulent castle and decadent meals that will be lavished upon them. The subjects assemble on the village green, hoping to be chosen.

The winner is chosen through the following game. In the first round, every subject simultaneously chooses a random other subject on the green. (It’s possible, of course, that some subjects will be chosen by more than one other subject.) Everybody chosen is eliminated. (Not killed or anything, just sent back to their hovels.) In each successive round, the subjects who are still in contention simultaneously choose a random remaining subject, and again everybody chosen is eliminated. If there is eventually exactly one subject remaining at the end of a round, he or she wins and heads straight to the castle for fêting. However, it’s also possible that everybody could be eliminated in the last round, in which case nobody wins and the king remains alone. If the kingdom has a population of 56,000 (not including the king), is it more likely that a prince or princess will be crowned or that nobody will win?

Extra credit: How does the answer change for a kingdom of arbitrary size?
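
The game lends itself to a quick Monte Carlo check. The sketch below is one possible simulation, not necessarily the approach taken in the full post: `play_game()` and `crown_rate()` are names made up here, and it assumes that in every round each subject picks uniformly among the other remaining subjects (never themselves).

```r
# Play one game with n subjects.
# Returns TRUE if exactly one subject is left, FALSE if everyone is eliminated.
play_game <- function(n) {
  remaining <- n
  while (remaining > 1) {
    # Each subject picks another subject: draw an offset in 1..(remaining - 1)
    # and wrap around, so nobody can pick themselves.
    offsets <- sample.int(remaining - 1, remaining, replace = TRUE)
    chosen  <- ((seq_len(remaining) - 1 + offsets) %% remaining) + 1
    # Everybody chosen at least once is eliminated.
    remaining <- remaining - length(unique(chosen))
  }
  remaining == 1
}

# Estimated probability that somebody is crowned, from `trials` simulated games.
crown_rate <- function(n, trials = 1000) {
  mean(replicate(trials, play_game(n)))
}

set.seed(2016)
crown_rate(56000, trials = 200)
```

Calling `crown_rate()` over a range of population sizes gives a rough empirical answer to the extra-credit question.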

NHL and Elo Through the Years - Part 3

Having Elo ratings for teams over all time is cool, but how do we know that it’s meaningful? Sure, we can look at the Stanley Cup winning team each year, and see that they typically have a good rating. Or, we can anecdotally look back at our favourite team, remember how good or bad they were for a few seasons in the past, and see that they were near the top or the bottom of the pile at that point in time.

NHL and Elo Through the Years - Part 2

As promised, I’ve created a shiny app containing all of the Elo Rankings through NHL and WHA history. You can see this app here, or embedded below.

NHL and Elo Through the Years - Part 1

I’ve developed my own Elo toolset, with the options that I discussed in this earlier post. This includes an adjustment option for home ice advantage, and it isn’t pinned down to any specific set of possible results (e.g. it can give overtime wins less of a boost than regulation wins). Let’s take a look at the Elo ratings over all time in the NHL.

New Elo for NHL

Last time, we looked at Elo ratings for NHL teams. We saw that the fancier Elo ratings didn’t keep the average constant; that is, ratings were inflated through time. As well, they didn’t include any summertime normalization, whereby the Elo ratings are adjusted towards the mean, as is very common in sports Elo. Other common adjustments, such as placing increased weight on playoff games, aren’t included either. I’ll look into developing a set of those tools, specific to our uses.

Home Ice Advantage

While working on a new Elo rankings post, I played around with data to determine a good value for home ice advantage (in terms of winning percentage). That value has changed drastically over the years, and I thought it was interesting, so I’m putting it up here.

Chemistry Intro

Along with playing around with hockey data, I have a real job doing chemistry. Sometimes, this blog will contain work from that field, discussing tools I use or have written, papers that I find interesting, or other cool stuff. I mention this because I’ve started working on a few posts that will trickle out slowly, interspersed among the hockey posts. To easily find these types of post, use the Tags feature, and look for ‘chemistry’.

NHL Elo

Predicting scores (and seasons) by Dixon-Coles is interesting, but it’s one of many ways of doing ‘game-level’ predictions. There’s a family of rating systems called Elo, which was originally developed to rank chess players. There are a number of extensions of Elo, including some modifications to parameters by the World Chess Federation (FIDE), a modification including uncertainty and ‘reliability’ called Glicko, and a more parameterized version of Glicko developed in 2012 called Stephenson. These are all implemented in the PlayerRatings package in R. There’s also a modification of Glicko developed by Microsoft called TrueSkill, and this is implemented in the aptly named trueskill package. Note that TrueSkill is a closed licence product, available only for non-commercial implementations.

We’ll compare all of these methods for their historical performance in NHL, as well as (eventually) go into predicting the coming season. TrueSkill has a few oddities, so we’ll look at it later.

nhlscrapr and Play-By-Play Data

As interesting as it is to predict how well teams will do on a team-by-team basis, based on their past performance, it would be great to get better granularity and be able to dig into what happened each game. This data is available online from the NHL website. Manually downloading it all would be horrendous (there are 1230 games each year, plus playoffs). Fortunately, a package exists on CRAN to help with this.

Predicting 2016-2017 NHL Season Results

Now to some new posts!

When making a prediction engine, it’s always fun to see what next season looks like. We have the schedule for the 2016-2017 NHL season, and we have all the data from the past seasons, so let’s get some calculations going!

Better Season Results Visualizations

In the past, we’ve shown predictions as a table of expected positions, which looks terrible. It’s hard to read, doesn’t fit on a page, and that much data is hard to absorb on the face of it.

Luckily, R has some great visualization tools in ggplot2. I’ll demonstrate some new ways to visualize results, based on past posts’ predictions, and I’ll use these in future posts.

Evaluating the Models