Dixon-Coles Prediction of a Single Hockey Game
## Error in optim(par = par.inits, fn = DCoptimFn, DCm = dcm, xi = xi, method = "BFGS", : non-finite finite-difference value [2]
Note: This is earlier work I did (last winter/spring) so some info may seem dated at time of posting. I’ve used data files current to then.
In section 1, we prepared historical hockey data for analysis. In section 2, we set up some functions do prepare the Dixon-Coles parameters. Now, we can use them to predict a game.
Poisson analysis is good at predicting, based on an average, the proportion of each integer events happening. So, if we expect 3.1 goals to be scored by a certain team as an average, we know that the proportion of that is going to be (in R code) dpois(3,3.1)
, equalling 0.2236768. Similarly, the chance that 1 and 2 goals is scored by that team to be dpois(c(1,2), 3.1)
to be 0.1396525 and 0.2164614 respectively.
The Dixon-Coles optimization gave us ‘attack’ and ‘defence’ parameters, along with a home field advantage factor home
and a fudge fator for low-scoring games, ‘rho’ (ρ). For two teams, Home and Away, we can calculate their expected goals socred by adding their attack, plus the opposing teams’ defence, plus (for home) the home field advantage. This will give us a ‘lambda’ (λ) and ‘mu’ (μ) which are the home and away goals expected. For example, let’s use Toronto Maple Leafs vs. Montreal Canadiens:
home <- "Montreal Canadiens"
away <- "Toronto Maple Leafs"
Montreal is a stronger team this year, we expect a higher attack and lower defence factor. For the 2015 season only, they are, in fact, 0.9781757 and -0.267567. Similarly, Toronto’s factors are 0.9660961 and 0.0829082 respectively.
We’ll do the addition to get λ and μ:
# Expected goals home
lambda <- exp(res_2015$par["HOME"] + res_2015$par["Attack.Montreal Canadiens"] +
res_2015$par["Defence.Toronto Maple Leafs"])
# Expected goals away
mu <- exp(res_2015$par["Attack.Toronto Maple Leafs"] + res_2015$par["Defence.Montreal Canadiens"])
Thus, we expect Montreal to score 3.1527745 goals and Toronto to score 2.0107929 goals.
Using this knowledge, we can create a probability matrix of Home and Away goals’ Poisson probability. Using λ and μ, probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))
, and a maximum number of goals per team of 8:
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|---|
0 | 0.0057 | 0.0115 | 0.0116 | 0.0078 | 0.0039 | 0.0016 | 5e-04 | 2e-04 | 0 |
1 | 0.018 | 0.0363 | 0.0365 | 0.0244 | 0.0123 | 0.0049 | 0.0017 | 5e-04 | 1e-04 |
2 | 0.0284 | 0.0572 | 0.0575 | 0.0385 | 0.0194 | 0.0078 | 0.0026 | 7e-04 | 2e-04 |
3 | 0.0299 | 0.0601 | 0.0604 | 0.0405 | 0.0204 | 0.0082 | 0.0027 | 8e-04 | 2e-04 |
4 | 0.0236 | 0.0474 | 0.0476 | 0.0319 | 0.016 | 0.0065 | 0.0022 | 6e-04 | 2e-04 |
5 | 0.0149 | 0.0299 | 0.03 | 0.0201 | 0.0101 | 0.0041 | 0.0014 | 4e-04 | 1e-04 |
6 | 0.0078 | 0.0157 | 0.0158 | 0.0106 | 0.0053 | 0.0021 | 7e-04 | 2e-04 | 1e-04 |
7 | 0.0035 | 0.0071 | 0.0071 | 0.0048 | 0.0024 | 0.001 | 3e-04 | 1e-04 | 0 |
8 | 0.0014 | 0.0028 | 0.0028 | 0.0019 | 9e-04 | 4e-04 | 1e-04 | 0 | 0 |
Table: Probability of specific score matrix
This has the away score on the top (as columns), and the home score along the side (as rows). Recall that we need to apply the τ function to this matrix, to account for low scoring games. Once that is done, we can sum the diagonal of the matrix to find the probability of a tie. The sum of the upper triangle is the probability of an away win, and the sum of the lower triangle is the home win.
Putting it all together we get this function:
predictResult <- function(res, home, away, maxgoal = 8) {
attack.home <- paste("Attack", home, sep = ".")
attack.away <- paste("Attack", away, sep = ".")
defence.home <- paste("Defence", home, sep = ".")
defence.away <- paste("Defence", away, sep = ".")
# Expected goals home
lambda <- exp(res$par["HOME"] + res$par[attack.home] + res$par[defence.away])
# Expected goals away
mu <- exp(res$par[attack.away] + res$par[defence.home])
probability_matrix <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))
scaling_matrix <- matrix(tau(c(0, 1, 0, 1), c(0, 0, 1, 1), lambda, mu, res$par["RHO"]),
nrow = 2)
probability_matrix[1:2, 1:2] <- probability_matrix[1:2, 1:2] * scaling_matrix
HomeWinProbability <- sum(probability_matrix[lower.tri(probability_matrix)])
DrawProbability <- sum(diag(probability_matrix))
AwayWinProbability <- sum(probability_matrix[upper.tri(probability_matrix)])
return(c(HomeWinProbability, DrawProbability, AwayWinProbability))
}
probs_2015 <- predictResult(res_2015, home, away)
probs_all <- predictResult(res_all, home, away)
## Error in predictResult(res_all, home, away): object 'res_all' not found
probs_2015_time <- predictResult(time_res_2015, home, away)
probs_all_time <- predictResult(time_res_all, home, away)
Now we can look at one of the results (say, 2015 not time weighted) and see the probability of a home win (Montreal), draw, or away win (Toronto): 0.5968457, 0.1736576, 0.2240437. In fact, we can plot the probabilities for the four fits we have, to show the effect of each fit type.
## Error in rbind(probs_2015, probs_2015_time, probs_all, probs_all_time): object 'probs_all' not found
## Error in colnames(probs_df) <- c("Montreal Win", "Draw", "Toronto Win"): object 'probs_df' not found
## Error in probs_df$Sim.Type <- c("2015 Season", "2015 Season Time Weighted", : object 'probs_df' not found
## Error in melt(probs_df, id.vars = "Sim.Type"): object 'probs_df' not found
## Error in ggplot(pmelt, aes(y = value, x = variable, fill = Sim.Type)): object 'pmelt' not found
While the odds of a draw haven’t changed much, the odds of a Montreal or Toronto win have slightly. Note that over the past 10 years, Montreal has performed better, on average, than Toronto, so this is expected. As well, Monteal benefits from the home ice advantage. We can calculate this for a Toronto home game too.
## Error in predictResult(res_all, away, home): object 'res_all' not found
## Error in colnames(probs_df) <- c("Toronto Win", "Draw", "Montreal Win"): object 'probs_df' not found
## Error in probs_df$Sim.Type <- c("2015 Season", "2015 Season Time Weighted", : object 'probs_df' not found
## Error in melt(probs_df, id.vars = "Sim.Type"): object 'probs_df' not found
## Error in ggplot(pmelt, aes(y = value, x = variable, fill = Sim.Type)): object 'pmelt' not found
Why the difference in these two results? That’s because the home advantage factor ranges not insignificantly. While not a huge range, it does impact the probabilities enough to make a noticeable difference.
We can predict scores for our Toronto at Montreal game with some random number work. First, we’ll modify the probability matrix to contain the sum
of the probabilities to that point. Then we can index goals based on the probability correllating to a random number.
pmatrix <- matrix(nrow = nrow(probability_matrix), ncol = ncol(probability_matrix))
current_p <- 0
for (i in 1:ncol(pmatrix)) {
for (j in 1:nrow(pmatrix)) {
pmatrix[j, i] <- current_p
current_p <- current_p + probability_matrix[j, i]
}
}
predictOneGame <- function(pmatrix, stats, home, away) {
random <- runif(1)
# This ensures that there's no issue with random number larger than the
# largest pmatrix probability
while (random > pmatrix[nrow(pmatrix), ncol(pmatrix)]) {
random <- runif(1)
}
score <- as.vector(which(pmatrix > random, arr.ind = T)[1, ])
# scores is matrix c(home away), returning INDEX (ie 0-0 is 1,1)
score <- score - 1
return(score)
}
So, we can predict the score for a game with predictOneGame
, and get a result of home away
:
predictOneGame(pmatrix, home, away)
## [1] 4 1
Lets’ do that a few times, to see the different results we get.
t(replicate(10, predictOneGame(pmatrix, home, away)))
## [,1] [,2]
## [1,] 6 1
## [2,] 5 2
## [3,] 5 4
## [4,] 3 1
## [5,] 4 4
## [6,] 2 5
## [7,] 8 2
## [8,] 6 2
## [9,] 4 0
## [10,] 3 1
But, recall that the NHL doesn’t allow draws. We cold solve that by randomly choosing a winner, but that does a disservice to teams who excel at those scenarios. A simple way of adding OT is to use the log5
method, invented by Bill James, which applies the following formula, (from Wikipedia) The Log5 estimate for the probability of A defeating B is $p_{A,B} = \frac{p_A-p_A\times p_B}{p_A+p_B-2\times p_A\times p_B}$, where $p_A$ is the proportion of A wins, and $p_B$ is the proportion of B wins. For now, we’ll plug in even values (we’ll use pa=pb=0.5
, so log5 = (pa-(pa*pb))/(pa+pb-(2*pa*pb))
will equal 0.5 as well), but later we can re-evaluate performance of each team.
We’ll also go 50/50 for Shootout and Overtime, but can adjust those odds later too.
Re-writing the score prediction formula will give us the chance to simulate draw handling.
predictOneGame <- function(pmatrix, stats, home, away) {
random <- runif(1)
# This ensures that there's no issue with random number larger than the
# largest pmatrix probability
while (random > pmatrix[nrow(pmatrix), ncol(pmatrix)]) {
random <- runif(1)
}
score <- as.vector(which(pmatrix > random, arr.ind = T)[1, ])
# scores is matrix c(home away), returning INDEX (ie 0-0 is 1,1)
score <- score - 1
score[3] <- NA
if (score[1] == score[2]) {
if (runif(1) > 0.5) {
score[1] <- score[1] + 1
} else {
score[2] <- score[2] + 1
}
if (runif(1) > 0.5) {
score[3] <- "OT"
} else {
score[3] <- "SO"
}
}
return(score)
}
Trying this score simulation again:
t(replicate(10, predictOneGame(pmatrix, home, away)))
## [,1] [,2] [,3]
## [1,] "2" "1" NA
## [2,] "2" "3" NA
## [3,] "7" "1" NA
## [4,] "4" "3" NA
## [5,] "3" "2" NA
## [6,] "4" "3" "SO"
## [7,] "4" "3" NA
## [8,] "7" "4" NA
## [9,] "3" "1" NA
## [10,] "7" "2" NA
To predict the winner of a game in OT, we’ll use the win percentages of each team in the most recent season. Let’s grab the results from 2015 and figure out each team’s OT and Shootout results. We have to look at each game, so we’ll take this opportunity to collect the rest of the stats available to make a stats table.
makeStatsTable <- function(df) {
tmpTable = data.frame(Team = sort(unique(df$AwayTeam)), GP = 0, W = 0, OTL = 0,
L = 0, ROW = 0, HomeGames = 0, HomeWin = 0, HomeOTW = 0, HomeSOW = 0,
HomeOTL = 0, HomeLoss = 0, AwayGames = 0, AwayWin = 0, AwayOTW = 0,
AwaySOW = 0, AwayOTL = 0, AwayLoss = 0, P = 0, HomeFor = 0, HomeAgainst = 0,
AwayFor = 0, AwayAgainst = 0, GF = 0, GA = 0, DIFF = 0, PP = 0, OT.Win.Percent = 0)
# Games Played
tmpTable$HomeGames = as.numeric(table(df$HomeTeam))
tmpTable$AwayGames = as.numeric(table(df$AwayTeam))
# Wins
tmpTable$HomeWin = as.numeric(table(df$HomeTeam[df$HG > df$AG]))
tmpTable$AwayWin = as.numeric(table(df$AwayTeam[df$AG > df$HG]))
# Losses
tmpTable$HomeLoss = as.numeric(table(df$HomeTeam[df$AG > df$HG]))
tmpTable$AwayLoss = as.numeric(table(df$AwayTeam[df$HG > df$AG]))
# OT Wins
tmpTable$HomeOTW = as.numeric(table(df$HomeTeam[(df$OT.Win == "H") & (df$OT.SO ==
"OT")]))
tmpTable$HomeSOW = as.numeric(table(df$HomeTeam[(df$OT.Win == "H") & (df$OT.SO ==
"SO")]))
tmpTable$AwayOTW = as.numeric(table(df$AwayTeam[(df$OT.Win == "A") & (df$OT.SO ==
"OT")]))
tmpTable$AwaySOW = as.numeric(table(df$AwayTeam[(df$OT.Win == "A") & (df$OT.SO ==
"SO")]))
# OT Losses
tmpTable$HomeOTL = as.numeric(table(df$HomeTeam[(df$OT.Win == "A")]))
tmpTable$AwayOTL = as.numeric(table(df$AwayTeam[(df$OT.Win == "H")]))
# W/L/OTL/ROW
tmpTable$GP = tmpTable$HomeGames + tmpTable$AwayGames
tmpTable$W = tmpTable$HomeWin + tmpTable$AwayWin + tmpTable$HomeOTW + tmpTable$HomeSOW +
tmpTable$AwayOTW + tmpTable$AwaySOW
tmpTable$OTL = tmpTable$HomeOTL + tmpTable$AwayOTL
tmpTable$L = tmpTable$HomeLoss + tmpTable$AwayLoss
tmpTable$ROW = tmpTable$W - (tmpTable$HomeSOW + tmpTable$AwaySOW)
# Goal Diffs (includes OT scores)
tmpTable$HomeFor = as.numeric(tapply(df$HG, df$HomeTeam, sum, na.rm = TRUE)) +
tmpTable$HomeOTW + tmpTable$HomeSOW
tmpTable$HomeAgainst = as.numeric(tapply(df$AG, df$HomeTeam, sum, na.rm = TRUE)) +
tmpTable$HomeOTL
tmpTable$AwayFor = as.numeric(tapply(df$AG, df$AwayTeam, sum, na.rm = TRUE)) +
tmpTable$AwayOTW + tmpTable$AwaySOW
tmpTable$AwayAgainst = as.numeric(tapply(df$HG, df$AwayTeam, sum, na.rm = TRUE)) +
tmpTable$AwayOTL
tmpTable$GF = ifelse(is.na(tmpTable$HomeFor), 0, tmpTable$HomeFor) + ifelse(is.na(tmpTable$AwayFor),
0, tmpTable$AwayFor)
tmpTable$GA = ifelse(is.na(tmpTable$HomeAgainst), 0, tmpTable$HomeAgainst) +
ifelse(is.na(tmpTable$AwayAgainst), 0, tmpTable$AwayAgainst)
tmpTable$DIFF = tmpTable$GF - tmpTable$GA
# Additional Stats
tmpTable$P = 2 * tmpTable$W + tmpTable$OTL
tmpTable$PP = tmpTable$P/tmpTable$GP
tmpTable$OT.Win.Percent = (tmpTable$HomeOTW + tmpTable$HomeSOW + tmpTable$AwayOTW +
tmpTable$AwaySOW)/(tmpTable$HomeOTW + tmpTable$HomeSOW + tmpTable$AwayOTW +
tmpTable$AwayOTL + tmpTable$OTL)
tmpTable <- tmpTable[, c("Team", "GP", "W", "OTL", "L", "ROW", "P", "GF",
"GA", "DIFF", "PP", "OT.Win.Percent")]
tmpTable <- tmpTable[order(-tmpTable$P, -tmpTable$PP, -tmpTable$ROW, -tmpTable$DIFF),
]
rownames(tmpTable) <- 1:nrow(tmpTable)
return(tmpTable)
}
We use it by calling makeStatsTable
with the input being the season data.
Team | GP | W | OTL | L | ROW | P | GF | GA | OT.Win.Percent |
---|---|---|---|---|---|---|---|---|---|
New York Rangers | 82 | 53 | 7 | 22 | 49 | 113 | 252 | 192 | 0.5882 |
Montreal Canadiens | 82 | 50 | 10 | 22 | 43 | 110 | 221 | 189 | 0.619 |
Anaheim Ducks | 82 | 51 | 7 | 24 | 43 | 109 | 236 | 226 | 0.6667 |
St. Louis Blues | 82 | 51 | 7 | 24 | 42 | 109 | 248 | 201 | 0.6667 |
Tampa Bay Lightning | 82 | 50 | 8 | 24 | 47 | 108 | 262 | 211 | 0.3333 |
Nashville Predators | 82 | 47 | 10 | 25 | 41 | 104 | 232 | 208 | 0.5 |
Chicago Blackhawks | 82 | 48 | 6 | 28 | 39 | 102 | 229 | 189 | 0.8 |
Vancouver Canucks | 82 | 48 | 5 | 29 | 42 | 101 | 242 | 222 | 0.7059 |
Washington Capitals | 82 | 45 | 11 | 26 | 40 | 101 | 242 | 203 | 0.4 |
New York Islanders | 82 | 47 | 7 | 28 | 40 | 101 | 252 | 230 | 0.619 |
Minnesota Wild | 82 | 46 | 8 | 28 | 42 | 100 | 231 | 201 | 0.5 |
Detroit Red Wings | 82 | 43 | 14 | 25 | 39 | 100 | 235 | 221 | 0.4074 |
Ottawa Senators | 82 | 43 | 13 | 26 | 37 | 99 | 238 | 215 | 0.4333 |
Winnipeg Jets | 82 | 43 | 13 | 26 | 36 | 99 | 230 | 210 | 0.3929 |
Pittsburgh Penguins | 82 | 43 | 12 | 27 | 39 | 98 | 221 | 210 | 0.3571 |
Calgary Flames | 82 | 45 | 7 | 30 | 41 | 97 | 241 | 216 | 0.65 |
Boston Bruins | 82 | 41 | 14 | 27 | 37 | 96 | 213 | 211 | 0.4062 |
Los Angeles Kings | 82 | 40 | 15 | 27 | 38 | 95 | 220 | 205 | 0.1154 |
Dallas Stars | 82 | 41 | 10 | 31 | 37 | 92 | 261 | 260 | 0.5 |
Florida Panthers | 82 | 38 | 15 | 29 | 30 | 91 | 206 | 223 | 0.3214 |
Colorado Avalanche | 82 | 39 | 12 | 31 | 29 | 90 | 219 | 227 | 0.4138 |
San Jose Sharks | 82 | 40 | 9 | 33 | 36 | 89 | 228 | 232 | 0.375 |
Columbus Blue Jackets | 82 | 42 | 5 | 35 | 33 | 89 | 236 | 250 | 0.8235 |
Philadelphia Flyers | 82 | 33 | 18 | 31 | 30 | 84 | 215 | 234 | 0.2162 |
New Jersey Devils | 82 | 32 | 14 | 36 | 27 | 78 | 181 | 216 | 0.2308 |
Carolina Hurricanes | 82 | 30 | 11 | 41 | 25 | 71 | 188 | 226 | 0.3158 |
Toronto Maple Leafs | 82 | 30 | 8 | 44 | 25 | 68 | 211 | 262 | 0.4 |
Edmonton Oilers | 82 | 24 | 14 | 44 | 19 | 62 | 198 | 283 | 0.2414 |
Arizona Coyotes | 82 | 24 | 8 | 50 | 19 | 56 | 170 | 272 | 0.5556 |
Buffalo Sabres | 82 | 23 | 8 | 51 | 15 | 54 | 161 | 274 | 0.5625 |
Table: Table of statistics in 2015-2016 season to today.
We’ll feed it into a log5 function, and use that instead of a factor of 0.5 to determine OT winners. As well, it’s trivial to determine that on average, the number of games ending in shootout is slightly higher than the number ending in overtime.
log5.OT.predictor <- function(stats, home, away) {
# reurns chances that home team wins in OT
pa <- stats[stats$Team == home, ]$OT.Win.Percent
pb <- stats[stats$Team == away, ]$OT.Win.Percent
log5 <- (pa - (pa * pb))/(pa + pb - (2 * pa * pb))
return(log5)
}
predict_one_game <- function(pmatrix, stats, home, away) {
random <- runif(1)
while (random > pmatrix[nrow(pmatrix), ncol(pmatrix)]) {
random <- runif(1)
}
score <- as.vector(which(pmatrix > random, arr.ind = T)[1, ])
# scores is matrix c(home away)
score <- score - 1
score[3] <- NA
if (score[1] == score[2]) {
if (runif(1) < log5.OT.predictor(stats, home, away)) {
# Home Win
score[1] <- score[1] + 1
} else {
score[2] <- score[2] + 1
}
if (runif(1) < 0.56) {
score[3] <- "SO"
} else {
score[3] <- "OT"
}
}
return(score)
}
Finally, one more round of predictions:
t(replicate(10, predictOneGame(pmatrix, home, away)))
## [,1] [,2] [,3]
## [1,] 7 0 NA
## [2,] 4 1 NA
## [3,] 7 1 NA
## [4,] 2 0 NA
## [5,] 7 1 NA
## [6,] 3 4 NA
## [7,] 5 1 NA
## [8,] 3 2 NA
## [9,] 5 2 NA
## [10,] 3 2 NA
Next time we’ll look at predicting a whole sesason.