Evaluating the Models
Note: This is earlier work I did (last winter/spring), so some info may seem dated at the time of posting. I've used data files current to then.
Last post we predicted the results of the remainder of the season. It's exciting to know that your favourite team might make the playoffs, but how can you trust the model? We haven't performed any validation so far. Maybe all the work we've done predicts worse than a 50/50 split of winners? Let's dive in and find out.
We'll start by evaluating the models with 'RPS' (Rank Probability Score). There's a great discussion of why this is important by Anthony Constantinou that you can read here (pdf). The great thing is, the RPS calculation is already available in R from the verification package.
The RPS formula takes a matrix of probabilities and a result, and returns the score based on how close the model was. For example, if team A (away) was given a 0.6 chance of winning, team B (home) a 0.25, and a draw of 0.15, then the probability set is {A,D,H} = {0.6,0.15,0.25}. If Team B wins, then the result is 3, the third column. The formula for RPS is:
$$\mathrm{RPS} = \frac{1}{r-1}\sum_{i=1}^{r-1}\left(\sum_{j=1}^{i}\left(p_j - e_j\right)\right)^2$$

where r is the number of potential outcomes (in our case, 3), $p_j$ is the probability of the outcome at position j, and $e_j$ is the actual outcome (0 or 1) at that position.
Our example, with the inner sums written out cumulatively, looks like this:

Cumulative $p_j$ | Cumulative $e_j$ | RPS |
---|---|---|
{0.6, 0.75(, 1)} | {0, 0(, 1)} | 0.46125 |

Normally, the summation sets don't include the final 1; both cumulative sums always end at 1, so that term contributes nothing and is left implied, which is why the outer sum stops at r − 1.
A smaller RPS value is better. With the forecast above, a home win gives RPS = 0.46125 while an away win gives RPS = 0.11125, so we can say the model 'performed better' when the away win it favoured actually happened. There are more examples of performance in the above-linked paper by Constantinou.
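To make the arithmetic concrete, here's a quick check of the worked example in R, first by hand and then with rps() from the verification package:

```r
library(verification)

# Manual RPS for a single game, matching the formula above:
# cumulative forecast minus cumulative outcome, squared and averaged
rps_single <- function(p, outcome) {
  e <- rep(0, length(p))
  e[outcome] <- 1
  cum_diff <- cumsum(p) - cumsum(e)
  sum(cum_diff[-length(p)]^2) / (length(p) - 1)
}

rps_single(c(0.6, 0.15, 0.25), 3)  # home win: 0.46125
rps_single(c(0.6, 0.15, 0.25), 1)  # away win: 0.11125

# The packaged version: obs is the observed column index, pred is a matrix
# with one row of {A, D, H} probabilities per game; rps() returns the mean
# score over all rows
rps(obs = 3, pred = matrix(c(0.6, 0.15, 0.25), nrow = 1))$rps  # 0.46125
```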
To evaluate each model, let's compare the results of the 2014-2015 season to our predictions. We'll start by training the model with all the games up to December 31 (approximately the first half of the season), and compare the predicted results to the actuals from then to the end of the season. We can experimentally determine a better value for ξ (see the time-weight dependence post) than the 0.005 we've tossed around.
To start, let's build 2015 testing data (a schedule), 2015 training data, 2015 known results, and a 2005-2014 data set for longer training.
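Here's a minimal sketch of those splits, assuming one combined data frame nhl_all with Date, AwayTeam, HomeTeam, AG (away goals), and HG (home goals) columns; the column names are illustrative, not the originals:

```r
# Hypothetical data frame `nhl_all`: all games from 2005-06 through 2014-15
nhl_all$Date <- as.Date(nhl_all$Date)
cutoff <- as.Date("2014-12-31")

# Short training set: the 2014-15 games up to the cutoff
train2015 <- subset(nhl_all, Date >= as.Date("2014-10-01") & Date <= cutoff)

# Test schedule and its known results: the rest of the 2014-15 season
test2015 <- subset(nhl_all, Date > cutoff)
results2015 <- test2015[, c("AwayTeam", "HomeTeam", "AG", "HG")]

# Long training set: everything from 2005 up to the cutoff
trainLong <- subset(nhl_all, Date <= cutoff)
```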
Now, we can start running the optimizer on our data sets to get our Dixon-Coles parameters. Recall that these are long processes.
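As a rough sketch, with DCmodelData() and DCoptimFn() standing in for the data-prep and time-weighted likelihood functions built up in the earlier posts (treat the names as placeholders), the fits look something like this:

```r
# Placeholder names: DCmodelData() shapes the games for the optimizer and
# DCoptimFn() is the Dixon-Coles negative log-likelihood with time weighting
dcd_short <- DCmodelData(train2015)
res_short <- optim(par = dcd_short$init, fn = DCoptimFn, DCm = dcd_short,
                   xi = 0.005, method = "BFGS")

dcd_long <- DCmodelData(trainLong)
res_long <- optim(par = dcd_long$init, fn = DCoptimFn, DCm = dcd_long,
                  xi = 0.005, method = "BFGS")
```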
Now, with our fitted parameter sets, we can evaluate the performance of each against the actual data. Instead of predicting a score, all we need is the probability of each of A, D, and H for each game. We can recycle earlier code with a different return value to get this information:
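Here's a sketch of that function; the parameter naming and the tau() low-score correction follow the convention from the earlier posts, so adjust to match your own fitting code:

```r
# Sketch: {A, D, H} probabilities from a fitted parameter set `res` (from optim)
predictADH <- function(res, home, away, maxgoal = 8) {
  # Expected goals for the home and away sides
  lambda <- exp(res$par["HOME"] + res$par[paste("Attack", home, sep = ".")] +
                res$par[paste("Defence", away, sep = ".")])
  mu <- exp(res$par[paste("Attack", away, sep = ".")] +
            res$par[paste("Defence", home, sep = ".")])

  # Score matrix: rows are home goals 0..maxgoal, columns are away goals
  probs <- dpois(0:maxgoal, lambda) %*% t(dpois(0:maxgoal, mu))

  # tau() adjusts the four low-scoring cells for the Dixon-Coles dependence
  probs[1:2, 1:2] <- probs[1:2, 1:2] *
    matrix(tau(c(0, 1, 0, 1), c(0, 0, 1, 1), lambda, mu, res$par["RHO"]),
           nrow = 2)

  # Away wins above the diagonal, draws on it, home wins below it
  c(A = sum(probs[upper.tri(probs)]),
    D = sum(diag(probs)),
    H = sum(probs[lower.tri(probs)]))
}
```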
We can get our ADH probabilities for the test data quite simply. We assemble them as a matrix to make them ready to feed into the rps function:
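Something like this, reusing the sketched names from above and coding the observed result as a column index in {A, D, H} order:

```r
library(verification)

# One row of {A, D, H} probabilities per game in the test schedule
probs_short <- t(mapply(predictADH,
                        home = as.character(test2015$HomeTeam),
                        away = as.character(test2015$AwayTeam),
                        MoreArgs = list(res = res_short)))

# Observed outcome as a column index: 1 = away win, 2 = draw, 3 = home win
obs <- with(results2015, ifelse(HG > AG, 3, ifelse(HG == AG, 2, 1)))

rps(obs = obs, pred = probs_short)$rps
```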
Now, let's look at the RPS results to see how they compare. With only 2015 data in the model, our RPS value is 0.233542, and with all the available data, it's 0.2334388. Not a huge difference. But let's optimize the ξ value using RPS.
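A sketch of that search, reusing the pieces above: each candidate ξ means a full refit, which is why this takes so long.

```r
# Refit the model at a candidate xi, then score its test-set predictions
rps_for_xi <- function(xi, dcd, schedule, obs) {
  res <- optim(par = dcd$init, fn = DCoptimFn, DCm = dcd, xi = xi,
               method = "BFGS")
  pred <- t(mapply(predictADH,
                   home = as.character(schedule$HomeTeam),
                   away = as.character(schedule$AwayTeam),
                   MoreArgs = list(res = res)))
  rps(obs = obs, pred = pred)$rps
}

# One-dimensional search over a plausible range of xi values
optimize(rps_for_xi, interval = c(0, 0.05),
         dcd = dcd_short, schedule = test2015, obs = obs)
```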
So we see that for the 2015 data only, we get an optimal ξ value of 0.0184461, with an RPS of 0.2334589. We can do the optimization for the full data as well. You'll have to believe me when I say that the optimal ξ value takes a lot longer to find, but it's 0.0191464, giving us an RPS of 0.2407655.
Let's compare some more methods, to see how well this model really stacks up. We don't yet know whether these RPS values are good or bad, so let's use some dummy data. If the home team wins every game, our odds are {0,0,1}. If we use the '50-50' win method, our odds for each game are {0.5,0,0.5}. If we account for OT/SO results, we'll have approximately {0.4,0.2,0.4} for each game. The RPS for each of these is calculated, and one final matrix of randomly generated proportions is tested too, as sketched below.
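A sketch of those baselines, reusing obs and probs_short from above:

```r
# Constant baseline forecasts, one row of {A, D, H} per test game
n <- nrow(probs_short)

home_always <- matrix(rep(c(0, 0, 1), each = n), nrow = n)       # {0, 0, 1}
even_noOT   <- matrix(rep(c(0.5, 0, 0.5), each = n), nrow = n)   # {0.5, 0, 0.5}
even_OT     <- matrix(rep(c(0.4, 0.2, 0.4), each = n), nrow = n) # {0.4, 0.2, 0.4}

# Random proportions, normalized so each row sums to 1
rand_raw  <- matrix(runif(3 * n), nrow = n)
rand_odds <- rand_raw / rowSums(rand_raw)

sapply(list(home_always, even_noOT, even_OT, rand_odds),
       function(p) rps(obs = obs, pred = p)$rps)
```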
The distribution of each (Away, Draw, Home) is shown in this plot:
This gives us the following results:
Method | RPS |
---|---|
Short Training | 0.233542 |
Long Training | 0.2334388 |
Short Training (time weighted) | 0.2334589 |
Long Training (time weighted) | 0.2407655 |
Home Wins Always | 0.4836795 |
Even Odds (No OT) | 0.25 |
Even Odds (Plus OT) | 0.2365579 |
Randomly Generated Odds | 0.2717466 |
Looks like there's still some work to do before these models are convincingly better than a weighted coin flip!