In earlier posts, I discussed the final end-of-year ratings from the 2013-14 data and the selection of the first round at-large bids for the NDT. Another way to evaluate the system is to see how well the ratings at the end of the regular season are able to predict round results at nationals. Because of the gradual accumulation of data over the course of the year, this should be the point at which the ratings can claim their strongest accuracy.
Prior to each matchup, a prediction is made on the probable outcome based on each team's rating. Without going into too much detail concerning the formula, which can be found here, a ratings difference of 100 suggests a 64% chance of winning for the favorite. A difference of 200 translates into about 76%, and a 400 point difference is about 91%. In addition, the ratings deviation is factored into the prediction. The results of the round are then scored (1 for a win, 0 for a loss, the fraction of the ballot count for a panel), and the difference between the predicted outcome and the actual outcome becomes the basis for the updated post-matchup ratings.
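To make those percentages concrete, here is a minimal sketch of the rating-difference-to-probability step. It assumes a standard Elo-style logistic curve on a 400-point scale, which happens to reproduce the 64%, 76%, and 91% figures quoted above. It deliberately ignores the ratings deviation, so it is not the full formula the ratings actually use, and the function name `win_probability` is my own invention for illustration.

```python
def win_probability(rating_diff: float) -> float:
    """Expected score for the favorite given a ratings difference.

    Assumption: a plain Elo-style logistic curve with a 400-point scale.
    The real formula also factors in the ratings deviation, which is
    left out of this sketch.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


if __name__ == "__main__":
    for diff in (100, 200, 400):
        print(f"difference of {diff}: {win_probability(diff):.0%}")
```

A difference of zero gives an even 50%, and a negative difference gives the underdog's chances.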
For example, say the system gives Team A a 73% chance of winning against Team B. If Team A wins on a 3-0, then they exceeded the prediction (100% > 73%), and Team A's rating will rise marginally. However, if Team A only wins by a ballot count of 2-1, they actually fell short of expectations (67% < 73%), and their rating will actually go down a hair. If, on the other hand, Team A loses all 3 ballots, then the ratings loss will be relatively larger than their gain for a win would have been (because the difference between 0% and 73% is greater than the difference between 100% and 73%).
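The Team A example can be worked through in a few lines. This sketch only computes the gap between the actual ballot share and the prediction, which sets the direction and relative size of the ratings change. The size of the real update also depends on a scaling factor tied to the ratings deviation, which is not shown here. The names `ballot_score` and `surprise` are hypothetical.

```python
def ballot_score(ballots_won: int, panel_size: int) -> float:
    """Score a round as the fraction of ballots won (1.0 for a sweep)."""
    return ballots_won / panel_size


def surprise(predicted: float, ballots_won: int, panel_size: int = 3) -> float:
    """Actual score minus predicted score.

    Positive means the team beat expectations (rating goes up),
    negative means it fell short (rating goes down).
    """
    return ballot_score(ballots_won, panel_size) - predicted


if __name__ == "__main__":
    predicted = 0.73  # Team A's predicted chance from the example
    for won in (3, 2, 0):
        gap = surprise(predicted, won)
        print(f"{won}-{3 - won} decision: {gap:+.2f}")
```

The 3-0 win comes out at +0.27, the 2-1 win at roughly -0.06, and the 0-3 loss at -0.73, which is why the loss costs more than the sweep gains.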
So one thing that we want to do is minimize the error in the round predictions. Obviously error can't be eliminated because upsets do occur (especially among those with fairly close ratings). Even if a favorite is 73% likely to win, that still means that the system thinks that they should lose 27% of the time. This effect is magnified the smaller the ratings difference. A 45% underdog should still win nearly half their debates against a 55% favorite.
Nevertheless, we should be able to gauge how well the ratings are working based on how large the difference is between their predictions and the actual results. Looking at the NDT gives some advantages: it's the one tournament where the results are not merely binary (1 or 0), because ballot counts create more of a spectrum (0, .33, .67, 1).
The mean absolute error for the predictions at the NDT was .253 for the weighted ratings and .257 for the unweighted ratings. That means that the predictions were on average within a ballot of the actual results. Below is a histogram showing the absolute values of the errors for all rounds with the weighted ratings. Lower is more accurate, higher is less accurate.
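For anyone who wants to run the same check on their own data, here is a rough sketch of the mean absolute error calculation and the conversion to ballots. The three sample rounds are made up purely for illustration and are not actual NDT results. The conversion assumes a three-judge panel, and the function names are hypothetical.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - result| across rounds, on a 0-to-1 scale."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def error_in_ballots(error: float, panel_size: int = 3) -> float:
    """Convert a 0-to-1 error into ballots (assumes a three-judge panel)."""
    return error * panel_size


# Made-up rounds for illustration only, not actual NDT data.
example_predictions = [0.73, 0.55, 0.91]
example_results = [1.0, 1 / 3, 2 / 3]  # a 3-0 win, a 1-2 loss, a 2-1 win
example_mae = mean_absolute_error(example_predictions, example_results)

if __name__ == "__main__":
    print(f"example MAE: {example_mae:.3f}")
    print(f"reported weighted MAE in ballots: {error_in_ballots(0.253):.2f}")
```

Run on the reported weighted figure of .253, the conversion gives about 0.76 of a ballot, which is where "within a ballot" comes from.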
Since we're dealing with judge panels at the NDT, we can be a bit more precise than what's allowed by the binary win/loss structure of most tournaments. It's possible to crudely translate the degree of error into more concrete terms as ballot counts. In that case, we get these rates of error:
0 - 0.5 ballots: 40.1%
0.5 - 1 ballots: 29.3%
1 - 1.5 ballots: 19.9%
1.5 - 2 ballots: 8.7%
2 - 2.5 ballots: 2.2%
2.5 - 3 ballots: 0%
I feel that these are pretty good numbers. It means that the ratings picked roughly 90% of rounds "correctly" (meaning an error of under 0.500, i.e. less than 1.5 ballots), with 70% being under a ballot away. Additionally, as I look at the data, the vast majority of the instances of 1.5 - 2 ballot error were occasions where the ratings predicted fairly evenly matched opponents but the slight underdog ended up winning on a 3-0.
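The ballot buckets above can be sketched as follows. This is an illustration of the idea rather than the exact procedure I used: it assumes three-judge panels and half-ballot buckets, and the function name `ballot_bucket` is hypothetical. The `reported_shares` list simply repeats the percentages listed above so the cumulative figures can be checked.

```python
def ballot_bucket(predicted: float, actual: float,
                  panel_size: int = 3, width: float = 0.5):
    """Return the (low, high) ballot bucket that a round's error falls into."""
    ballots_off = abs(predicted - actual) * panel_size
    last_index = int(panel_size / width) - 1
    # Clamp so a maximal error (a full panel off) lands in the top bucket.
    index = min(int(ballots_off // width), last_index)
    low = index * width
    return (low, low + width)


# Percentages from the list above, in order from 0-0.5 up to 2.5-3 ballots.
reported_shares = [40.1, 29.3, 19.9, 8.7, 2.2, 0.0]

if __name__ == "__main__":
    # A 45% underdog winning 3-0 is off by 1.65 ballots.
    print(ballot_bucket(0.45, 1.0))
    print(f"under 1 ballot: {sum(reported_shares[:2]):.1f}%")
    print(f"under 1.5 ballots: {sum(reported_shares[:3]):.1f}%")
```

The slight-underdog sweep lands in the 1.5 - 2 bucket, and the cumulative shares come to 69.4% and 89.3%, which are the "70%" and "roughly 90%" figures.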
It's hard to know exactly how "good" these predictions are in some abstract sense. Since we (wisely) don't gamble on debate rounds, we don't really have a way to measure what the community consensus on round odds would be. As a result, there's no comparison point against which to evaluate the predictions. I'm actually somewhat surprised at how good the predictions ended up, especially considering how power-matching is intended to pair teams with like ability.
It bears repeating that no prediction system can be 100% accurate (or even close) because there is variation in debate results with various degrees of upsets. Indeed, it would be somewhat sad if we could develop a prediction system that were too accurate because it would imply that round results are predestined. Furthermore, even a "correct" pick will be calculated as having some degree of error. Hypothetically, the ratings could pick a team as a 95% favorite, but when they win there will still be a 5% "error." In fact, it is this error that makes the ratings run because the degree of error is what determines a team's ratings change after a victory or loss.