For next season there will be fairly significant changes to the debate ratings algorithm that will, I believe, mark substantial improvements both in the accuracy of its predictions and in how well it meets the "eye test." This is a long post, so I'll summarize the short and sweet of it here. If you're interested in the details, there is a lot to read below. I go into some depth concerning rationale and supporting data for the decisions that I've made.
The ratings were previously based on the Glicko algorithm developed by Mark Glickman (which was itself inspired by the Elo rating system developed for chess competition by Arpad Elo). In the upcoming season, the debate ratings will instead shift to an adaptation of the TrueSkill algorithm, which was developed by Microsoft Research. Some of you (who should feel guilty for not cutting enough cards) may be familiar with the TrueSkill system as the basis of the algorithm Microsoft uses for matchmaking in its Xbox video games. The logic undergirding TrueSkill is very similar to the Glicko rating system -- notably, they both use Bayesian inference methods and assume that an individual skill rating can be represented as a normal distribution with mean and deviation -- but they use different mathematical tools to get there. The debate ratings are my own adaptation that attempts to apply these mathematical tools to the peculiarities of college policy debate.
The reason for the change is simple: TrueSkill does a better job given the specific needs of the policy debate season. The bulk of this post will attempt to unpack the reasons why this is true. For now, it can just be said that TrueSkill results in ratings that more accurately predict actual round results as well as more closely reflect the wisdom of first and second round at-large bid votes.
When Ratings are "Too Right"
I think if I were to summarize the fundamental limitation of the Glicko ratings, it would be that they are too conservative. The system has a tendency to underestimate the differences between teams and make predictions that understate the favorite's win probability. To put it in more concrete terms, the ratings might suggest that one team is a 2:1 favorite over another when, in reality, they are closer to a 3:1 favorite -- or in a more distorting scenario, a 10:1 favorite when the truth is actually that it would be a miracle for the underdog to win even one out of fifty.
The somewhat counter-intuitive consequence of this is that the ratings can be less accurate even as they are better at avoiding making big mistakes. The reason for this is that the accuracy of the ratings depends not just on them picking the right winners but also their ability to predict the rate at which the favorite will also lose. In other words, for example, in order for the ratings to be accurate in the aggregate, the underdog should win one out of every five debates in which the odds are 4:1 against them. If favorites substantially outperform their expected record, then that indicates that the ratings are not making good predictions because they are not adequately spread out.
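To make the arithmetic concrete, here is a minimal Python sketch that converts odds into the favorite's win probability and into the number of upsets those odds imply. The function names are hypothetical and are not taken from the ratings code.

```python
def odds_to_probability(favorite, underdog=1.0):
    """Convert odds such as 4:1 into the favorite's win probability."""
    return favorite / (favorite + underdog)


def expected_upsets(favorite, underdog, rounds):
    """Number of underdog wins that the odds imply over a set of rounds."""
    return rounds * (1.0 - odds_to_probability(favorite, underdog))


# A 4:1 favorite should win 80% of the time, so the underdog
# should take roughly one round in five.
print(odds_to_probability(4))
print(expected_upsets(4, 1, 50))
```

If the underdogs in such rounds win noticeably fewer than one in five, the stated odds were too conservative.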
Here is a graph depicting the retrodictions made by the final 2015-16 Glicko ratings. The x-axis is the expected win probability of the favorite, and the y-axis is the rate at which the ratings chose the correct winner. Ideally, we would want to see a perfectly diagonal line (i.e., for all the times the ratings suggest a 75% favorite, the favorite should win 75% of the time). Instead, what we find is a curve, indicating that favorites are winning a fair deal more than the ratings think they "should" be. For example, those that the ratings expect to be 75% favorites are actually winning over 85% of their rounds.
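For readers who want to build this kind of calibration curve themselves, the following Python sketch groups rounds by the favorite's predicted win probability and reports how often the favorite actually won in each group. The input format, the function name, and the 5% bin width are assumptions made for illustration, not details of the actual ratings code.

```python
from collections import defaultdict


def calibration_table(predictions, bin_width=0.05):
    """Group rounds by the favorite's predicted win probability.

    predictions: iterable of (favorite_probability, favorite_won) pairs,
    with favorite_probability between 0.5 and 1.0.
    Returns {bin_lower_edge: (rounds_in_bin, observed_win_rate)}.
    """
    n_bins = round(0.5 / bin_width)
    counts = defaultdict(lambda: [0, 0])  # bin index -> [rounds, favorite wins]
    for prob, won in predictions:
        # The small epsilon guards against floating point landing just below an edge.
        index = min(int((prob - 0.5) / bin_width + 1e-9), n_bins - 1)
        counts[index][0] += 1
        counts[index][1] += 1 if won else 0
    return {
        round(0.5 + index * bin_width, 10): (total, wins / total)
        for index, (total, wins) in sorted(counts.items())
    }


# Twenty rounds rated as 75% favorites, of which the favorite won 17.
sample = [(0.75, True)] * 17 + [(0.75, False)] * 3
for lower_edge, (total, rate) in calibration_table(sample).items():
    print(lower_edge, total, rate)
```

A well calibrated system produces observed win rates close to each bin's predicted probability. A bin where the observed rate sits well above the prediction is the curve described above.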
Why does this discrepancy matter? The biggest reason is that if the predictions are too conservative, then the favorite may get "too much credit" for the win (or not enough blame for the loss). Since a team's rating goes up or down based on the difference between what the ratings expect to happen and what actually happens, an erroneous pre-round expectation leads to an erroneous post-round ratings update. The place where this appears to have created the biggest problem is situations where above average teams have been able to feast on lesser competition. At a typical tournament, power matching will help to ensure that a team will have to face off against an appropriate level of competition. However, this check might disappear entirely if a solid national level competitor travels to a smaller regional tournament where they are clearly at the top. As a result, after exceptional runs through regional competition, some teams have seen extraordinary bumps to their ratings that were perhaps unearned.
The graph for the new algorithm looks much better:
It's not a perfectly diagonal line, but it is much closer (75% favorites are winning around 77.5% of their rounds). TrueSkill accomplishes this by more effectively spreading the ratings out. For example, out of a total of over 6000 rounds in 2015-16, there were only a bit over 400 that the Glicko ratings considered to have heavy favorites with over 9:1 odds. By contrast, there were nearly 2000 rounds that TrueSkill considered to have these odds (parenthetically, I'm not sure what moral we should take from the fact that somewhere between a quarter and a third of all debates are virtually decided before they even happen).
By giving more spread between teams, the new algorithm will make it much less likely for a team to disproportionately benefit from defeating lesser competition.
The Problem of Distinguishing Prelims from Elims
One of the major concerns about the previous iteration of the ratings was that the system failed to distinguish between results that happen in the prelims versus the elims. I don't care to rehearse all of the arguments made for why elim rounds are different from prelim rounds. In the past, I have expressed some reservation about these arguments not because I necessarily disagree with their logic, but rather because there are a number of mathematical as well as conceptual questions that have to be answered before moving forward.
The first, and maybe most important, question concerns what end we're trying to accomplish by weighing elims differently. Are we actually looking for a quantitative statistical measure, or do we just want a rating that validates our qualitative impressions of what it means to win an elim round or tournament? To be clear, I’m not trying to give any value to the terms qualitative/quantitative or treat them as a mutually exclusive binary. I just think it’s important to know what we want. Maybe another way of saying this, especially apropos of elim success, is: where do you come down on the Kobe/LeBron debate? Is Kobe the greater one because of the Ringzzz and the clutchness and the assassin’s mentality, or LeBron because of PER, BPM, adjusted +/-, etc? I had an exchange with Strauss recently and he argued that no amount of NBA finals losses could ever add up to a single championship. We could extend the question to debate. Does any amount of prelim wins add up to an elim win? A tournament championship? If the goal is to value tournament championships then you don’t need fancy math to figure that out. Just count ‘em up. It’s easy.
Assuming that we are actually looking for a quantitative measurement, that raises a couple of follow-up issues.
For any kind of statistical quantification, an "a priori" decision has to be made concerning what is being measured -- the referent that attaches the stat to some meaningful part of reality. For the current algorithm, the referent is the ballot. The quality of the ratings is measured against how well it predicts/retrodicts ballots. Any tweaks can be evaluated based on whether they successfully increase the accuracy of the predictions.
Without a specific and measurable object against which to measure, a stat runs the risk of becoming arbitrary. This is why I have big objections to any kind of stat that assigns arbitrary or indiscriminate weight to specific rounds or "elim depth" (“finals at Wake is 100 points, quarters at GSU is 20, octas at Navy is 5, round 6 at Emporia is 1, etc”). This is the worst kind of stat: a qualitative evaluation (which there’s nothing wrong with on its own terms) masquerading as “hard” numerical quantification.
Given the need for a referent, there are a few ways of going about differentiating elims that I can think of:
1. Keep the referent the same (the ballot) but weigh elims differently. This would maybe be the easiest to implement in terms of the inputs, but there would be a potentially problematic elision that happens. See, the trick of the ratings is that the input and the output are actually basically the same thing. You input ballot results, and what you get out is a number that serves as a predictor of … ballot results. If you weigh elims differently, then you are actually now inputting two different variables. Not necessarily a huge problem unless the new variable distorts the correspondence of the output to the referent (i.e. makes it worse at predicting ballot results).
2. Change the referent (and also weigh elims differently). It would potentially be possible to create a rating with the express purpose of predicting elim wins rather than overall wins. However, we would still need to be clear about what specifically we want to predict. Tournament championships? All elim wins? At all tournaments or only the majors (and how do you define the majors)? Just the NDT? The big obstacle this would run up against is sample size (both of inputs and of results against which to test retrodictions). While a handful of teams may get 20ish regular season elim rounds, even good teams will have far fewer (there were first round bids with fewer than 10). Then you have the vast vast majority of people who are lucky to see one elim round (especially at a major). At its extreme, it is possible that this would be a stat that would only even hope to be statistically meaningful for a handful of teams (and even for those I would still have concerns about sample validity).
3. Running parallel overall and elim predictors would certainly be possible. However, beyond the fact that it would still have to address the elim sample size problem, my other concern here is ethical/political. Many (perhaps most) of the people that have contacted me to give support for doing the ratings have been from teams that are not at the very top. Maybe in some ways this is not surprising because those at the top already receive a lot of recognition. The ratings are one form of evidence of success for teams that otherwise may not receive a ton of it. It is not hard to predict that if one rating were designed expressly for the top 5-10 (or even 25) teams that the other would be devalued.
Nevertheless, even given my concerns, I do still think there is potential value in figuring out an effective way to weigh elims differently, and I think that I have discovered one that, while not perfect, does go a long way to help address the objections. The primary determinant of the ratings update algorithm will still be the quality of a team's opponent, but it is possible to apply a multiplier to the calculation for elim debates.
Below is a table that uses data from the past four seasons: a total of over 28,000 rounds, including about 2200 elims. There's a lot of information to process here, so I'll try to simplify it. Each row is a different iteration of the ratings using escalating elim win multipliers (EWM). I've also included the old Glicko rating as a point of comparison. The column boxes are four different metrics by which the accuracy of the ratings can be evaluated: 1) how well the final ratings retrodict all debate rounds from the year, 2) how well they retrodict only elimination rounds, 3) how well ratings that use only data through the holiday swing tournaments can predict all results after the swings, and 4) how well they predict only post-swing elim debates. "Correct" is the percentage that the ratings picked the right winner, MAE is the mean absolute degree of error, and MSE is the mean squared degree of error. Without going into great detail, MSE is different from MAE in that it magnifies the consequences of big misses.
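As a rough illustration of the three metrics, here is a Python sketch. It assumes that each prediction is a win probability for one side and that the error for a round is the gap between that probability and the result coded as 1 for a win and 0 for a loss. That coding, along with the function name, is an assumption for illustration and may differ from how the table was actually computed.

```python
def accuracy_metrics(predictions):
    """Return (correct, mae, mse) for a list of (win_probability, won) pairs.

    win_probability is the predicted chance that a given team wins;
    won is True if that team actually won.
    """
    n = len(predictions)
    correct = 0.0
    abs_errors = []
    for prob, won in predictions:
        outcome = 1.0 if won else 0.0
        if prob == 0.5:
            correct += 0.5  # a coin-flip prediction gets half credit
        elif (prob > 0.5) == won:
            correct += 1.0
        abs_errors.append(abs(outcome - prob))
    mae = sum(abs_errors) / n
    mse = sum(e * e for e in abs_errors) / n
    return correct / n, mae, mse


# One confident correct pick and one modest miss.
print(accuracy_metrics([(0.9, True), (0.6, False)]))
```

Because MSE squares each error, a single confident miss (say, a 95% favorite losing) costs far more than several small ones, which is the sense in which it magnifies big misses.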
Blue is good. Red is not so good.
One thing is immediately apparent: every version of TrueSkill performs substantially better than Glicko by just about every metric.
The second thing that sticks out is that there is no easy and direct relationship between weighing elims and helping or hurting the accuracy of the ratings. It depends on which metric you prioritize.
I want to briefly unpack what is happening in a couple of the sections of the table, especially in the Elim Retrodictions portion. While the Elim Retrodictions box provides some important information, there is a risk of imputing too much significance to its findings. Because of limitations in the sample and the fact that these are retrodictions (rather than predictions), there is a risk of overfitting the model to the peculiarities, randomness and noise of a small set of past results rather than providing a generalized model with predictive power. The stark contrast between the Elim Retrodictions and the Post-Swing Elim Predictions boxes helps to highlight the problem. Both recognize that weighting elim wins helps the accuracy of the ratings in its evaluation of elim debates, but they disagree over which level of weighting is optimal. When we try to retrodict the past, a very high elim win multiplier works best, but when we try to predict the future, things become more complicated.
The numbers in the table are relatively abstract, so I'm going to take a detour that I think should help to show how these numbers play out in more concrete terms.
NDT At-large Bid Voting as a Measure of Validity
Strictly speaking, the ratings do not attempt to replicate/predict the at-large bids for the NDT. However, the bid voters probably represent the clearest proxy that we have for "conventional wisdom" or the judgment of the community, and we can use the votes as a way to externally check how well the ratings produce results that are in line with the expert judgment of human beings that are able to account for various contextual factors outside the scope of the information available to the ratings algorithm.
Here is a table that uses data from the last four years that shows how each iteration of the ratings with different levels of elim win multipliers compare. It contrasts the actual bid vote results against the hypothetical ballots that would be produced by the ratings algorithm. MAE is the average (mean) amount that the computer ballot deviated per team from the final aggregate vote. MAE Rnk is how this would have ranked among the human ballots. So, for example, over the last four years, the basic TrueSkill algorithm without any elim multiplier produced first round at-large bid ballots that rated teams, on average, within about 1.65 spots of where they actually ended up. This would average as the 12th best human voter per year. MSE and MSE Rnk are similar, except they are weighted to magnify the consequence of larger errors (big misses).
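The comparison can be sketched in a few lines of Python. This assumes each ballot is a mapping from team to rank and that the aggregate result is expressed the same way. The data layout and function names are hypothetical.

```python
def ballot_error(ballot, aggregate):
    """Return (mae, mse) of one ballot's ranks against the aggregate ranks."""
    diffs = [ballot[team] - aggregate[team] for team in aggregate]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    return mae, mse


def rank_among_voters(computer_error, human_errors):
    """Where the computer ballot would place among the human ballots (1 is best)."""
    return 1 + sum(1 for error in human_errors if error < computer_error)


aggregate = {"Team A": 1, "Team B": 2, "Team C": 3}
computer = {"Team A": 2, "Team B": 1, "Team C": 3}
print(ballot_error(computer, aggregate))
print(rank_among_voters(1.65, [1.0, 2.0, 1.5]))
```

Running `rank_among_voters` once with MAE values and once with MSE values gives the two rank columns.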
For comparison, I've also included the average errors of the human voters themselves at the top. Once again, blue is good, red is not so good.
This is where things get interesting. While none of the computer algorithm ballots have been as good as the average bid voter when it comes to First Round Bids, the distance separating them is not massive. In particular, the TrueSkill algorithm with an elim win multiplier of 3 would rank as just a little below the average bid vote. Over the last four years, it would ring in as, on average, the 9th best voter as measured by MSE. While this may not seem spectacular, it does mean that the ratings produce results that are well within the range of expert human judgment.
Perhaps more significantly, the ratings have actually been better than the average bid voter when it comes to Second Round Bid voting. At lower levels of elim win multiplier, the ratings would rank, on average, as around the 5th or 6th most accurate voter over the last four years. In fact, the TrueSkill ratings with an elim win multiplier of 3 would have produced a ballot for the 2015-16 season that would have resembled the final aggregate vote more closely than any of the human voters.
The other thing that is quickly apparent from the table is that higher levels of EWM produce results that are wildly divergent from the judgment of bid voters. One of the big reasons for this is that excessive weight on elim rounds will drastically magnify the recency effect of the ratings. Those who do well at the last tournament of the year will see a significant boost in their rating that can't be checked back by subsequent information.
Elim Weight Conclusion
The numbers have convinced me that it is possible to give added weight to elim debates within the parameters of the TrueSkill algorithm in a way that helps the ratings more closely reflect the collective common sense of the community without jeopardizing the accuracy of the system's predictions -- in fact, in some ways the predictions may be enhanced.
While there is no clear answer on exactly how much extra weight elimination rounds should receive, I have decided on a multiplier of 3 as the Goldilocks option. It seems to hit the sweet spot with regard to the judgment of bid voters, and while not perfect by any of the prediction accuracy metrics, it does manage to make the ratings more accurate by many measures. At the end of the 2016-17 season, I will integrate the new data and reevaluate to determine if a change should be made.
Evaluating Opponent Strength Early in the Season
There is one other major change in the ratings that may be just as important as the shift in the basic algorithm: a way of dealing with lack of information early in the season. The solution involves running the algorithm twice: once to form a data frame of provisional opponent ratings, and a second time to formulate each team's actual rating.
For both Glicko and TrueSkill (as well as Elo) rating systems, the difference in ratings between two opponents indicates the probability of the outcome of a debate between them. A small ratings difference indicates fairly evenly matched teams, while a large ratings difference suggests that one team is a heavy favorite. Team ratings go up or down based on the difference between the predicted outcome of the debate and the actual outcome. After the round, each team's ratings will be recalculated based on how they performed against expectations. So, if a team defeats an opponent that it was heavily expected to defeat, then its rating may barely move at all. But if an underdog overcomes the odds and wins a big upset, then their rating would move a much larger amount. Evenly matched opponents will experience changes somewhere in the middle. As a result, opponent strength is integrated from the beginning of the calculation. Wins over stronger opponents are worth more because there is a larger difference between actual outcome and predicted outcome.
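The logic is easiest to see in the plain Elo system, which is simpler than either Glicko or TrueSkill. The Python sketch below uses the standard Elo expected score formula and update rule. The K-factor of 32 and the sample ratings are arbitrary choices for illustration and have nothing to do with the scale of the debate ratings.

```python
def expected_score(rating, opponent_rating):
    """Standard Elo win probability implied by the ratings difference."""
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))


def elo_update(rating, opponent_rating, won, k=32.0):
    """Move the rating by K times (actual outcome - predicted outcome)."""
    actual = 1.0 if won else 0.0
    return rating + k * (actual - expected_score(rating, opponent_rating))


# A heavy favorite wins as expected: the rating barely moves.
print(elo_update(1800, 1400, won=True) - 1800)
# The underdog pulls off the upset: the rating moves a lot.
print(elo_update(1400, 1800, won=True) - 1400)
# Evenly matched teams land in the middle.
print(elo_update(1500, 1500, won=True) - 1500)
```

Glicko and TrueSkill add a deviation term on top of this, but the core idea of updating on the gap between prediction and result is the same.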
The difficulty arises at the beginning of the season when there is a lack of information to form a reliable rating. In order to formulate a prediction concerning the outcome of a given debate, the ratings need to be able to assess the strength of each team. If too few rounds have occurred, then the algorithm's prediction is far less reliable. This can be seen at the extreme before round one of the first tournament of the year when zero information is available to form a prior expectation. The ratings are a blank slate in this moment and incapable of distinguishing whether you are debating against college novices fresh out of their first interest meeting or last year's national champions.
In its previous iteration, the ratings relied on one very helpful tool to cope with this problem: deviation. Each team's rating distribution is defined by two parts: the mean of its skill distribution and the breadth of the variation in that skill distribution. In more basic stats terms, this can be understood as similar to a confidence interval. The algorithm expresses more confidence in a team's rating as its deviation goes down. It uses this confidence to weight how much a single debate round can influence a team's rating. A team with a large deviation will see their rating fluctuate rapidly (and all teams start the season with very large deviations), while a team with a low deviation will have the weight of their previous rounds prevent a new round result from having too much influence. Deviation goes down as you get more rounds.
Deviation helps the ratings cope with lack of information at the beginning of the season. The default is that each team begins the season with a very large deviation, which is basically the algorithm's attempt to acknowledge that it is not confident in the mean rating. Since deviation is used to weight the post-round ratings update such that those with large deviations experience larger changes, this allows such a team's rating to more quickly self-correct from earlier inaccurate predictions. Additionally, losses to teams with high deviations have less of an effect than to those with low deviations.
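Both effects can be seen in the standard Glicko-1 update for a single game, sketched below in Python from Glickman's published formulas. This is the textbook version on the usual 1500-point scale, not the adaptation that the debate ratings actually used, and the sample numbers are made up.

```python
import math

Q = math.log(10) / 400.0


def g(rd):
    """Discount factor that shrinks as the opponent's deviation grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q ** 2 * rd ** 2 / math.pi ** 2)


def glicko_update(rating, rd, opp_rating, opp_rd, score):
    """Single-game Glicko-1 update. score is 1.0 for a win, 0.0 for a loss."""
    g_opp = g(opp_rd)
    expected = 1.0 / (1.0 + 10 ** (-g_opp * (rating - opp_rating) / 400.0))
    # Information gained from this one game (the inverse of d squared).
    game_information = Q ** 2 * g_opp ** 2 * expected * (1.0 - expected)
    precision = 1.0 / rd ** 2 + game_information
    new_rating = rating + (Q / precision) * g_opp * (score - expected)
    new_rd = math.sqrt(1.0 / precision)
    return new_rating, new_rd


# Same win, but a team with a large deviation moves much further.
print(glicko_update(1500, 350, 1500, 50, 1.0))
print(glicko_update(1500, 50, 1500, 50, 1.0))
# Same loss, but it costs less when the opponent's own rating is uncertain.
print(glicko_update(1500, 50, 1500, 350, 0.0))
print(glicko_update(1500, 50, 1500, 50, 0.0))
```

Note that the returned deviation is always smaller than the one passed in, which is the sense in which deviation goes down as a team gets more rounds.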
While deviation helps to significantly mitigate the effect of limited information at the beginning of the season, it does not entirely resolve the problem. The effects of erroneous predictions are substantially evened out over time, but they are never completely eliminated and can add up. For an individual team the effect will be quite small, often negligible. However, the recent trend toward segregation in the early travel schedule magnifies the problem, especially if there is a disparity in the strength of competition at the different tournaments. If there is no prior information on the teams, the ratings are unable to distinguish between an undefeated team at one tournament versus another.
The current iteration of the ratings relies on the eventual merging of the segregated pools of debaters to even things out over time. Given enough time and intermingling of teams, it would. Unfortunately, the debate season is a closed cycle, and the ratings would be helped if they could accelerate the process.
The solution to this problem is actually relatively simple. If the problem is a lack of information to form a reliable rating for assessing how good your opponent is, then what we need to do is give the algorithm more information. One way to do this would be to go into the past, using results from previous seasons to form an estimate of the team's skill. However, beyond the fact that this doesn't address the lack of information on first year debaters or the complexities of new partnerships, I find this solution undesirable because it also violates the singularity of each season.
Instead, what we can do is use information from the future to gain a more accurate picture of opponent quality. It is possible to use results from subsequent rounds to form a better estimate of how good a given opponent is. What the new ratings do is effectively run the ratings algorithm twice. On the first pass, it creates a provisional rating for each team that uses all available information -- for example, when I update the ratings in January after the swing tournaments, it will use all rounds from the beginning of the season through those tournaments. On the second pass, it will use those provisional ratings in its predictions to estimate opponent strength until such a time as that opponent has a sufficiently reliable rating.
To be clear, this does not involve double counting. The provisional rating is only ever used to evaluate how strong an opponent is. The second pass starts each team's actual rating from scratch. When Team A debates Team B in round one of the season opener, the ratings create a separate prediction for each side. One prediction will be between the ratings for a blank slate Team A versus a reliable Team B; the other between a blank slate Team B and a reliable Team A. The first will be used to update Team A's rating, the latter to update Team B's rating. The algorithm eventually stops using a team's provisional rating once that team's actual rating becomes reliable enough (i.e., its deviation becomes small enough, a length of time that varies, but is usually reached in the neighborhood of 25 rounds).
Other Small Changes
There are a couple of other small changes that will have some effect on how the rankings are calculated.
The first concerns how the final ratings are determined by subtracting a team's deviation from their rating to produce an "adjusted rating." The original reason for doing this is that it gives a more "confident" rating by adjusting downward those teams that we have less data about. In effect, it says that we are confident that a team is "at least" as good as their adjusted rating. If two teams have about the same mean rating but one has significantly fewer rounds than the other, then we should be less confident that their rating is accurate.
While helpful to weed out teams with high deviations, there is a limit to the usefulness of this procedure. When most regularly travelling teams end the season with somewhere between 80 and 100 rounds, it is somewhat silly to use deviation as a tool to delineate between them. This past year, there were a few examples even in the top 25 where a lower rated team was able to jump a higher rated team merely because they had a few more debates under their belts.
In the future, the ratings will continue to use deviation as a tool to adjust the ratings of teams, but it will stop making delineations once teams reach a certain threshold. This threshold will be calculated as the median of the 100 smallest deviations.
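One way to read this rule is sketched below in Python: compute the threshold as the median of the 100 smallest deviations, then treat any deviation below the threshold as if it were equal to the threshold before subtracting. Flooring the deviation in this way is my interpretation of "stop making delineations," and the names and sample numbers are hypothetical.

```python
from statistics import median


def deviation_threshold(deviations, count=100):
    """Median of the `count` smallest deviations."""
    return median(sorted(deviations)[:count])


def adjusted_ratings(teams, count=100):
    """teams maps name -> (mean rating, deviation).

    Returns name -> mean minus deviation, where deviations below the
    threshold are raised to it so they no longer separate teams.
    """
    threshold = deviation_threshold([dev for _, dev in teams.values()], count)
    return {
        name: mean - max(dev, threshold)
        for name, (mean, dev) in teams.items()
    }


teams = {
    "Team A": (30.0, 1.0),  # slightly lower mean, a few more rounds
    "Team B": (30.1, 1.2),  # slightly higher mean, a few fewer rounds
    "Team C": (25.0, 5.0),  # far fewer rounds, still adjusted down heavily
}
print(adjusted_ratings(teams))
```

With a plain subtraction, Team A would jump Team B on the strength of a few extra rounds. With the threshold applied, Team B stays ahead while Team C is still penalized for its large deviation.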
A second change is that the ratings will no longer attempt to model individual debaters with multiple partnerships. Instead, it will treat each two person team as a discrete unit. The obstacles to being able to model multiple partnerships are just too large, primarily because we just don't collect the kind of data that would make it possible. How much is each partner responsible for a win? Does this question even make sense to ask? We all know that a great debater can carry a poor partner to a lot of wins. But we also know that even a good debater will lose rounds that they otherwise wouldn't have if they travel with a partner with lesser skill.
I know that this may disadvantage some debaters who are forced to frequently change partners, but it would be generous to even say that my previous attempts to solve the problem looked like trying to duct tape a windshield on. It kinda sorta worked, but mostly by luck, and even still had the effect of perhaps unfairly harming the ratings of some debaters.
Finally, I have removed the eigenvector centrality component of the previous system. This was originally a way to ensure that a team possessed a set of rounds that were adequately integrated into the larger community pool. TrueSkill doesn't need it.
I've attached a copy of the R code that I use to run the algorithm. I make no claims to being a good coder. What little I know is self-taught. It's slow, but it gets the job done. XML files of tournament results are available for download from Tabroom by using their API.
Resources concerning TrueSkill can be found at Microsoft Research. There is a great summary here, and a really good in-depth explanation of the mathematical principles at work has been written by Jeff Moser.