The ratings are back! I have my +resist to fire eq ready to go. Usually I would list 100 teams, but limitations in the data mean that I am only listing 50 for now.
If you feel strongly that you belong on this list but don't see your name, it may simply be that the ratings don't have enough data on you. The deviation for your rating has to be below 2.2 to be listed (which roughly amounts to around 18 to 20 rounds). I count somewhere around 25 teams that might have a good enough rating but lack the number of rounds necessary to give the system enough confidence.
If you feel strongly that some teams are not correctly ranked, consider:
There have been significant changes to the ratings algorithm from last year. For a detailed description, please follow this link. The changes can be briefly summarized as follows:
For a sense of what the ratings number actually means:
If you are attentive to the rating number - not just the ranking - it will help you see that even large differences in ranking might not amount to much difference between teams. For example, the difference between the 13th ranked team and the 21st ranked team is only about 1 point, meaning that a debate between them would be treated as little more than a coin flip.
If you follow the link given above, you will find some graphics showing how the new ratings algorithm has performed when using data from the past four years. While the ratings don't explicitly set out to predict at-large bids, they would have produced ballots well within the range of error of the actual human voters. They would have performed as a slightly below average voter for first round bids, but would have actually been an above average voter for second round bids.
For next season there will be fairly significant changes to the debate ratings algorithm that will, I believe, mark substantial improvements both in terms of the accuracy of its predictions as well as how well it meets the "eye test." This is a long post, so I'll summarize the short and sweet of it here. If you're interested in the details, there is a lot to read below. I go into some depth concerning rationale and supporting data for the decisions that I've made.
The ratings were previously based on the Glicko algorithm developed by Mark Glickman (which was itself inspired by the Elo rating system developed for chess competition by Arpad Elo). In the upcoming season, the debate ratings will instead shift to an adaptation of the TrueSkill algorithm, which was developed by Microsoft Research. Some of you (who should feel guilty for not cutting enough cards) may be familiar with the TrueSkill system as the basis of the algorithm Microsoft uses for matchmaking in its Xbox video games. The logic undergirding TrueSkill is very similar to the Glicko rating system -- notably, they both use Bayesian inference methods and assume that an individual skill rating can be represented as a normal distribution with a mean and deviation -- but they use different mathematical tools to get there. The debate ratings are my own adaptation that attempts to apply these mathematical tools to the peculiarities of college policy debate.
The reason for the change is simple: TrueSkill does a better job given the specific needs of the policy debate season. The bulk of this post will attempt to unpack the reasons why this is true. For now, it can just be said that TrueSkill results in ratings that more accurately predict actual round results as well as more closely reflect the wisdom of first and second round at-large bid votes.
When Ratings are "Too Right"
I think if I were to summarize the fundamental limitation of the Glicko ratings, it would be that they are too conservative. The system has a tendency to underestimate the differences between teams and make predictions that understate the favorite's win probability. To put it in more concrete terms, the ratings might suggest that one team is a 2:1 favorite over another when, in reality, they are closer to a 3:1 favorite -- or in a more distorting scenario, a 10:1 favorite when the truth is actually that it would be a miracle for the underdog to win even one out of fifty.
The somewhat counter-intuitive consequence of this is that the ratings can be less accurate even as they are better at avoiding big mistakes. The reason is that the accuracy of the ratings depends not just on picking the right winners but also on predicting the rate at which favorites lose. For example, in order for the ratings to be accurate in the aggregate, the underdog should win one out of every five debates in which the odds are 4:1 against them. If favorites substantially outperform their expected record, then that indicates that the ratings are not making good predictions because they are not adequately spread out.
Here is a graph depicting the retrodictions made by the final 2015-16 Glicko ratings. The x-axis is the expected win probability of the favorite, and the y-axis is the rate at which the ratings chose the correct winner. Ideally, we would want to see a perfectly diagonal line (i.e., for all the times the ratings suggest a 75% favorite, the favorite should win 75% of the time). Instead, what we find is a curve, indicating that favorites are winning a fair deal more than the ratings think they "should" be. For example, those that the ratings expect to be 75% favorites are actually winning over 85% of their rounds.
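The bookkeeping behind a calibration graph like this is straightforward. Here is a minimal sketch (in Python rather than the R the actual ratings use, with synthetic data purely for illustration):

```python
from collections import defaultdict

def calibration(predictions, outcomes):
    """Group rounds by the favorite's predicted win probability and
    report the rate at which the favorite actually won.
    predictions: predicted win prob for the favorite;
    outcomes: 1 if the favorite won, 0 otherwise."""
    table = defaultdict(lambda: [0, 0])  # prob -> [favorite wins, rounds]
    for p, won in zip(predictions, outcomes):
        rec = table[round(p, 2)]
        rec[0] += won
        rec[1] += 1
    return {p: wins / n for p, (wins, n) in sorted(table.items())}

# Four rounds predicted at 75% where the favorite won three of them
# would sit right on the diagonal; winning all four would sit above it.
preds = [0.75, 0.75, 0.75, 0.75, 0.55, 0.55]
wins  = [1, 1, 1, 0, 1, 0]
print(calibration(preds, wins))  # {0.55: 0.5, 0.75: 0.75}
```

With real data you would bucket thousands of rounds per probability band, but the comparison of predicted probability to observed win rate is the whole idea.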
Why does this discrepancy matter? The biggest reason is that if the predictions are too conservative, then the favorite may get "too much credit" for the win (or not enough blame for the loss). Since a team's rating goes up or down based on the difference between what the ratings expect to happen and what actually happens, an erroneous pre-round expectation leads to an erroneous post-round ratings update. The place where this appears to have created the biggest problem is situations where above average teams have been able to feast on lesser competition. At a typical tournament, power matching will help to ensure that a team will have to face off against an appropriate level of competition. However, this check might disappear entirely if a solid national level competitor travels to a smaller regional tournament where they are clearly at the top. As a result, after exceptional runs through regional competition, some teams have seen extraordinary bumps to their ratings that were perhaps unearned.
The graph for the new algorithm looks much better:
It's not a perfectly diagonal line, but it is much closer (75% favorites are winning around 77.5% of their rounds). TrueSkill accomplishes this by more effectively spreading the ratings out. For example, out of a total of over 6000 rounds in 2015-16, there were only a bit over 400 that the Glicko ratings considered to have heavy favorites with over 9:1 odds. By contrast, there were nearly 2000 rounds that TrueSkill considered to have these odds (parenthetically, I'm not sure what moral we should take from the fact that somewhere between a quarter and a third of all debates are virtually decided before they even happen).
By giving more spread between teams, the new algorithm will make it much less likely for a team to disproportionately benefit from defeating lesser competition.
The Problem of Distinguishing Prelims from Elims
One of the major concerns about the previous iteration of the ratings was that the system failed to distinguish between results that happen in the prelims versus the elims. I don't care to rehearse all of the arguments made for why elim rounds are different from prelim rounds. In the past, I have expressed some reservation about these arguments not because I necessarily disagree with their logic, but rather because there are a number of mathematical as well as conceptual questions that have to be answered before moving forward.
The first, and maybe most important, question concerns what end we're trying to accomplish by weighing elims differently. Are we actually looking for a quantitative statistical measure, or do we just want a rating that validates our qualitative impressions of what it means to win an elim round or tournament? To be clear, I’m not trying to give any value to the terms qualitative/quantitative or treat them as a mutually exclusive binary. I just think it’s important to know what we want. Maybe another way of saying this, especially apropos of elim success, is: where do you come down on the Kobe/LeBron debate? Is Kobe the greater one because of the Ringzzz and the clutchness and the assassin’s mentality, or LeBron because of PER, BPM, adjusted +/-, etc? I had an exchange with Strauss recently and he argued that no amount of NBA finals losses could ever add up to a single championship. We could extend the question to debate. Does any amount of prelim wins add up to an elim win? A tournament championship? If the goal is to value tournament championships then you don’t need fancy math to figure that out. Just count ‘em up. It’s easy.
Assuming that we are actually looking for a quantitative measurement, that raises a couple of follow-up issues.
For any kind of statistical quantification, an "a priori" decision has to be made concerning what is being measured -- the referent that attaches the stat to some meaningful part of reality. For the current algorithm, the referent is the ballot. The quality of the ratings is measured against how well it predicts/retrodicts ballots. Any tweaks can be evaluated based on whether they successfully increase the accuracy of the predictions.
Without a specific and measurable object against which to measure, a stat runs the risk of becoming arbitrary. This is why I have big objections to any kind of stat that assigns arbitrary or indiscriminate weight to specific rounds or "elim depth" (“finals at Wake is 100 points, quarters at GSU is 20, octas at Navy is 5, round 6 at Emporia is 1, etc”). This is the worst kind of stat: a qualitative evaluation (which there’s nothing wrong with on its own terms) masquerading as “hard” numerical quantification.
Given the need for a referent, there are a couple of ways of going about differentiating elims that I can think of:
1. Keep the referent the same (the ballot) but weigh elims differently. This would maybe be the easiest to implement in terms of the inputs, but there would be a potentially problematic elision that happens. See, the trick of the ratings is that the input and the output are actually basically the same thing. You input ballot results, and what you get out is a number that serves as a predictor of … ballot results. If you weigh elims differently, then you are actually now inputting two different variables. Not necessarily a huge problem unless the new variable distorts the correspondence of the output to the referent (i.e. makes it worse at predicting ballot results).
2. Change the referent (and also weigh elims differently). It would potentially be possible to create a rating with the express purpose of predicting elim wins rather than overall wins. However, we would still need to be clear about what specifically we want to predict. Tournament championships? All elim wins? At all tournaments or only the majors (and how do you define the majors)? Just the NDT? The big obstacle this would run up against is sample size (both of inputs and of results against which to test retrodictions). While a handful of teams may get 20ish regular season elim rounds, even good teams will have far fewer (there were first round bids with fewer than 10). Then you have the vast vast majority of people who are lucky to see one elim round (especially at a major). At its extreme, it is possible that this would be a stat that would only even hope to be statistically meaningful for a handful of teams (and even for those I would still have concerns about sample validity).
3. Running parallel overall and elim predictors would certainly be possible. However, beyond the fact that it would still have to address the elim sample size problem, my other concern here is ethical/political. Many (perhaps most) of the people that have contacted me to give support for doing the ratings have been from teams that are not at the very top. Maybe in some ways this is not surprising because those at the top already receive a lot of recognition. The ratings are one form of evidence of success for teams that otherwise may not receive a ton of it. It is not hard to predict that if one rating were designed expressly for the top 5-10 (or even 25) teams that the other would be devalued.
Nevertheless, even given my concerns, I do still think there is potential value in figuring out an effective way to weigh elims differently, and I think that I have discovered one that, while not perfect, does go a long way to help address the objections. The primary determinant of the ratings update algorithm will still be the quality of a team's opponent, but it is possible to apply a multiplier to the calculation for elim debates.
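Mechanically, the multiplier just scales the size of the post-round update for elim rounds. A hypothetical sketch (the K-factor and linear update rule here are generic Elo-style stand-ins for illustration, not the actual TrueSkill math):

```python
def rating_update(rating, expected, actual, k=32, is_elim=False, ewm=3):
    """Generic expected-vs-actual rating update with an elim win
    multiplier (EWM). k and the linear update are illustrative."""
    weight = k * (ewm if is_elim else 1)
    return rating + weight * (actual - expected)

# A win as a 75% favorite moves the rating three times as far in an elim.
print(rating_update(1500, 0.75, 1))                # 1508.0
print(rating_update(1500, 0.75, 1, is_elim=True))  # 1524.0
```

The quality of the opponent still drives the size of `actual - expected`; the multiplier only amplifies whatever that difference is.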
Below is a table that uses data from the past four seasons: a total of over 28,000 rounds, including about 2200 elims. There's a lot of information to process here, so I'll try to simplify it. Each row is a different iteration of the ratings using escalating elim win multipliers (EWM). I've also included the old Glicko rating as a point of comparison. The column boxes are four different metrics by which the accuracy of the ratings can be evaluated: 1) how well the final ratings retrodict all debate rounds from the year, 2) how well they retrodict only elimination rounds, 3) how well ratings that use only data through the holiday swing tournaments can predict all results after the swings, and 4) how well they predict only post-swing elim debates. "Correct" is the percentage of rounds in which the ratings picked the right winner, MAE is the mean absolute error, and MSE is the mean squared error. Without going into great detail, MSE is different from MAE in that it magnifies the consequences of big misses.
Blue is good. Red is not so good.
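For concreteness, the three metrics can be computed like this (a Python sketch with made-up probabilities; the actual ratings code is in R):

```python
def evaluate(preds, outcomes):
    """preds: predicted win probability for side A of each round;
    outcomes: 1 if side A actually won, 0 otherwise."""
    n = len(preds)
    # Correct: did the predicted favorite win?
    correct = sum((p > 0.5) == bool(o) for p, o in zip(preds, outcomes)) / n
    # MAE: mean absolute error between probability and 0/1 result
    mae = sum(abs(o - p) for p, o in zip(preds, outcomes)) / n
    # MSE: squaring the error magnifies the consequences of big misses
    mse = sum((o - p) ** 2 for p, o in zip(preds, outcomes)) / n
    return correct, mae, mse

correct, mae, mse = evaluate([0.9, 0.8, 0.3], [1, 1, 0])
print(correct, round(mae, 3), round(mse, 3))
```

Note that a set of predictions can have a perfect "Correct" score while still carrying substantial MAE/MSE, which is why the table reports all three.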
One thing is immediately apparent: every version of TrueSkill performs substantially better than Glicko by just about every metric.
The second thing that sticks out is that there is no easy and direct relationship between weighing elims and helping or hurting the accuracy of the ratings. It depends on which metric you prioritize.
I want to briefly unpack what is happening in a couple of the sections of the table, especially in the Elim Retrodictions portion. While the Elim Retrodictions provides some important information, there is a risk of imputing too much significance to its findings. Because of limitations in the sample and the fact that these are retrodictions (rather than predictions), there is a risk of overfitting the model to the peculiarities, randomness and noise of a small set of past results rather than providing a generalized model with predictive power. The stark contrast between the Elim Retrodictions and the Post-Swing Elim Predictions boxes helps to highlight the problem. Both recognize that weighting elim wins helps the accuracy of the ratings in its evaluation of elim debates, but they disagree over which level of weighting is optimal. When we try to retrodict the past, a very high elim win multiplier works best, but when we try to predict the future, things become more complicated.
The numbers in the table are relatively abstract, so I'm going to take a detour that I think should help to show how these numbers play out in more concrete terms.
NDT At-large Bid Voting as a Measure of Validity
Strictly speaking, the ratings do not attempt to replicate/predict the at-large bids for the NDT. However, the bid voters probably represent the clearest proxy that we have for "conventional wisdom" or the judgment of the community, and we can use the votes as a way to externally check how well the ratings produce results that are in line with the expert judgment of human beings that are able to account for various contextual factors outside the scope of the information available to the ratings algorithm.
Here is a table, using data from the last four years, that shows how the iterations of the ratings with different levels of elim win multipliers compare. It contrasts the actual bid vote results against the hypothetical ballots that would be produced by the ratings algorithm. MAE is the average (mean) amount that the computer ballot deviated per team from the final aggregate vote. MAE Rnk is how this would have ranked among the human ballots. So, for example, over the last four years, the basic TrueSkill algorithm without any elim multiplier produced first round at-large bid ballots that rated teams, on average, within about 1.65 spots of where they actually ended up. This would average as the 12th best human voter per year. MSE and MSE Rnk are similar, except they are weighted to magnify the consequence of larger errors (big misses).
For comparison, I've also included the average errors of the human voters themselves at the top. Once again, blue is good, red is not so good.
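As a toy illustration of how the MAE figure is computed (hypothetical teams, Python sketch):

```python
def ballot_mae(ballot, aggregate):
    """Average per-team deviation between one ballot's ordering and the
    final aggregate ordering. Both arguments are ranked lists of teams."""
    pos = {team: i for i, team in enumerate(aggregate)}
    return sum(abs(i - pos[t]) for i, t in enumerate(ballot)) / len(ballot)

# Swapping two adjacent teams on a three-team ballot is an average
# miss of two-thirds of a spot per team.
print(round(ballot_mae(["X", "Z", "Y"], ["X", "Y", "Z"]), 2))  # 0.67
```

The MSE version would square each per-team deviation before averaging, so one big miss costs more than several small ones.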
This is where things get interesting. While none of the computer algorithm ballots have been as good as the average bid voter when it comes to First Round Bids, the distance separating them is not massive. In particular, the TrueSkill algorithm with an elim win multiplier of 3 would rank as just a little below the average bid vote. Over the last four years, it would ring in as, on average, the 9th best voter as measured by MSE. While this may not seem spectacular, it does mean that the ratings produce results that are well within the range of expert human judgment.
Perhaps more significantly, the ratings have actually been better than the average bid voter when it comes to Second Round Bid voting. At lower levels of elim win multiplier, the ratings would rank, on average, as around the 5th or 6th most accurate voter over the last four years. In fact, the TrueSkill ratings with an elim win multiplier of 3 would have produced a ballot for the 2015-16 season that would have resembled the final aggregate vote more closely than any of the human voters.
The other thing that is quickly apparent from the table is that higher levels of EWM produce results that are wildly divergent from the judgment of bid voters. One of the big reasons for this is that excessive weight on elim rounds will drastically magnify the recency effect of the ratings. Those who do well at the last tournament of the year will see a significant boost in their rating that can't be checked back by subsequent information.
Elim Weight Conclusion
The numbers have convinced me that it is possible to give added weight to elim debates within the parameters of the TrueSkill algorithm in a way that helps the ratings more closely reflect the collective common sense of the community without jeopardizing the accuracy of the system's predictions -- in fact, in some ways the predictions may be enhanced.
While there is no clear answer on exactly how much extra weight elimination rounds should receive, I have decided on a multiplier of 3 as the Goldilocks option. It seems to hit the sweet spot with regard to the judgment of bid voters, and while not perfect by any of the prediction accuracy metrics, it does manage to make the ratings more accurate by many measures. At the end of the 2016-17 season, I will integrate the new data and reevaluate to determine if a change should be made.
Evaluating Opponent Strength Early in the Season
There is one other major change in the ratings that may be just as important as the shift in the basic algorithm: a way of dealing with lack of information early in the season. The solution involves running the algorithm twice: once to form a data frame of provisional opponent ratings, and a second time to formulate each team's actual rating.
For both Glicko and TrueSkill (as well as Elo) rating systems, the difference in ratings between two opponents indicates the probability of the outcome of a debate between them. A small ratings difference indicates fairly evenly matched teams, while a large ratings difference suggests that one team is a heavy favorite. Team ratings go up or down based on the difference between the predicted outcome of the debate and the actual outcome. After the round, each team's ratings will be recalculated based on how they performed against expectations. So, if a team defeats an opponent that it was heavily expected to defeat, then its rating may barely move at all. But if an underdog overcomes the odds and wins a big upset, then their rating would move a much larger amount. Evenly matched opponents will experience changes somewhere in the middle. As a result, opponent strength is integrated from the beginning of the calculation. Wins over stronger opponents are worth more because there is a larger difference between actual outcome and predicted outcome.
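The core mechanism can be sketched with a generic Elo-style expectation (the logistic curve and constants here are illustrative; TrueSkill uses Gaussian machinery, but the logic of "rating moves by the gap between result and prediction" is the same):

```python
def expected_score(r_a, r_b, scale=400):
    """Predicted win probability for A, from the rating difference."""
    return 1 / (1 + 10 ** ((r_b - r_a) / scale))

def update(rating, expected, actual, k=32):
    """Rating moves by the gap between the result and the prediction."""
    return rating + k * (actual - expected)

fav, dog = 1800, 1400
e = expected_score(fav, dog)  # about 0.91: a heavy favorite
print(round(update(fav, e, 1) - fav, 1))      # expected win: tiny gain (2.9)
print(round(update(dog, 1 - e, 1) - dog, 1))  # upset win: big gain (29.1)
```

This is why opponent strength is baked in from the start: the same ballot is worth far more against a stronger opponent, because the prediction it overturns was far more lopsided.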
The difficulty arises at the beginning of the season when there is a lack of information to form a reliable rating. In order to formulate a prediction concerning the outcome of a given debate, the ratings need to be able to assess the strength of each team. If too few rounds have occurred, then the algorithm's prediction is far less reliable. This can be seen at the extreme before round one of the first tournament of the year, when zero information is available to form a prior expectation. The ratings are a blank slate in this moment, incapable of distinguishing whether you are debating against college novices fresh out of their first interest meeting or last year's national champions.
In its previous iteration, the ratings relied on one very helpful tool to cope with this problem: deviation. Each team's rating distribution is defined by two parts: the mean of its skill distribution and the breadth of the variation in that skill distribution. In more basic stats terms, this can be understood as similar to a confidence interval. The algorithm expresses more confidence in a team's rating as its deviation goes down. It uses this confidence to weight how much a single debate round can influence a team's rating. A team with a large deviation will see their rating fluctuate rapidly (and all teams start the season with very large deviations), while a team with a low deviation will have the weight of their previous rounds prevent a new round result from having too much influence. Deviation goes down as you get more rounds.
Deviation helps the ratings cope with lack of information at the beginning of the season. Each team begins the season with a very large deviation, which is basically the algorithm's way of acknowledging that it is not confident in the mean rating. Since deviation is used to weight the post-round ratings update such that teams with large deviations experience larger changes, a team's rating can more quickly self-correct from earlier inaccurate predictions. Additionally, losses to teams with high deviations have less of an effect than losses to teams with low deviations.
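A rough sketch of the idea (the weighting formula below is invented purely for illustration; it is not the actual Glicko or TrueSkill update):

```python
from math import sqrt

def weighted_update(mu, sigma, expected, actual, beta=4.0, step=10.0):
    """High-deviation teams move a lot; settled teams barely move.
    beta and step are arbitrary illustrative constants."""
    gain = sigma ** 2 / (sigma ** 2 + beta ** 2)  # confidence weight in [0, 1)
    new_mu = mu + gain * (actual - expected) * step
    new_sigma = sigma * sqrt(1 - 0.5 * gain)      # deviation shrinks with data
    return new_mu, new_sigma

new_team = weighted_update(25.0, 8.0, 0.5, 1)  # big sigma: large jump
veteran  = weighted_update(25.0, 1.0, 0.5, 1)  # small sigma: small nudge
print(new_team, veteran)
```

The same surprising win moves the high-deviation team several points while barely nudging the established one, and both teams come out of the round with a slightly smaller deviation.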
While deviation helps to significantly mitigate the effect of limited information at the beginning of the season, it does not entirely resolve the problem. The effects of erroneous predictions are substantially evened out over time, but they are never completely eliminated and can add up. For an individual team the effect will be quite small, often negligible. However, the recent trend toward segregation in the early travel schedule magnifies the problem, especially if there is a disparity in the strength of competition at the different tournaments. If there is no prior information on the teams, the ratings are unable to distinguish between an undefeated team at one tournament versus another.
The current iteration of the ratings relies on the eventual merging of the segregated pools of debaters to even things out over time. Given enough time and intermingling of teams, it would. Unfortunately, the debate season is a closed cycle, and the ratings would be helped if they could accelerate the process.
The solution to this problem is actually relatively simple. If the problem is a lack of information to form a reliable rating for assessing how good your opponent is, then what we need to do is give the algorithm more information. One way to do this would be to go into the past, using results from previous seasons to form an estimate of the team's skill. However, beyond the fact that this doesn't address the lack of information on first year debaters or the complexities of new partnerships, I find this solution undesirable because it also violates the singularity of each season.
Instead, what we can do is use information from the future to gain a more accurate picture of opponent quality. It is possible to use results from subsequent rounds to form a better estimate of how good a given opponent is. What the new ratings do is effectively run the ratings algorithm twice. On the first pass, it creates a provisional rating for each team that uses all available information -- for example, when I update the ratings in January after the swing tournaments, it will use all rounds from the beginning of the season through those tournaments. On the second pass, it will use those provisional ratings in its predictions to estimate opponent strength until such a time as that opponent has a sufficiently reliable rating.
To be clear, this does not involve double counting. The provisional rating is only ever used to evaluate how strong an opponent is. The second pass starts each team's actual rating from scratch. When Team A debates Team B in round one of the season opener, the ratings create a separate prediction for each side. One prediction will be between the ratings for a blank slate Team A versus a reliable Team B; the other between a blank slate Team B and a reliable Team A. The first will be used to update Team A's rating, the latter to update Team B's rating. The algorithm eventually stops using a team's provisional rating once that team's actual rating becomes reliable enough (i.e., its deviation becomes small enough, a length of time that varies, but is usually reached in the neighborhood of 25 rounds).
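Structurally, the two-pass scheme looks something like this (a simplified Python sketch: a generic Elo-style update stands in for the real math, and a round count stands in for the deviation-based reliability check):

```python
def run_pass(rounds, provisional=None, reliable_after=25):
    """One pass of the ratings over (team_a, team_b, a_won) results.
    If provisional ratings are supplied, they stand in for an opponent's
    strength until that opponent's own rating rests on enough rounds."""
    ratings, counts = {}, {}
    for a, b, a_won in rounds:
        updates = {}
        for team, opp, won in ((a, b, a_won), (b, a, not a_won)):
            r_team = ratings.get(team, 1500.0)
            if provisional and counts.get(opp, 0) < reliable_after:
                r_opp = provisional[opp]       # borrow the provisional rating
            else:
                r_opp = ratings.get(opp, 1500.0)
            e = 1 / (1 + 10 ** ((r_opp - r_team) / 400))
            updates[team] = r_team + 32 * ((1 if won else 0) - e)
        for team, r in updates.items():        # apply both sides' updates
            ratings[team] = r
            counts[team] = counts.get(team, 0) + 1
    return ratings

rounds = [("A", "B", True), ("A", "C", True), ("B", "C", True)]
provisional = run_pass(rounds)          # pass 1: provisional opponent ratings
final = run_pass(rounds, provisional)   # pass 2: actual ratings from scratch
```

Note that each team's own rating in the second pass still starts from scratch; the provisional numbers are consulted only on the opponent's side of the prediction, which is what avoids double counting.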
Other Small Changes
There are a couple of other small changes that will have some effect on how the rankings are calculated.
The first concerns how the final ratings are determined by subtracting a team's deviation from their rating to produce an "adjusted rating." The original reason for doing this is that it gives a more "confident" rating by adjusting downward those teams that we have less data about. In effect, it says that we are confident that a team is "at least" as good as their adjusted rating. If two teams have about the same mean rating but one has significantly fewer rounds than the other, then we should be less confident that their rating is accurate.
While helpful to weed out teams with high deviations, there is a limit to the usefulness of this procedure. When most regularly travelling teams end the season with somewhere between 80 and 100 rounds, it is somewhat silly to use deviation as a tool to delineate between them. This past year, there were a few examples even in the top 25 where a lower rated team was able to jump a higher rated team merely because they had a few more debates under their belts.
In the future, the ratings will continue to use deviation as a tool to adjust the ratings of teams, but it will stop making delineations once teams reach a certain threshold. This threshold will be calculated as the median of the 100 smallest deviations.
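In code, the adjustment might look like this (a Python sketch; team names and numbers are invented, and the threshold rule is applied by flooring each deviation at the threshold):

```python
from statistics import median

def adjusted_ratings(teams):
    """teams: {name: (rating, deviation)}. Adjusted rating = rating minus
    deviation, except deviations below the threshold (median of the 100
    smallest, or of all of them when there are fewer than 100 teams) are
    floored at the threshold so settled teams aren't delineated by it."""
    devs = sorted(d for _, d in teams.values())
    threshold = median(devs[:100])
    return {name: r - max(d, threshold) for name, (r, d) in teams.items()}

# Without the floor, B's slightly smaller deviation (a few extra rounds)
# would erase A's one-point lead; with it, A stays ahead.
teams = {"A": (1600, 2.0), "B": (1599, 1.0), "C": (1500, 6.0)}
print(adjusted_ratings(teams))
```

Teams with genuinely large deviations (like C here) still get adjusted downward in full; the floor only stops hair-splitting among heavy travelers.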
A second change is that the ratings will no longer attempt to model individual debaters with multiple partnerships. Instead, they will treat each two-person team as a discrete unit. The obstacles to modeling multiple partnerships are just too large, primarily because we don't collect the kind of data that would make it possible. How much is each partner responsible for a win? Does this question even make sense to ask? We all know that a great debater can carry a poor partner to a lot of wins. But we also know that even a good debater will lose rounds that they otherwise wouldn't have if they travel with a partner with lesser skill.
I know that this may disadvantage some debaters who are forced to frequently change partners, but it would be generous to even say that my previous attempts to solve the problem looked like trying to duct tape a windshield on. It kinda sorta worked, but mostly by luck, and even still had the effect of perhaps unfairly harming the ratings of some debaters.
Finally, I have removed the eigenvector centrality component of the previous system. This was originally a way to ensure that a team possessed a set of rounds that were adequately integrated into the larger community pool. TrueSkill doesn't need it.
I've attached a copy of the R code that I use to run the algorithm. I make no claims to being a good coder. What little I know is self-taught. It's slow, but it gets the job done. XML files of tournament results are available for download on Tabroom using their API.
Resources concerning TrueSkill can be found at Microsoft Research. There is a great summary here, and a really good in-depth explanation of the mathematical principles at work has been written by Jeff Moser.
The final ratings for the 2015-16 season are posted.
Congrats to everybody on a great season. If you went to a tournament, I applaud you. The first one is always the hardest.
Just a reminder to not just look at the rank order, but also check out the Adjusted Rating to get a sense of how close some teams are. A difference between two teams that is in the single digits basically means a coin flip. Teams 10 through 14 are functionally tied. As are teams 15 through 18. Very small point spreads can sometimes make for fairly large rank differences.
First things first, I want to make it clear, once again, that I work for Michigan. Relatedly, the ratings are the result of a mathematical algorithm and are not a reflection of my personal opinions. If anybody has questions about particular placements, contact me on facebook and I might be able to provide an explanation.
Second, these ratings do not include the Dartmouth Round Robin. Those results are not yet posted on Tabroom, and it would be a serious hassle for me to try to manually code them. If I were to guess what effect their inclusion would have, I believe there's a good chance Emory SK could jump into the same rating ballpark as Michigan/Harvard and Wake Forest AS would drop to 7th.
EDIT: The ratings do now include Dartmouth. The effect was basically as predicted.
Third, I know there's a lot of stress about bid voting. I want to make it clear that these ratings are not intended to replicate the decisions of bid voters. I would not encourage anybody to take the top 16 here and assume that those teams are necessarily first rounds (nor would I expect that any bid voter would do this). The ratings provide one way of processing and understanding who the best teams in the country are. However, because of the nature of the beast, the ratings are also capable of some misses when something is limited/different/peculiar about the data for a team: maybe a team hasn't attended as many tournaments, maybe their schedule has included more regionals at the expense of national level tournaments, maybe they've had multiple partners, etc.
Fourth, caveats aside, the ratings are information. They are a way of processing results that produces a good picture of who is likely to beat whom. To that extent, I think they can be a useful tool for bid voters. I encourage people to maybe deemphasize the "ranking" number and instead look more at the "rating" number and find places where ratings cluster together. Small ratings differences might indicate that there's not a meaningful difference between teams. For example, a 15 point rating difference means that the higher ranked team is only about a 52% favorite against the lower ranked team (if you want to play around with these numbers, the "Prediction Calculator" page found on the menu bar above lets you calculate win probabilities). Small differences can be a product of noise or some random variation that happens naturally.
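To get a feel for the shape of that relationship, here is an illustrative calculation. It assumes, purely for illustration, an Elo-style logistic curve with a scale of 400; I don't know the actual scale constant the ratings use, so treat this as a sketch of the shape rather than a reimplementation of the Prediction Calculator:

```python
def win_probability(rating_diff, scale=400):
    """Win probability for the higher-rated team under an assumed
    Elo-style logistic curve. The scale constant is illustrative."""
    return 1 / (1 + 10 ** (-rating_diff / scale))

for diff in (15, 50, 150):
    print(diff, round(win_probability(diff), 3))  # 15 -> 0.522, for example
```

Under this assumed scale, a 15-point gap is barely better than a coin flip, while it takes a gap ten times that size before the favorite is roughly 7:3.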
Fifth, participation awards for everybody! Everybody's a winner!
Sixth, some teams are more winners than others! Congrats to Emory KS, Wake Forest AS, Rutgers NM, and Liberty BC on some great performances over the last few weeks.
The newest ratings are now posted. We're getting to a point where the ratings are showing a lot less movement. The majority of the teams in the top 50 are within a handful of ranks of where they were previously. Deviances are approaching their lower limit for teams that have been heavy travelers.
Nevertheless, there are a couple of interesting developments to note:
The last set of ratings before the holiday break are now posted.
Once again, all the caveats about limited data still apply. Nevertheless, the ratings would still "retrodict" about 79% of debates correctly, which is actually fairly well in line with how the final ratings from previous seasons performed.
There were many notable performances worthy of acclamation. Northwestern MO's run to the finals at the Wake Forest tournament helped them move up 5 spots to #10. Vermont BL went 7-1 and had some huge prelim wins, inching them closer to the top 10. Despite losing in doubles, Berkeley SW's undefeated tear through prelims enabled them to rise to #5. Major gains were also made by Kentucky GV, Baylor BZ, WGU MS, Baylor EG, and Kentucky AM. Wake Forest AS was hurt by virtue of remaining stationary, as a couple of teams gained the points necessary to move past them. Having attended only 3 tournaments is saddling Wake with a relatively large deviation. Both Kansas BR and BiCo moved up a handful of spots to be right at the first round cutoff. Central Oklahoma HS combined a perfect regional performance with a solid showing at Wake to move up several spots. Finally, the largest jump in raw points came from Houston BR, who advanced 55 spots by picking up 139 points with their efforts at Wake.
Some may wonder why Berkeley MS still leads Michigan KM at the top of the ratings given Michigan won the Shirley (and Harvard before that). There is a lot that could be written on the subject, but I'm going to keep my comments relatively brief due to potential conflict of interest issues. The short of it is that Berkeley did little to persuade the system that its rating was too high, and Michigan didn't do enough to persuade the system that its rating was too low. Specifically, based on their previous rating Berkeley was expected to pick up about 9.2 wins against the 11 opponents they competed against at Wake. They went 9-2 and so fell short by 0.2 wins. Michigan was expected to get about 9.4 wins in their 13 rounds. They went 11-2, thus exceeding the prediction by 1.6 wins. Michigan did make a fairly substantial gain on Berkeley in terms of ratings points, but it wasn't enough to make up the balance. It probably doesn't take any fancy statistics to look at the past two tournaments and see two things about their relative performances. Michigan has more quality wins, but Berkeley has better losses. Berkeley's only losses have been to the #2, #3, and #7 teams. By contrast, Michigan lost to #1, #11, #18, and #49. On the other hand, over that same time period Michigan has 9 wins against those currently ranked in the top 10 versus Berkeley's 4. Either way, both of these teams are performing exceptionally (as are many others), and it will be interesting to see how the rest of the season plays out.
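The "expected wins" figures above (9.2 for Berkeley, 9.4 for Michigan) come from summing the per-round win probabilities implied by a team's rating against each opponent it actually faced. The opponent ratings below are made up for illustration; only the method, not the numbers, reflects what the post describes:

```python
def expected_wins(team_rating: float, opponent_ratings: list) -> float:
    """Expected number of wins for a team against a slate of opponents:
    the sum of the logistic win probability for each round. A team that
    goes, say, 11-2 against an 9.4-expected-win slate gains rating points."""
    return sum(1.0 / (1.0 + 10.0 ** ((opp - team_rating) / 400.0))
               for opp in opponent_ratings)

# Hypothetical slate of eleven opponent ratings (not the actual Wake field):
opponents = [1700, 1750, 1800, 1820, 1850, 1880, 1900, 1950, 1980, 2000, 2050]
print(round(expected_wins(2050, opponents), 1))  # 7.8
```

Going 9-2 against a slate like this would exceed expectations and raise the team's rating; falling short of the expected total would lower it.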
The second set of ratings (which includes everything up to the Harvard tournament) are now posted.
Remember, the ratings are a self-correcting work in progress that is only as good as the available data, and right now we're still working on relatively limited data (especially in a world of a fairly divided travel schedule) that is open to large fluctuation. Also, it's important to remember that the ratings *are not* meant to be a reflection of community perceptions of who is "the best" or even necessarily who has had the "best season." Those involve value judgments that go beyond the scope of what is trying to be accomplished here. The ratings are nothing more nor less than a system to predict future results based on past results.
There are a number of big movers in this edition of the ratings.
Perhaps most surprisingly, Berkeley MS jumps to the top spot and Harvard HS drops to 3rd. Even though they didn't win Harvard, Berkeley arguably had the best tournament of anybody there. Their only loss came to the #2 ranked team in semifinals on a 2-1 decision, and they picked up a number of big wins along the way (including one over the lone team that they lost to). Michigan KM's winning of the Harvard tournament was tempered by a mediocre 4-4 finish at the Kentucky RR. Harvard HS's rating dropped because they underperformed relative to expectations at the round robin by going 5-3. If you look closely, however, you can see that they actually still maintain a raw ratings advantage over Michigan, but the fact that they've only attended 3 tournaments drops their adjusted rating to place them slightly behind. Despite winning the Weber RR, Iowa KL dropped a couple of spots due to a disappointing Harvard. However, their actual rating didn't change much. The drop was more a function of others gaining the points to surpass them.
Michigan State ST was a huge winner, jumping 7 spots due to their excellent performance at the Kentucky RR. They managed to pick up 7 wins there despite the fact that their previous rating would have pegged them to go 4-4. A fantastic UNLV and strong Harvard tournament enabled Texas KS to jump 11 slots to number 11. Vermont BL also jumped to the edge of first round territory. Perhaps the biggest leap was made by USC SS, who gained nearly 50 spots to now sit at 18th.
Further down the list, we also saw excellent performances by Baylor BC, Georgia BR, Trinity RS, Texas CS, UNT CS, Rutgers HQ and WSU CW, among others.
Looking at last season's results, which amount to a data set of 7272 debates, the Affirmative won 52.5% of the time. Whether one considers this a large split is somewhat subjective, but using a binomial distribution we can say that the odds of this happening randomly are functionally zero. In terms of the Glicko ratings, a team gains something like a 17 point advantage by virtue of being affirmative.
To be clear, these numbers do not speak to the cause of the bias. Some might like to say that it's the proliferation of non-topical cases or "new debate." Or perhaps it could be due to the death of case debate, a general favoring of "Aff choice" on framework, or a biased resolution. However, the cause is a question for another day - a question that would require a lot more work than I'm willing to give it at the moment. What I can say is that it cannot be explained by the possibility that stronger teams somehow managed to be Aff more often. I don't know how this would even happen in the first place, but it's not borne out by the data anyway. The average Aff rating is only a fraction of a point higher than the average Neg rating.
While it is possible that the bias was due to the nature of last year's resolution, so far the trend has continued this year as well. Aff win percentage is 52.2% over 2248 counted rounds, which would only happen randomly about 2% of the time. Perhaps the advantage will level out as the season goes on. Conventional wisdom sometimes says that the Aff's advantage is greatest at the beginning of the season when teams have not yet had the chance to fully prepare negative strategies. However, if last season is any indication, we cannot expect to see any evening out. Aff win percentage fluctuated a bit over the course of the year, but it always stayed in the black. In fact, records showed the greatest advantage for the Aff during the second semester period in between the swing tournaments and district qualifiers.
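The significance figures above are easy to reproduce. Using the normal approximation to the binomial (an assumption on my part; the post doesn't say which test was used, and an exact binomial test would give nearly identical answers at these sample sizes), the one-sided probability of an Aff record at least this lopsided under a fair 50/50 split comes out right around the quoted 2% for this season, and vanishingly small for 2014-15:

```python
import math

def one_sided_p(wins: int, n: int) -> float:
    """Normal approximation to the one-sided binomial p-value: the chance
    of seeing at least `wins` Aff wins in `n` debates if sides were 50/50."""
    z = (wins - n / 2) / math.sqrt(n * 0.25)
    return 0.5 * math.erfc(z / math.sqrt(2))

# This season: 52.2% Aff wins over 2248 rounds -> roughly a 2% chance
print(round(one_sided_p(round(0.522 * 2248), 2248), 3))  # 0.02
# 2014-15: 52.5% over 7272 rounds -> functionally zero
print(one_sided_p(round(0.525 * 7272), 7272))
```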
The effect may also apply more narrowly to elimination rounds, perhaps even being magnified depending on how one interprets the data. Keep in mind that the size of the sample is much smaller. Whereas the total data set for 2014-15 was 7272 debates, there are only 563 elimination rounds counted. Of these, the affirmative won 54.4% - a definite step up from the prelim data. However, it's also possible to interpret the results in a way that suggests that the teams that were on the Aff were overall slightly better than the negatives. While the mean rating for negatives was about 5 points higher than for the affirmative, it happens to be the case that in 52.1% of debates the Aff was favored to win. In the end, the bias in elims appears to have been about the same as in prelims. However, the smaller sample size means that randomness is more likely to have played a role. We could expect to see numbers like these or worse around 14% of the time even if sides were equally balanced.
The data for this year is even more limited. I have 184 elim rounds, of which the affirmative has won 57.6%, a number that is quite high. However, once again, the Aff was also the favored team in 53.9% of these debates, suggesting that the advantage might not be that different than it has been for prelims.
The graph below breaks down the 2014-15 season to more narrowly show how each side performed as varying degrees of favorites. The breakdown suggests a larger bias for Affs at lower point spreads that evens out as the gap between teams rises. The data does provide some support for Rashad Evans's hypothesis that bigger upsets can be scored as the negative. Once the point spread reaches around 300, Neg win percentage starts to outpace the Aff -- though, of course, at this point any chance of upset is very small.
The first set of ratings for 2015-16 are now posted.
Full disclosure: In addition to my work with Concordia, I am also helping Michigan this year. To avoid possible conflict of interest problems, I have made no changes to the ratings algorithm since summer and will also make no changes over the course of this year. This is pretty easy with Glicko style ratings because once you set them going all you have to do is enter new results data as they arrive.
A quick refresher on how the ratings work: Glicko style ratings are determined in a self-correcting relational way based on who a team competes against. If you win, your rating goes up. If you lose, your rating goes down. How much it moves is based on the rating of your opponent. If you beat somebody with a much higher rating, yours will go up a lot. If you beat somebody with a much lower rating, yours might barely move at all. And vice versa for losing. At the beginning of the season, each team starts with the same rating (1500). As results come in, the ratings begin to separate teams by moving teams up and down as they win and lose debates.

Since there is little data early on, the ratings are much rougher at the beginning. They gradually become more fine tuned over the course of the season. They need some time to sort themselves out. More data = better. More data also gradually stabilizes a team's rating. At the beginning of the season the ratings are more unstable and react more quickly to change than they do at the end of the season.

The numeric value of a team's rating is essentially a predictive value. The difference between two teams' ratings forms the basis of a prediction concerning the outcome of a debate between them. For example, a team with a 200 point advantage is considered to be a 3:1 favorite over their opponent. You can find out the predictions for your own debates by using the prediction calculator.
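The relational principle described above can be sketched with a simplified Elo-style update. To be clear, this is an illustration, not the site's actual code: full Glicko also scales each step by both teams' rating deviations (so unstable ratings move faster), which this sketch omits:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0):
    """Simplified Elo-style update (a stand-in for the Glicko update,
    minus the deviation scaling). Team A's rating moves in proportion
    to how surprising the result was given the two ratings."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Beating a much higher-rated opponent moves your rating a lot...
print(elo_update(1500, 1700, a_won=True))
# ...while beating a much lower-rated opponent barely moves it.
print(elo_update(1700, 1500, a_won=True))
```

The first upset win gains the underdog about 24 points; the second, expected win gains the favorite fewer than 8.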
Comments on where the ratings sit now:
As the new season bears down on us, I've made a couple of updates to the ratings.
1. Ratings for 2013-14: I've posted the ratings for the 2013-14 season and have begun to organize the site by yearly results. Tabroom data goes back to the 2012-13 season, and I plan to download and process those results at some point in the near future. The 2013-14 results are largely complete, but a small amount of data is missing (a couple of invitationals, regional qualifiers, and ADA).
2. New Career Ratings: I've also added a new rating that charts progress across seasons. A description of the Career Rating can be found in the FAQ. In short, for the typical seasonal ratings, each debater starts the season with the default rating of 1500 and deviation of 350. It makes no assumptions about each debater's skill level. The career ratings, by contrast, use the previous season's rating & deviance for those debaters who are returning, while only assigning the default numbers to those who are not yet in the database. Currently, the career rating is only available for the 2014-15 season, but will be a part of the new season and will be added to 2013-14 when the data from 2012-13 becomes available.
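The seeding rule for Career Ratings is simple enough to sketch. The names and data shapes below are hypothetical (the site's internals aren't published); the logic is just what the paragraph above describes, carry-over for returners and defaults for newcomers:

```python
DEFAULT_RATING, DEFAULT_DEVIATION = 1500.0, 350.0

def seed_career_ratings(previous_season: dict, roster: list) -> dict:
    """Start-of-season seeding for Career Ratings: returning debaters carry
    over last season's (rating, deviation); debaters not yet in the database
    get the defaults. Data shapes are hypothetical, not the site's code."""
    return {
        name: previous_season.get(name, (DEFAULT_RATING, DEFAULT_DEVIATION))
        for name in roster
    }

last_year = {"Michigan KM": (1950.0, 45.0)}
print(seed_career_ratings(last_year, ["Michigan KM", "New Team XY"]))
# {'Michigan KM': (1950.0, 45.0), 'New Team XY': (1500.0, 350.0)}
```

By contrast, the regular seasonal ratings would hand everybody, returner or not, the (1500, 350) defaults.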
3. Elimination of "Weighted" ratings: This actually happened some time ago, but I can make it official now. The weighted ratings were an attempt to account for the difference in quality of teams at the beginning of the season. I'm abandoning them for two reasons: 1) though the logic behind the weighted ratings was intuitive, the data suggested that the unweighted ratings actually performed nearly as well; 2) the new Career Ratings provide the same information in a more transparent way. The weighted ratings were actually something of an attempt at a middle ground between the seasonal ratings and the career ratings. They used teams' ratings from the previous year regressed to the mean but reset everybody's deviance. Now the two different ratings are simply listed separately instead of trying to turn them into some makeshift hybrid.
EDIT: To clarify, the Career Rating is not a reflection of a team's peak. Nor is it a "career achievement" style rating. Rather, it is the rating that one has at the end of that specific season / scoring period.