The newest update to the ratings reflects a change in methodology. The ratings now include a much wider range of tournaments but exclude rounds with participants whose rating would not be reliable due to a lack of representative opponents. The reasoning and method are described in this blog post. The tournament results now included are listed on the ratings page.
A couple of reactions to the new ratings: First, Oklahoma Ard & Cherry reveal something of a difficulty for the ratings, one which hasn't been fully resolved. Though I don't release them, the true bearers of ratings are actually not team partnerships, but rather individual debaters, whose ratings are then averaged to produce the team rating. A strength of this arrangement is that it allows debaters to get credit no matter who their partner is. As a result, Oklahoma AC's rating has evolved even though the two of them have not debated together since the last ratings update. The difficulty for the ratings calculation is that the way that it deals with individual debaters joining up with new partners that have significantly different ratings is not necessarily perfect. The original design of the Glicko ratings was for one-on-one competitions like chess. Debate complicates this with its 2 person teams. This is not a problem for teams that stick together, but it does present some difficulty for those that change partners. The way that the ratings currently deal with this issue is to more heavily weight each individual's own rating but factor in the rating of their partner as well. Thus, for example, when Ard debates with Chiles, Ard's rating will grow more slowly than Chiles's will with each win and drop more with each loss. The disparity in their ratings will shrink over time if they continue to debate together. However, the ratings also don't treat Ard the same as if he were still debating with Cherry. He will be rewarded slightly more for wins and punished slightly less for losses given the rating's knowledge that he is debating with a lower rated partner. Second, the big winners with the update include Minnesota CE, Wake Forest LS, Kansas HR, Towson TW, Michigan DM, USC BL, KCKCC CN & CJ, and WSU MO. Third, Harvard DH continues to outpace Harvard BS. This demonstrates both a strength and weakness of ratings like these. 
BoSu's season has been stellar by any measurement, making it to 2 finals and 2 semis and accumulating only 4 losses. DH's results have also been top notch, but they haven't made it to any final rounds and have a lower overall win percentage, though they did win the Kentucky Round Robin. DH certainly has a lot of very good wins, but their (single point) lead on BoSu has more to do with the fact that they've done very well at six tournaments as opposed to BoSu's four. BoSu's position is due to their relatively small sample size. They just haven't had as much time to prove themselves. My impression is that this will probably work itself out before long unless BoSu decides to not travel again before the postseason. Number of rounds matters less and less over time, but since all teams start the season with the same default (1500) rating, it does take some time for each team to differentiate itself.
I apologize to anybody who has been curious about the post-swing ratings changes. The delay has been in large part because I have been working on a significant revision to the way that the ratings are calculated. Previously, only national level and large regional level tournaments were included in the ratings calculations. The reason behind this decision is discussed below, but the short version is that now the ratings will be able to include all (or nearly all) tournaments. Based on an analysis of the 2013-14 season, I believe that it is possible to include a much larger variety of tournaments in a way that not only doesn't hurt the accuracy of the ratings, but actually improves on it.

## Why Tournament Inclusion is an Issue

One might wonder why it would ever be a problem to include a wider variety of tournaments. After all, isn't more data better? Generally, this is true. In fact, Glicko/Elo style ratings have a significant advantage over some other kinds of ratings because tournament-level data isn't ever even entered into the calculation other than as a specific time period during which a set of rounds occurs. Whereas some other ratings have to "weight" results based on the perceived "quality" of the tournament, Glicko ratings sidestep this question because the only factor that enters the calculation is each and every head-to-head match-up and who wins. In other words, these ratings don't consider that "Team X went 7-1 at Harvard." Instead, they are only concerned with the fact that "Team X beat Teams A, D, F, R, T, V and Z, and lost to Team Q."

Nevertheless, there is one sense in which the choice to include a tournament has an impact on outcomes and the predictive accuracy of the ratings. Ratings based exclusively on head-to-head data still depend on every team's opponents being measurably related to every other team's opponents.
At a basic level, this means that sample size is important, but sample size can be easily addressed by excluding teams whose sample is too small. The harder problem to overcome is the distortion that can be caused by teams whose results are based on an excessively regionalized set of opponents. Since the meaning of a team's rating is really nothing more than their relative position within the overall network of teams, Glicko ratings run into a problem when a team is not adequately "networked."

Take an extreme example. Imagine a small regional tournament in which none of the participants ever attend another tournament. Since the ratings are determined relationally, the winner may hypothetically emerge from the tournament with a rating of 1750 or higher (1500 is average; a team rated 1750 would be considered a 4:1 favorite over an "average" team). Since their and their opponents' ratings were never influenced by the ratings of other teams outside the tournament, this rating would be highly unreliable. Basically, what we have is a "big fish in a little pond" problem.

A less extreme, but still noticeable, version of this problem exists when the ratings attempt to include small regional tournaments. Many of the teams that attend these tournaments never reach a point where their opponents have become representative enough of the larger pool of debaters. People already account for this problem, consciously or subconsciously, when they make judgments of team quality. That's why, for instance, at-large bid voters may account for wins at regional tournaments but weigh them less, and generally it is very difficult for a team to get a second-round bid if they haven't attended at least a couple of national-level tournaments. I believe that I have figured out a way to include (nearly) all tournaments in a way that doesn't undermine the reliability of the ratings.
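The 4:1 figure in the big-fish example can be checked with the logistic expectation formula that Elo and Glicko share. This is a minimal sketch, not the ratings' actual code: it ignores Glicko's rating-deviation term (which pulls predictions toward 50% when ratings are uncertain), and the function names are hypothetical.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Elo-style expected score for team A against team B.

    Assumption: the rating-deviation weighting used by full Glicko is
    left out, so this is the prediction for two well-established ratings.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def odds(probability: float) -> float:
    """Convert a win probability into odds-to-one in favor."""
    return probability / (1.0 - probability)


# A 1750-rated team against an average (1500) team.
p = win_probability(1750, 1500)
print(f"win probability {p:.3f}, roughly {odds(p):.1f}:1")
```

A 250-point gap works out to a bit better than 4:1, which is why an isolated tournament winner with a 1750 rating would be a strong claim to make on so little outside evidence.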
## Social Networking Analysis & Eigenvector Centrality

Instead of making a decision on whether a tournament is adequately "representative" of the larger national circuit, we can focus narrowly on whether each individual team is adequately "interconnected" to make their rating reliable. If they are not, then we can exclude all of the rounds that they participate in, meaning that their opponent will neither be rewarded nor punished for the results.

Eigenvector centrality (EV) is a metric used in social networking analysis to determine how connected any given individual is to the larger network. A basic definition can be found here. In short, "Eigenvector centrality is calculated by assessing how well connected an individual is to the parts of the network with the greatest connectivity. Individuals with high eigenvector scores have many connections, and their connections have many connections, and their connections have many connections … out to the end of the network" (www.activatenetworks.net). Eigenvector centrality has been used in a number of applications, most notably Google's PageRank algorithm.

While it could theoretically be possible to use eigenvector centrality as the central mechanism for an entire ratings system, for my purposes I am using it in a much more limited capacity. It can be used to set a baseline degree of centrality that a team must meet for their results to be included in the Glicko calculation. Here, a high eigenvector centrality is not necessarily an automatic virtue (you could be highly interconnected and yet lose a lot of debates), but a very low EV score could be grounds for exclusion because of the unreliability of the data. Below I hope to show that setting a low EV hurdle is both a means to sidestep the tournament inclusion problem and a way to improve the overall accuracy of the ratings.
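To make the idea concrete, here is a rough sketch of computing eigenvector centrality over a graph of who has debated whom, then dropping rounds that involve a poorly connected team. Several things here are assumptions rather than the method actually used for the ratings: the power-iteration approach, scaling scores so the best-connected team equals 1, the toy round list, and the function names.

```python
from collections import defaultdict


def eigenvector_centrality(rounds, iterations=200, tol=1e-9):
    """Power iteration on the 'who debated whom' graph.

    rounds: list of (team_a, team_b) pairs; repeat meetings add weight.
    Returns {team: score}, scaled so the most central team scores 1.0
    (the scaling is an assumption made for this sketch).
    """
    neighbors = defaultdict(lambda: defaultdict(int))
    for a, b in rounds:
        neighbors[a][b] += 1
        neighbors[b][a] += 1

    scores = {team: 1.0 for team in neighbors}
    for _ in range(iterations):
        # Iterate with (A + I) rather than A so the scores cannot
        # oscillate forever on graphs with a two-sided structure.
        new = {
            team: scores[team] + sum(w * scores[opp] for opp, w in opps.items())
            for team, opps in neighbors.items()
        }
        top = max(new.values())
        new = {team: value / top for team, value in new.items()}
        delta = max(abs(new[team] - scores[team]) for team in new)
        scores = new
        if delta < tol:
            break
    return scores


def reliable_rounds(rounds, baseline=0.15):
    """Keep only rounds in which both teams meet the EV baseline."""
    scores = eigenvector_centrality(rounds)
    return [(a, b) for a, b in rounds if scores[a] >= baseline and scores[b] >= baseline]


# Toy data: four teams that all meet each other, one team (E) with a
# single link into that group, and an isolated pair (X, Y).
sample_rounds = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("D", "E"), ("X", "Y"),
]
sample_scores = eigenvector_centrality(sample_rounds)
kept = reliable_rounds(sample_rounds)
```

In the toy data the isolated pair's scores collapse toward zero, so their round is dropped, while team E stays in because its one opponent is well connected.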
## Effect of EV Baselines on Ratings Accuracy

To measure the effect of setting EV baselines on ratings inclusion, I calculated ratings from the 2013-14 season using six different sets of round data: (1) all rounds at majors and large regionals, (2) all rounds at all tournaments (with at least 20ish teams entered), (3) all rounds with teams above .1 EV, (4) all rounds with teams above .15 EV, (5) all rounds above .2 EV, and (6) all rounds above .25 EV.

I then conducted two "experiments" using those ratings. First, I used them to predict the results of the 2014 NDT. Second, I ran a simple bootstrap simulation of 10,000 tournaments, using the various ratings to predict their results.

First, the NDT. Lower scores mean less error in the predictions. The number of teams listed includes a lot of partial repeats due to debaters switching partners.

## NDT Predictions
MSE and MAE are pretty basic measures of error, but they have the advantage of being intuitive. MAE, or mean absolute error, simply measures the average amount that the ratings miss per round. Here it is represented as a percentage, meaning that all of the ratings miss by an average of roughly 25%. If this sounds like a lot, it is important to keep in mind the way that prediction error is calculated. Let's say that the ratings pick Team A as a 75% favorite (3:1) over Team B. If Team A wins in front of a single judge panel, then they score "1," or "100%." The ratings error is calculated by subtracting the predicted outcome (0.75) from the actual outcome (1.00), meaning that despite picking Team A as a heavy favorite, there was still an error of 25%. It is also this difference between actual outcome and prediction that forms the basis of the calculation that determines how, and how much, the ratings change for each team after the round.

MSE, or mean squared error, is only mildly more complicated. Its importance here is that it more heavily weights "big misses."

Looking at the numbers, we can spot a few things:

- Including only majors & large regionals does pretty well at reducing MAE in the NDT predictions, but it performs the absolute worst when big misses are weighted.
- Including all rounds from all tournaments has the worst MAE, but an average MSE.
- Setting **an EV baseline of 0.2 produces both the best MAE and the best MSE**, but we see diminishing returns with the more restrictive 0.25 baseline. In fact, the 0.25 baseline produces the second worst MSE of the group.
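For readers who want to see the two error measures spelled out, the sketch below computes both from a list of predicted win probabilities and actual outcomes. The numbers are made up for illustration and are not taken from the NDT results.

```python
def mean_absolute_error(predictions, outcomes):
    """Average size of the miss per round."""
    return sum(abs(o - p) for p, o in zip(predictions, outcomes)) / len(predictions)


def mean_squared_error(predictions, outcomes):
    """Average squared miss per round; big misses count disproportionately."""
    return sum((o - p) ** 2 for p, o in zip(predictions, outcomes)) / len(predictions)


# Hypothetical rounds: predicted win probability for the favorite,
# and whether the favorite actually won (1.0) or lost (0.0).
example_predictions = [0.75, 0.75, 0.60, 0.90]
example_outcomes = [1.0, 0.0, 1.0, 1.0]

mae = mean_absolute_error(example_predictions, example_outcomes)
mse = mean_squared_error(example_predictions, example_outcomes)
print(f"MAE {mae:.3f}, MSE {mse:.5f}")
```

The single upset (a 75% favorite losing) contributes half of the total absolute error but more than two thirds of the total squared error, which is the sense in which MSE weights big misses more heavily.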
Now to the simulated tournaments. You may notice a significant difference in the size of the numbers. This is because of the way that the ratings calculate error in rounds with multi-judge panels. The ballot count on a panel produces a fraction of a win ranging from 0 (unanimous loss) to 1 (unanimous win), with various percentages in between (for example, a 3-2 ballot count would count as a 0.6 win). This allows the predictions to get much closer to the correct outcome on panels than they do when the outcome is a single judge binary win (1) or loss (0). The NDT is judged entirely by panels, whereas the bootstrap simulation draws from the entire season of results, the majority of which are standard single judge prelims. To help address this difference, I've also added a third metric, called binomial deviance, that is a bit better at evaluating the error in binary outcome situations.

EDIT: In the original version of this post, there was a significant error in the way that binomial deviance was being calculated. As a result, I had to rerun the simulation. In the end, the correction only very mildly influenced my conclusions, giving a tiny bit more evidence in favor of the EV .15 ratings.

## Bootstrap Simulation Tournament Predictions
Things to note:

- Including only the majors and large regionals is the **worst by every metric** and by a large margin.
- Including all rounds from all tournaments is second worst at both MAE and MSE and only average at BDV.
- The **0.15 EV baseline comes out best** in both MSE and BDV, and only barely behind in MAE. EV .2 performs well in terms of MAE and MSE, but is average according to BDV.
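As a rough illustration of the panel fractions and the binomial deviance (BDV) metric discussed above, here is one common formulation. The exact formula used for these ratings isn't given in the post, so the natural log (some versions use base 10), the clipping of predictions to avoid taking the log of zero, and the function names are all assumptions.

```python
import math


def panel_outcome(ballots_for: int, ballots_against: int) -> float:
    """Fraction of a win earned on a panel, e.g. a 3-2 decision is 0.6."""
    return ballots_for / (ballots_for + ballots_against)


def binomial_deviance(predictions, outcomes, clip=0.01):
    """Mean binomial deviance (log loss) of predictions against outcomes.

    Assumptions for this sketch: natural log, and predictions clipped to
    [clip, 1 - clip] so a fully confident miss stays finite.
    """
    total = 0.0
    for p, y in zip(predictions, outcomes):
        p = min(max(p, clip), 1.0 - clip)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(predictions)


# A coin-flip prediction scores ln(2) regardless of the result.
coin_flip = binomial_deviance([0.5], [1.0])

# A confident miss is penalized much more than a hesitant one.
confident_miss = binomial_deviance([0.9], [0.0])
hesitant_miss = binomial_deviance([0.6], [0.0])
```

Unlike MAE, this measure grows without a ceiling (up to the clip) as a wrong prediction becomes more confident, which is why it is often preferred for binary win/loss results.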
Both analyses are based on a single season's worth of data, but they both seem to confirm the same basic conclusion: basing the ratings on either just the majors & large regionals or on nearly all tournaments produces the least accurate predictions. By contrast, setting an eigenvector centrality baseline of somewhere around 0.15 or 0.2 produces the most accurate ratings.

Next, I am going to look at how these ratings match up with 1st & 2nd round at-large bid voting. While this voting admittedly comes from a small group of individuals, it should provide us at least one point of comparison to see whether the ratings produce outcomes in line with community expectations.

## Effect of EV baselines on At-Large Bid Ranking

For the sake of simplicity, I'm only going to produce three sets of rankings to contrast with the voters: (1) all rounds from majors & large regionals only, (2) all rounds from all tournaments with at least 20ish entries, and (3) all rounds from teams with an EV centrality above 0.15.

## First Round At-Large Ranking
The first thing you should notice is a striking amount of similarity. The rankings produced by the Glicko system are pretty similar across the board with minor variations. Additionally, they produce a nearly identical set of bid teams. If Harvard HX gets excluded for being the 3rd Harvard team, then:

- The majors & large regionals ratings pick 16 of 16
- The All Rounds ratings pick 15 of 16, preferring Oklahoma LM over Kansas BC
- The EV 0.15 ratings also pick 15 of 16, with Oklahoma LM again stealing the 16th spot
One may wonder about the Wake Forest MQ discrepancy. This was discussed in my previous post about round robins, and it is largely due to the distorting effect of the Kentucky RR, the removal of which would bump them up into the 9-11 range. Some notable, but relatively small, gainers are Michigan AP, Wake Forest LW, Georgetown EM, Michigan State RT, West Georgia AM, and Harvard HX. Moderate losers include Towson JR, Berkeley MS, and Texas FM.

On the whole, however, teams are pretty close to the spots assigned by the voters. The average (mean) deviation of each of the rankings from the voters is in the neighborhood of 2.1 spots. And though I didn't bother to run the calculations again with fixes made to the Kentucky RR, my rough estimate is that making those changes could get the mean deviation somewhere close to 1.5 or 1.6. Such a deviation would be in line with the average voter's difference from the aggregate voting preference.

One final observation: it makes intuitive sense that the different Glicko ratings wouldn't be too far off from one another for these teams. They mostly attend majors and probably don't compete as much against those teams that would either be included with the All Rounds rating or those that would be excluded by the EV baseline ratings.

Now, for second rounds. One important factor here is that the Glicko ratings did not include district tournaments in their calculations. If a team had an exceptionally strong or weak district tournament, it may have been accounted for by the voters, which would produce a discrepancy.

## Second Round At-Large Voting
Observations:

- The All Rounds and EV .15 ratings produce the same set of teams, and match the voters on 15 of 17 bids. The Majors & large regionals ratings match the voters on 14 of 17.
- The Majors & large regionals ratings really hurt Oklahoma BC, who is highly preferred in the other sets of ratings but drops to 11th (and out of the bids due to being behind too many other "3rd teams").
- As I noted in an earlier post, the ratings really like Baylor EB, who jumps more than 10 spots compared to the voters, but this is one that could be heavily influenced by the fact that the ratings didn't include their poor districts results. Nevertheless, I don't think there's a world in which they'd drop out of the top 17.
- The All Rounds and EV .15 ratings were not as high as the voters on Wake Forest DL, but the Majors rating did like them. My best guess for the reason for this is just that the ratings for teams 4 through 9 are quite compact. It wouldn't take much to push a team up or down a couple of spots.
Overall, the mean deviance of the ratings from the voting average was 2.83 for the Majors, 2.5 for All Rounds, and 2.39 for EV .15, making it the closest. In fact, all of the ratings were closer to the voting average than were the average of each individual voter's deviance, which was 3.13 spots. More narrowly, only 3 individual voters were closer to the voting average than the EV .15 ratings (DCH, Lee, and Repko).

## Conclusion

Based on the above findings, I'm in the process of reworking the 2014-15 ratings to include all rounds from (nearly) all tournaments with participants who are above a baseline 0.15 EV score. I think this checks all the advantage boxes. It's more "fair," it involves less personal judgment on my part, and it is ultimately more accurate anyway.
One might wonder why I've said "nearly" all tournaments. This is mostly a function of time. It takes a not insignificant amount of time to format, clean, and process each tournament's results. Then, I would need to further test them to make sure that nothing super kooky happens with extremely small tournaments. So, for now, I'm keeping it at tournaments that have at least 20ish teams. I might dip below that if it's close enough to round up. Or I might include a smaller tournament if there is a specific reason that it ought to be included.

Here is the list of tournaments that fell into each category above...

Majors & Large Regionals (15 tournaments): UMKC, GSU, Kentucky RR, Kentucky, UNLV, Harvard, Wake Forest, USC, UTD, CSUF, UNT, Dartmouth RR, Pittsburgh RR, Weber RR, Texas.

(Nearly) All Tournaments (30 tournaments): UMKC, Northeast Regional Opener, GSU, Gonzaga, Kentucky RR, Kentucky, JMU, KCKCC, ESU, UNLV, Harvard, Clarion, UCO, Vanderbilt, CSUN, Wake Forest, Weber, USC, UTD, CSUF, UNT, Chico, Navy, Dartmouth RR, Indiana, Pittsburgh RR, Weber RR, Wichita State, Georgia, Texas.

Just kidding. I'm not really going to answer this question, mostly because it involves some pretty important value questions that can only be answered through community consensus. However, I do think that it is important to at least consider the concrete impact the decision would have on the outcomes of the ratings. I went back to the data for the 2013-14 season and calculated the results both with and without the Kentucky & Dartmouth round robins. For the most part, the results confirmed my expectations, but there were a few interesting issues that emerged along the way.

As a caveat, keep in mind that this analysis is based on the data from a single season, and even more narrowly, the results of two tournaments. It's not possible to draw truly rigorous conclusions from such a small sample.
On the other hand, there are a couple of ways that we can try to minimize the limitations of the sample size, which I will get to toward the end of the post.

## Overview

One common argument against including round robins is that they potentially give an unfair advantage to participants when it comes to first round bid voting or other forms of recognition. I can't really address this issue specifically, though I suspect it may be true. However, I can look at whether the inclusion of the round robins gives participants an unfair advantage when it comes to the Glicko ratings.

Quickly, a reminder concerning exactly what the Glicko ratings attempt to represent. In contrast to many other methods of rating competitors, Glicko is not primarily a representation of a team's previous competitive results. Instead, it is an estimate of a team's ability relative to other teams. It uses past results to make predictions about what future results would be. Specifically, the difference between two teams' ratings can be translated into a win probability prediction for a hypothetical round between them.

This is an important distinction because it means that ratings go up or down based not on raw success or failure, but rather on how a team performs relative to expectations. As a result, a team with a high rating will not necessarily see a ratings gain after a strong performance because they were already expected to have a strong performance. It was "baked in" to their rating going in. Their rating will only go up if they perform even better than they were expected to.

This has important implications for how tournaments like the round robins would be weighed in the ratings calculation. Figuring out how to weigh a round robin has always bumped up against the fact that the difficulty is not easily comparable to a regular tournament. 5-3 at the Kentucky round robin is not equivalent to 5-3 at any other tournament, even the NDT.
As a solution to this problem, some have suggested only "rewarding" success at a round robin, but not "punishing" failure. This is a strange logic to me. Should elim losses also not be considered? Any loss against a team that is considered to be good? This feels wrong to me, but I do agree that we should try to figure out a good way of incorporating these rounds in a way that best accounts for their difficulty while also not giving people an advantage for merely being present. I think the Glicko rating system has a pretty good answer to the problem for a couple of reasons:

1. The aggregate average rating for all teams at the end of a tournament will always be nearly equivalent to the aggregate average at the beginning of the tournament. This is because ratings are determined relationally and any gain made by one team is always matched with a loss by another. A team's rating will never suffer because they did not attend a specific tournament. Similarly, a team's rating will never be advantaged through attendance at a specific tournament.
2. Since ratings changes are determined based on how a team performs relative to expectations, opponent strength (and thus, indirectly, the strength of the tournament) is the core element. A lower rated team will not be expected to win very many rounds, so going 1-7 or 2-6 wouldn't necessarily have much of any negative impact on their rating. However, if they went 3-5 or 4-4, then they might see a solid ratings bump despite having a mediocre record.

Looking at the 2013-14 round robins, we can evaluate how teams performed relative to expectations and then how that impacted their final ratings.

## How the Round Robins Impact Ratings

For the moment, I'm going to use the end of season (but pre-NDT) ratings to make the tournament predictions rather than the rating that each team had going into the tournaments, the assumption being that it would be a more accurate representation of the team's skill because it has more data to support it.
This assumption is not necessarily supportable because it doesn't account for the growth or regression that teams often experience over the course of the season. For reasons that will become apparent later, however, using the pre-tournament rating is not necessarily the best way to gauge the long-term effects of inclusion of the tournaments either.

Below are two tables, one for Kentucky and one for Dartmouth. The tables include each team's end of season rating, a calculation of how many wins they would be expected to gain at the tournament, and how many they actually achieved. The expected wins stat was calculated by adding together a team's win probabilities (each of which can be understood as expressing a fraction of a win) against each opponent.

## Kentucky RR
## Dartmouth RR
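The expected wins statistic used for both round robins is just a sum of per-opponent win probabilities. A minimal sketch follows, assuming the plain Elo-style logistic prediction (full Glicko also factors in rating deviation) and using made-up ratings rather than any team's actual numbers.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Elo-style expected score for A against B (rating deviation ignored)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def expected_wins(team_rating: float, opponent_ratings) -> float:
    """Sum of win probabilities, each one counting as a fraction of a win."""
    return sum(win_probability(team_rating, opp) for opp in opponent_ratings)


# Hypothetical eight-round schedule for a hypothetical 1700-rated team.
schedule = [1650, 1720, 1800, 1600, 1690, 1750, 1580, 1710]
expectation = expected_wins(1700, schedule)
actual_wins = 5
print(f"expected {expectation:.2f} wins, actual {actual_wins}, "
      f"difference {actual_wins - expectation:+.2f}")
```

A team's rating moves according to that difference, not the raw record, so the same 5-3 can be a gain against a strong field and a loss against a weak one.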
At first glance we can spot a few things:

1. Michigan AP substantially outperformed expectations. Between the two tournaments, they won about 3 more rounds than expected.
2. Wake MQ was 1.24 wins short of expectations and Oklahoma LM was 1.03 wins short. Otherwise, all other teams were within a win of their expectation.

Concerning the impact that the tournament results should have on the ratings change post-tournament, we would expect that Michigan should be substantially advantaged by ratings that include the round robins, Wake & Oklahoma should be a little bit disadvantaged, and everybody else should stay about the same.

Below is a table that shows a comparison between the end-of-season (pre-NDT) ratings that include the round robins versus those that exclude them. The above predictions are certainly borne out, but there are actually a couple of unexpected developments as well...
First, Michigan definitely does benefit from inclusion of the tournaments. Removing the round robins drops them from 2nd overall to 5th (falling behind Wake's LeDuc & Washington as well, though this might not be fair because I left in the results of the Pittsburgh RR, where Wake LW did quite well).

Second, Wake and Oklahoma do benefit from excluding the round robins, but their gain is far larger than would be expected from their very modest deviations from the predictions. Wake gains a massive 71 points, enough to move from 13th overall up to 8th. Oklahoma gains only (a still large) 25 points, but that is enough to move them up 8 spots in the rankings from 24th to 16th.

Third, the impact on the majority of the teams is negligible. 8 teams stay within 1 ranking spot of where they would have been otherwise.

Fourth, somehow Harvard BS lost a couple of points with the inclusion of the round robin despite having outperformed expectations by 3/4 of a win.

Fifth, what's up with Kentucky GR?!?!?! They were only about a half a win short of expectations, but somehow they lost 31 points and 8 ranking spots at the round robin.

Finally, and most subtly, there is something very peculiar happening with the aggregate averages. Remember, above I said that the aggregate average of all teams' ratings at a tournament should be about the same at the end of the tournament as it was at the beginning. Points are basically zero sum. However, in this instance, we see that the average is 15.5 points lower when the round robins are included. This seemingly shouldn't be possible.

After some digging, I was able to figure out that what is happening in these latter three observations springs from the same source, and it presents a real problem for how to evaluate the results of the Kentucky round robin in particular.

Glicko ratings are vulnerable to overall point deflation in one specific circumstance: when very good teams have been inactive or are significantly underrated.
This is one consequence of the fact that all teams (in the unweighted version of the ratings) start their first tournament with a default rating of 1500. Despite the fact that we know this rating is too low, this is not usually a problem. Since the "average" is always necessarily the midpoint of a given pool, there will always be, by definition, the same amount of ability points (though not necessarily debaters) that are above average as below. However, when we start adding new debaters to the pool at subsequent tournaments, we don't know if they're going to be above average or below average. If they're above average, then the default of 1500 will underrate them (and vice versa). The reason that this can cause deflation is because the only way for the significantly underrated team to get its proper rating is to take points from somebody else. At a large invitational tournament, the impact of this will not be particularly noticeable for a few reasons, including among other factors that there will also be a set of overrated debaters from whom points will be redistributed to make up for it. At the 2013 Kentucky round robin, however, it was very noticeable. Here are the ratings of each team heading into the tournament (after only 1 period of tournaments had been entered into the ratings - UMKC & GSU) compared to their end of season ratings:
Overall, what we have is a very small set of debaters, some of whom are massively underrated. Oklahoma is rated 1500 because they didn't compete in a season opener, so the round robin was their first tournament. Mary Washington had an extremely poor performance at Georgia State, going 5-3 and losing in doubles. To put it in perspective, Mary Washington's true rating would make them a 6:1 favorite over a team with a rating of 1546. In contrast, only one team entered the round robin with a better rating than they would end the season with - Wake Forest - and there was also one other team that was only moderately underrated, which was Kentucky. The effect that these discrepancies can have on the ratings is not insignificant. If everybody were underrated, it wouldn't be a big deal. They would trade points among each other, and then go back to a normal invitational tournament, where they would start stealing points away from the larger pool. However, the large difference in the accuracy of each team's ratings created a situation in which a couple of teams were losing disproportionately large amounts of points because they were wrongly predicted to win rounds that they should have been predicted to lose. To make it more concrete, Kentucky was actually considered to be a 64% favorite against Mary Washington. Their end of season ratings would indicate that in actuality Mary Washington was a 76% favorite (better than 3:1) in that debate. The result is that when Wake Forest and Kentucky lost, they lost big. This issue certainly does raise questions about how to best approach the inclusion of the Kentucky round robin. It honestly would not really have been as much of a problem if Mary Washington, Oklahoma and Michigan hadn't been such outliers. There would have still been some discrepancy, but the impact would have been substantially less. For example, consider instead the ratings of the teams prior to the Dartmouth RR:
There is still some over- and under-rating, but that's how it should be or there would be no reason to have the debates. Here, the discrepancy is in fact within the expected amount of ratings deviance assigned to each team. As a result, the effect of Dartmouth round robin results on post-tournament ratings is much more consistent with what we would expect than was Kentucky. Looking at the difference between the final season (pre-NDT) ratings when Dartmouth is included versus excluded, we see the following:
These results are much more like what we should expect. There is some movement in ratings, but nothing out of the ordinary. Mary Washington benefits from the inclusion of the round robin, which makes sense because they won 1.4 rounds more than predicted. Wake Forest loses ground because they fell .9 wins short of their prediction. Everybody else was within half a win of their prediction, so they saw only a marginal change in their rating.

## How the Round Robins Impact Predictive Accuracy

That tournaments have an impact on ratings is not on its own a bad thing. If those changes make the ratings more accurate, then there would be an argument in favor of including those results anyway, maybe even if it causes overall rating deflation.

As a simple test, we can compare how well each set of ratings is able to predict results at last year's NDT. Below are the mean square error and the mean absolute error of 4 different sets of ratings.
I'm not really going to get into what these averages mean here, but the upshot is that there is not a whole lot of difference. The ratings that totally exclude the round robins end up being the most accurate, but not by much. Oddly, excluding Dartmouth is comparatively better than excluding Kentucky despite the issues discussed above. However, basing an assessment of the predictive accuracy of the ratings on how well they predict a single tournament isn't a terrifically great metric. There is a ton of noise in a single tournament. Given the small margins, it wouldn't take many upsets for the metric to be thrown off. To solve this problem, we can run a bootstrap simulation that significantly expands our dataset. To do this, we take the entire set of 2013-14 rounds and repeatedly re-sample the data, creating as many hypothetical tournaments as we like. Then, we can see which set of ratings does the best job of predicting those hypothetical tournaments. I ran the ratings through 10000 tournament simulations, averaging their errors.
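The resampling idea can be sketched in a few lines. This is an illustration under stated assumptions, not the simulation actually run for the post: rounds are drawn with replacement into fixed-size hypothetical tournaments, predictions use the plain Elo-style logistic formula, only MAE is tracked, and all names and data are invented.

```python
import random


def win_probability(rating_a: float, rating_b: float) -> float:
    """Elo-style expected score for A against B (rating deviation ignored)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def bootstrap_mae(rounds, ratings, n_tournaments=1000, rounds_per_tournament=50, seed=0):
    """Average MAE of a rating set over resampled hypothetical tournaments.

    rounds: list of (team_a, team_b, outcome_for_a) with outcome in [0, 1].
    ratings: {team: rating} for the rating set being evaluated.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_tournaments):
        # Sample with replacement from the full season of rounds.
        sample = rng.choices(rounds, k=rounds_per_tournament)
        miss = sum(
            abs(outcome - win_probability(ratings[a], ratings[b]))
            for a, b, outcome in sample
        )
        errors.append(miss / len(sample))
    return sum(errors) / len(errors)


# Toy season: Team A always beats Team B.
toy_rounds = [("A", "B", 1.0)] * 20
informed = bootstrap_mae(toy_rounds, {"A": 1700, "B": 1300})
uninformed = bootstrap_mae(toy_rounds, {"A": 1500, "B": 1500})
```

Running the same resampled tournaments against each candidate rating set, and comparing the averaged errors, is what lets one season of rounds stand in for many more tournaments than were actually held.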
You may notice that these numbers are substantially higher than the numbers given above regarding NDT predictions. This is due to the fact that every round at the NDT is judged by panels, whereas the results of most other debates during the season are binary. Split decisions on a panel are calculated as a fraction of a win, allowing the ratings to get significantly closer to an accurate prediction. For example, if a 67% favorite wins on a 2-1, then the ratings were essentially right on. Whereas, if a 67% favorite wins on a 1-0, then the prediction is calculated as being 33% off. For this reason, I've included an additional metric for evaluating error, called binomial deviance. This stat is designed as a better way to evaluate error when faced with binary predictions. Since the majority of the rounds in the simulations will be binary win/loss debates, this might be a better way of comparing the ratings.

Returning to the numbers, we find results much more in line with what we would have expected given the problems with the Kentucky ratings. The most accurate predictions still come from the ratings that exclude both round robins, but almost all of the error of the round robins is accounted for by merely removing the Kentucky results. In fact, when it comes to mean absolute error, the ratings that include Dartmouth produce nearly the same outcome as those that exclude both tournaments.

Again, a cautionary note about too quickly drawing conclusions from the data. Even though I simulated a large set of tournaments to compare the ratings against, we're still only talking about the impact that 1 or 2 tournaments are having on the ratings themselves. Once I get the 2012-13 data cleaned up it will offer another point of comparison, but that's the end of data available on tabroom.com. It would be possible to simulate the round robins for resampling, but I'm not sure that that will necessarily be that helpful.
What does seem clear from this analysis is that there is a need to rethink how the Kentucky data is handled. While its effect on the ratings overall is not exactly huge, the damage is clearly observable and somewhat predictable given how early the tournament is in the semester. It is notable, however, that this should only be a problem for the "unweighted" version of the ratings (for more on the difference between "weighted" and "unweighted" ratings, see the FAQ). A system that assigned weighted start values to team ratings at the beginning of the year (perhaps based on previous season ratings) would mitigate the risk of underrating.