I apologize to anybody who has been curious about the post-swing ratings changes. The delay has been in large part because I have been working on a significant revision to the way that the ratings are calculated. Previously, only national-level and large regional-level tournaments were included in the ratings calculations. The reason behind this decision is discussed below, but the short version is that now the ratings will be able to include all (or nearly all) tournaments. Based on an analysis of the 2013-14 season, I believe that it is possible to include a much larger variety of tournaments in a way that not only doesn't hurt the accuracy of the ratings, but actually improves it.
Why Tournament Inclusion is an Issue
One might wonder why it would ever be a problem to include a wider variety of tournaments. After all, isn't more data better? Generally, this is true. In fact, Glicko/Elo-style ratings have a significant advantage over some other kinds of ratings because tournament-level data never enters the calculation other than as a specific time period during which a set of rounds occurs. Whereas some other ratings have to "weight" results based on the perceived "quality" of the tournament, Glicko ratings sidestep this question because the only thing that enters the calculation is each head-to-head match-up and who won it. In other words, these ratings don't consider that "Team X went 7-1 at Harvard." Instead, they are only concerned with the fact that "Team X beat Teams A, D, F, R, T, V and Z, and lost to Team Q."
Nevertheless, there is one sense in which the choice to include a tournament has an impact on outcomes and the predictive accuracy of the ratings. Ratings based exclusively on head-to-head data still depend on every team's opponents being measurably related to every other team's opponents. At a basic level, this means that sample size is important, but sample size can be easily addressed by excluding teams whose sample is too small. The harder problem to overcome is the distortion that can be caused by teams whose results are based on an excessively regionalized set of opponents. Since the meaning of a team's rating is really nothing more than their relative position within the overall network of teams, Glicko ratings run into a problem when a team is not adequately "networked."
Take an extreme example. Imagine a small regional tournament in which none of the participants ever attend another tournament. Since the ratings are determined relationally, the winner may hypothetically emerge from the tournament with a rating of 1750 or higher (1500 is average; a team rated 1750 would be considered a 4:1 favorite over an "average" team). Since their and their opponents' ratings were not ever influenced by the ratings of other teams outside the tournament, this rating would be highly unreliable. Basically, what we have is a "big fish in a little pond" problem.
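To see where the 4:1 figure comes from, here is a minimal sketch using the standard Elo expected-score formula with its usual 400-point scale. Glicko additionally factors in each team's rating deviation, which this sketch ignores, so the function names and exact numbers are illustrative rather than the calculation behind the ratings.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Elo-style probability that A beats B (rating deviation is ignored here)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def odds(probability: float) -> float:
    """Convert a win probability into odds in favor (e.g. 0.8 -> 4.0, i.e. 4:1)."""
    return probability / (1.0 - probability)


p = expected_score(1750, 1500)
print(f"win probability: {p:.3f}, odds: about {odds(p):.1f}:1")
```

Under these assumptions, a 250-point gap works out to a win probability of a little over 80%, which is where "roughly 4:1" comes from.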
A less extreme, but still noticeable, version of this problem exists when the ratings attempt to include small regional tournaments. Many of the teams that attend these tournaments never achieve a point where their opponents have become representative enough of the larger pool of debaters.
People already consciously or subconsciously account for this problem somewhat intuitively when they make judgments of team quality. That's why, for instance, at-large bid voters may account for wins at regional tournaments but weigh them less, and generally it is very difficult for a team to get a second-round bid if they haven't attended at least a couple of national-level tournaments.
I believe that I have figured out a way to include (nearly) all tournaments in a way that doesn't undermine the reliability of the ratings.
Social Networking Analysis & Eigenvector Centrality
Instead of making a decision on whether a tournament is adequately "representative" of the larger national circuit, we can rather focus narrowly on whether each individual team is adequately "interconnected" to make their rating reliable. If they are not, then we can exclude all of the rounds that they participate in, meaning that their opponent will neither be rewarded nor punished for the results.
Eigenvector centrality (EV) is a metric used in social networking analysis to determine how connected any given individual is to the larger network. A basic definition can be found here. In short, "Eigenvector centrality is calculated by assessing how well connected an individual is to the parts of the network with the greatest connectivity. Individuals with high eigenvector scores have many connections, and their connections have many connections, and their connections have many connections … out to the end of the network" (www.activatenetworks.net). Eigenvector centrality has been used in a number of applications, most notably Google's PageRank algorithm.
While it could theoretically be possible to use eigenvector centrality as the central mechanism for an entire ratings system, for my purposes I am using it in a much more limited capacity. It can be used to set a baseline degree of centrality that a team must meet for their results to be included in the Glicko calculation. Here, a high eigenvector centrality is not necessarily an automatic virtue (you could be highly interconnected and yet lose a lot of debates), but a very low EV score could be grounds for exclusion because of the unreliability of the data.
Below I hope to show that setting a low EV hurdle is both a means to sidestep the tournament inclusion problem and a way to improve the overall accuracy of the ratings.
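As a rough illustration of the idea (not the code behind the ratings), eigenvector centrality can be computed by power iteration on the graph of who debated whom, and the result can then be used to drop rounds involving poorly connected teams. Everything named here is an assumption of the sketch: the function names, the toy schedule, the choice to scale scores so the best-connected team gets 1.0, and the single-pass filter that does not recompute centrality after rounds are removed.

```python
from collections import defaultdict


def eigenvector_centrality(rounds, iterations=500, tol=1e-10):
    """Power iteration on the undirected "who debated whom" graph.

    rounds: iterable of (team_a, team_b) pairs; repeat meetings add weight.
    Returns {team: score}, scaled so the best-connected team scores 1.0
    (that scaling is an assumption of this sketch).
    """
    weight = defaultdict(lambda: defaultdict(float))
    for a, b in rounds:
        weight[a][b] += 1.0
        weight[b][a] += 1.0
    teams = list(weight)
    score = {t: 1.0 for t in teams}
    for _ in range(iterations):
        # Each team inherits its opponents' scores. Keeping its own score too
        # (iterating on A + I) gives the same leading eigenvector but avoids
        # oscillation on bipartite-looking schedules.
        new = {t: score[t] + sum(w * score[o] for o, w in weight[t].items())
               for t in teams}
        top = max(new.values())
        new = {t: v / top for t, v in new.items()}
        delta = max(abs(new[t] - score[t]) for t in teams)
        score = new
        if delta < tol:
            break
    return score


def filter_rounds(rounds, baseline=0.15):
    """Drop every round that involves a team below the EV baseline."""
    ev = eigenvector_centrality(rounds)
    return [(a, b) for a, b in rounds if ev[a] >= baseline and ev[b] >= baseline]


# Made-up schedule: A, B, C and D all debate each other, E only meets D,
# and X and Y only ever debate each other (the "little pond").
rounds = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
          ("C", "D"), ("D", "E"), ("X", "Y")]
ev = eigenvector_centrality(rounds)
kept = filter_rounds(rounds)
print({t: round(s, 2) for t, s in sorted(ev.items())})
print(kept)  # the X-Y round is dropped; every other round survives
```

In this toy schedule the isolated pair ends up with a score near zero, so their round is excluded, while a lightly connected team like E still clears a 0.15 baseline because its one opponent is well connected.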
Effect of EV Baselines on Ratings Accuracy
To measure the effect of setting EV baselines on ratings inclusion, I calculated ratings from the 2013-14 season using six different sets of round data: (1) all rounds at majors and large regionals, (2) all rounds at all tournaments (with at least 20ish teams entered), (3) all rounds with teams above .1 EV, (4) all rounds with teams above .15 EV, (5) all rounds above .2 EV, and (6) all rounds above .25 EV. I then conducted two "experiments" using those ratings. First, I used them to predict the results of the 2014 NDT. Second, I ran a simple bootstrap simulation of 10,000 tournaments, using the various ratings to predict their results.
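The post does not spell out how the simulated tournaments were constructed, so the following is only one plausible reading of a "simple bootstrap": resample rounds from the season with replacement to form each simulated tournament, then score the predictions on that sample. The function name, the tournament size, and the toy data are all assumptions.

```python
import random


def bootstrap_mae(rounds, n_tournaments=10_000, rounds_per_tournament=100, seed=0):
    """Average prediction error over resampled "tournaments".

    rounds: list of (predicted_probability, actual_outcome) pairs for one
    side of each round, where the outcome is between 0 and 1.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(n_tournaments):
        # Draw rounds with replacement to form one simulated tournament.
        sample = rng.choices(rounds, k=rounds_per_tournament)
        total += sum(abs(outcome - pred) for pred, outcome in sample) / len(sample)
    return total / n_tournaments


# Made-up season of four rounds, purely for illustration.
season = [(0.75, 1.0), (0.75, 0.0), (0.60, 1.0), (0.90, 1.0)]
result = bootstrap_mae(season, n_tournaments=1000, rounds_per_tournament=50)
print(f"bootstrapped MAE: {result:.3f}")
```

Running the same procedure once per ratings variant, on the same resampled rounds, would give directly comparable error figures.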
First, the NDT. Lower scores mean less error in the predictions. The number of teams listed includes a lot of partial repeats due to debaters switching partners.
MSE and MAE are pretty basic measures of error, but they have the advantage of being intuitive. MAE, or mean absolute error, simply measures the average amount that the ratings miss per round. Here it is represented as a percentage, meaning that all of the ratings miss by an average of roughly 25%. If this sounds like a lot, it is important to keep in mind the way that prediction error is calculated. Let's say that the ratings pick Team A as a 75% favorite (3:1) over Team B. If Team A wins in front of a single-judge panel, then they score "1," or "100%." The ratings error is calculated by subtracting the predicted outcome (0.75) from the actual outcome (1.00), meaning that despite picking Team A as a heavy favorite, there was still an error of 25%. It is also this difference between actual outcome and prediction that forms the basis of the calculation that determines how, and by how much, the ratings change for each team after the round. MSE, or mean squared error, is only mildly more complicated. Its importance here is that it more heavily weights "big misses."
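The two measures are easy to check by hand. Here is a minimal sketch using the 75% favorite from the example, once when the favorite wins and once when it loses; the function names and the two-round data set are illustrative only.

```python
def mae(pairs):
    """Mean absolute error over (predicted_probability, actual_outcome) pairs."""
    return sum(abs(outcome - pred) for pred, outcome in pairs) / len(pairs)


def mse(pairs):
    """Mean squared error; squaring makes big misses count for more."""
    return sum((outcome - pred) ** 2 for pred, outcome in pairs) / len(pairs)


favorite_wins = [(0.75, 1.0)]            # error of 0.25, as in the example
mixed = [(0.75, 1.0), (0.75, 0.0)]       # one correct pick, one upset

print(mae(favorite_wins))  # 0.25
print(mae(mixed))          # 0.5
print(mse(mixed))          # 0.3125
```

The upset contributes three times as much as the correct pick to MAE (0.75 vs. 0.25) but nine times as much to MSE (0.5625 vs. 0.0625), which is what "more heavily weights big misses" means in practice.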
Looking at the numbers, we can spot a few things:
Now to the simulated tournaments. You may notice a significant difference in the size of the numbers. This is because of the way that the ratings calculate error in rounds with multi-judge panels. The ballot count on a panel produces a fraction of a win ranging from 0 (unanimous loss) to 1 (unanimous win), with various percentages in between (for example, a 3-2 ballot count would count as a 0.6 win). This allows the predictions to get much closer to the correct outcome on panels than they can when the outcome is a single-judge binary win (1) or loss (0). The NDT is judged entirely by panels, whereas the bootstrap simulation draws from the entire season of results, the majority of which are standard single-judge prelims.
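A small sketch of the fractional-win idea, using a hypothetical helper name and the same 75% prediction as before, shows why errors on panels tend to be smaller than errors in single-judge rounds:

```python
def ballot_fraction(ballots_won: int, ballots_lost: int) -> float:
    """Share of the panel's ballots won: 1.0 is unanimous, 0.0 is a shutout."""
    return ballots_won / (ballots_won + ballots_lost)


prediction = 0.75  # the ratings make this team a 3:1 favorite
for won, lost in [(1, 0), (3, 0), (3, 2), (2, 3)]:
    outcome = ballot_fraction(won, lost)
    error = abs(outcome - prediction)
    print(f"{won}-{lost} decision: outcome {outcome:.2f}, error {error:.2f}")
```

A single-judge win and a 3-0 win both leave an error of 0.25, but a 3-2 win counts as 0.6 and leaves an error of only 0.15. Split decisions are only possible on panels, which is the difference between the NDT numbers and the simulation numbers.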
To help address this difference, I've also added a third metric that is a bit better at evaluating the error in binary outcome situations called binomial deviance.
EDIT: In the original version of this post, there was a significant error in the way that binomial deviance was being calculated. As a result, I had to rerun the simulation. In the end, the correction only very mildly influenced my conclusions, giving a tiny bit more evidence in favor of the EV .15 ratings.
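The post does not give the exact formula used, so the following sketch assumes one common definition of binomial deviance: the average of -[y*log10(p) + (1-y)*log10(1-p)], with predictions clipped away from 0 and 1 so that a single confident miss cannot produce an infinite penalty. The log base, the clipping bound, and the function name are all assumptions.

```python
import math


def binomial_deviance(pairs, eps=0.01):
    """Mean binomial deviance over (predicted_probability, outcome) pairs.

    Outcomes may be 0, 1, or a fractional win such as 0.6.
    Predictions are clipped to [eps, 1 - eps] before taking logs.
    """
    total = 0.0
    for pred, outcome in pairs:
        p = min(max(pred, eps), 1.0 - eps)
        total += -(outcome * math.log10(p) + (1.0 - outcome) * math.log10(1.0 - p))
    return total / len(pairs)


coin_flip = binomial_deviance([(0.5, 1.0)])
hedged_miss = binomial_deviance([(0.6, 0.0)])
confident_miss = binomial_deviance([(0.9, 0.0)])
print(f"{coin_flip:.3f} {hedged_miss:.3f} {confident_miss:.3f}")
```

Unlike MAE, this measure penalizes a confidently wrong prediction much more steeply than a hedged one, which makes it better suited to binary win/loss outcomes.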
Bootstrap Simulation Tournament Predictions
Things to note:
Both analyses are based on a single season's worth of data, but they both seem to confirm the same basic conclusion: basing the ratings on either just the majors & large regionals or on nearly all tournaments produces the least accurate predictions. By contrast, setting an eigenvector centrality baseline of somewhere around 0.15 or 0.2 produces the most accurate ratings.
Next, I am going to look at how these ratings match up with 1st & 2nd round at-large bid voting. While this voting admittedly comes from a small group of individuals, it should provide us at least one point of comparison to see whether the ratings produce outcomes in line with community expectations.
Effect of EV Baselines on At-Large Bid Ranking
For the sake of simplicity, I'm only going to produce three sets of rankings to contrast with the voters: (1) all rounds from all majors & large regionals only, (2) all rounds from all tournaments with at least 20ish entries, and (3) all rounds from teams with an EV centrality above 0.15.
First Round At-Large Ranking
The first thing you should notice is a striking amount of similarity. The rankings produced by the Glicko system are pretty similar across the board with minor variations. Additionally, they produce a nearly identical set of bid teams. If Harvard HX gets excluded for being the 3rd Harvard team, then
One may wonder about the Wake Forest MQ discrepancy. This was discussed in my previous post about round robins, and it is largely due to the distorting effect of the Kentucky RR, the removal of which would bump them up into the 9-11 range.
Some notable, but relatively small, gainers are Michigan AP, Wake Forest LW, Georgetown EM, Michigan State RT, West Georgia AM, and Harvard HX. Moderate losers include Towson JR, Berkeley MS, and Texas FM. On the whole, however, teams are pretty close to the spots assigned by the voters.
The average (mean) deviation of each of the rankings from the voters is in the neighborhood of 2.1 spots. And though I didn't bother to run the calculations again with fixes made to the Kentucky RR, my rough estimate is that making those changes could get the mean deviation somewhere close to 1.5 or 1.6. Such a deviation would be in line with the average voter's difference from the aggregate voting preference.
One final observation: it makes intuitive sense that the different Glicko ratings wouldn't be too far off from one another for these teams. They mostly attend majors and probably don't compete as much against those teams that would either be included with the All Rounds rating or those that would be excluded by the EV baseline ratings.
Now, for second rounds. One important factor here is that the Glicko ratings did not include district tournaments in their calculations. If a team had an exceptionally strong or weak district tournament, it may have been accounted for by the voters, which would produce a discrepancy.
Second Round At-Large Voting
Overall, the mean deviation of the ratings from the voting average was 2.83 for the Majors, 2.5 for All Rounds, and 2.39 for EV .15, making the last of these the closest. In fact, all three ratings were closer to the voting average than the typical individual voter was: the average of the individual voters' deviations was 3.13 spots. More narrowly, only 3 individual voters were closer to the voting average than the EV .15 ratings (DCH, Lee, and Repko).
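The "mean deviation" figures in this section can be read as the average number of spots by which a ranking differs from the voters' ranking. A minimal sketch of that reading follows; the function name is hypothetical, and the team labels and ranks are made up rather than taken from the actual ballots.

```python
def mean_rank_deviation(ratings_rank, voter_rank):
    """Average absolute difference in rank over teams present in both rankings."""
    shared = ratings_rank.keys() & voter_rank.keys()
    return sum(abs(ratings_rank[t] - voter_rank[t]) for t in shared) / len(shared)


# Made-up rankings for four hypothetical teams.
ratings_rank = {"Team A": 1, "Team B": 2, "Team C": 3, "Team D": 4}
voter_rank = {"Team A": 2, "Team B": 1, "Team C": 3, "Team D": 6}

print(mean_rank_deviation(ratings_rank, voter_rank))  # (1 + 1 + 0 + 2) / 4 = 1.0
```

The same function can score an individual voter against the voting average, which is the comparison behind the 3.13 figure.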
Based on the above findings, I'm in the process of reworking the 2014-15 ratings to include all rounds from (nearly) all tournaments with participants who are above a baseline 0.15 EV score. I think this checks all the advantage boxes. It's more "fair," it involves less personal judgment on my part, and it is ultimately more accurate anyway.
One might wonder as to why I've said "nearly" all tournaments. This is mostly a function of time. It takes a not insignificant amount of time to format, clean, and process each tournament's results. Then, I would need to further test them to make sure that nothing super kooky happens with extremely small tournaments. So, for now, I'm keeping it at tournaments that have at least 20ish teams. I might dip below that if it's close enough to round up. Or I might include a smaller tournament if there is a specific reason that it ought to be included.
Here is the list of tournaments that fell into each category above...
Majors & Large Regionals (15 tournaments): UMKC, GSU, Kentucky RR, Kentucky, UNLV, Harvard, Wake Forest, USC, UTD, CSUF, UNT, Dartmouth RR, Pittsburgh RR, Weber RR, Texas.
(Nearly) All Tournaments (30 tournaments): UMKC, Northeast Regional Opener, GSU, Gonzaga, Kentucky RR, Kentucky, JMU, KCKCC, ESU, UNLV, Harvard, Clarion, UCO, Vanderbilt, CSUN, Wake Forest, Weber, USC, UTD, CSUF, UNT, Chico, Navy, Dartmouth RR, Indiana, Pittsburgh RR, Weber RR, Wichita State, Georgia, Texas.