The ratings have been updated to include the Kentucky round robin.
The "change" column reflects the change from the data set that did not include the RR to the data set that does (in other words, the change from post-Wake results without the RR to post-Wake results with). As a result, relatively few teams show much of any change. I might only leave this up for a short time, then post the data that shows the change as if the RR had been included all along.
A few things to note:
1. The only team that was substantially affected was Kansas CK, who had an exceptionally poor Run for the Roses. I can only hope that they don't still play "You Can't Always Get What You Want" after each round is announced. Below you can see the difference between the two ratings, the blue line representing the ratings that didn't include the round robin and the red line representing those that did. Time periods represent blocks of tournaments (1 = season openers, 2 = Kentucky & Weber, 3 = UNLV & Harvard, 4 = Wake).
Some thoughts about what's going on here. I think it's obviously fair to say that going 0-8 at the Kentucky RR is not the equivalent of going 0-8 at any other tournament. To some degree, the ratings already account for this: KU still left the Kentucky weekend with just under a 1600 rating, which is above average. Nevertheless, it's also true that KU's rating at that point almost certainly underrated them. This is where one of the strengths of the ratings comes in: they explicitly incorporate a "deviance" element. KU's deviance at that point was still around 100, which means that the system believed their "true" rating could plausibly be as high as 1800. Another mechanism that the ratings use to account for substantial under- (or over-) rating can be seen pretty easily in the way KU's ratings reacted after the RR. When rated lower, their rating grew much faster after Wake and Harvard in response to their stronger results. In the end, the two data sets produce a difference of about 70 points, with a deviance just over 60 points. Over the course of the season, the two ratings will grow closer and closer together.
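The deviance mechanism can be sketched in a few lines. This is an illustrative simplification, not the actual implementation: in Glicko-style systems, a rating plus or minus two deviances gives a rough 95% range for a team's "true" strength, which is how a just-under-1600 rating with a deviance of about 100 leaves room for a true rating as high as 1800.

```python
def plausible_range(rating, deviance, k=2):
    """Rough interval of plausible 'true' ratings: rating +/- k deviances."""
    return rating - k * deviance, rating + k * deviance

# KU after the Kentucky weekend: roughly a 1600 rating with a deviance near 100
low, high = plausible_range(1600, 100)
print(low, high)  # 1400 1800
```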
2. Harvard BS dropped from 2nd to 4th with the inclusion of the RR, falling behind Harvard DH and Michigan AP, although they're all quite close. Harvard DH and Michigan both had quite strong round robins (1st and 2nd respectively). Still, some might question whether they should jump ahead of BS, who has had an exceptional season so far (32-3 record at included tournaments). This is certainly debatable. However, it should be noted that BS's unadjusted rating is still the best of the three. It's just that they have fewer debates and so their deviance is currently higher, which drops their adjusted rating (for more on why the rankings are based on the adjusted ratings, check the FAQ). Over time this discrepancy should even out. By the end of the season, most active teams end up with deviances fairly close to one another.
The ratings have been updated with the results from Wake Forest. A couple of things to note:
First, the Kentucky Round Robin is still not included in the ratings. As soon as that data becomes available, I will include it.
Second, I've taken down the weighted ratings for the time being. Though they produce a slightly more accurate prediction, a couple of factors have prompted me to put them on the back burner. Most important among these factors is that I've been trying to clean up the data set to make sure there are no mistakes. I have this year fairly well set, but haven't had time to work through past data. It's also a more complicated calculation and I want to think a bit more about how to best implement it.
Third, I've posted a neat little prediction calculator that tells you what result the system would predict based on the current ratings of two teams.
Fourth, I've required that debaters have traveled together as partners to at least 2 tournaments in order to be listed.
Sometime in the coming days or weeks, I'll try to take a closer look at some of the changes in the ratings from Pre- to Post-Wake Forest. However, one stands out to me immediately and invites brief commentary.
Harvard DH jumped considerably in the most recent ratings -- from 9th to 3rd -- bypassing even Michigan AP. This might be a surprise to some since they shared the same 7-1 prelim record and Michigan made it to Finals while Harvard lost in Semis.
I think that this example really shows the strength of Glicko ratings. Though Michigan made it to finals, I would argue that Harvard really had the better tournament in terms of their wins and losses. Of course, there are the obvious facts that Harvard won the head-to-head matchup against Michigan and that Harvard's only losses were to Northwestern MV. But more significantly, Harvard's average opponent had a rating of 1765 compared to Michigan's opponent rating of 1719. To put it in more concrete terms, Harvard's average opponent would have been slightly more than a 56:44 favorite over Michigan's average opponent.
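That 56:44 figure can be checked against the standard logistic expected-score curve used by Elo- and Glicko-family systems. Treat this as a sketch of the rating scale rather than the production code (the real prediction also factors in deviation):

```python
def expected_score(r_a, r_b):
    """Win probability for side A on the logistic Elo/Glicko curve."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Harvard's average opponent (1765) vs Michigan's average opponent (1719)
print(round(expected_score(1765, 1719), 3))  # 0.566, slightly more than 56:44
```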
Earlier I looked at what the ratings ballot would have looked like for 2013-14 first round at-large applicants. The ratings produced a ballot very close to the consensus of the voters. This time I want to look at how it would have ranked the second round applicants.
As an initial impression, I figured that there's a possibility that the first rounds would be easier for the ratings because those at the top achieve greater differentiation since they tend to debate each other more due to elims, round robins, and how power matching produces something like a normal distribution of records. Once you get further down in the ratings, teams start to get more clustered. Here is a histogram of the adjusted weighted ratings distribution for the regular season last year.
As you can see, the frequency starts to go up pretty substantially around 1500 until it peaks around 1200. With the exception of a few outliers at the top, the range of the majority of the second round applicants was between 1500 and 1200. The mean and median of the second round applicants were both right around 1400. On the whole, there are about 100 teams with ratings above 1400, 170 teams above 1300, and 230 teams above 1200 (keep in mind these numbers include a lot of repeats because of mixed partnerships).
So how would the ballots produced by the ratings compare to the actual votes? Here's the table:
The ratings ended up fairly well resembling the consensus of the voters. The average difference between the voters and both sets of ratings was about 3.1 spots. By contrast, the average deviation of each voter from the aggregate vote was about 3.3 spots. The weighted ratings chose 14 of the same teams as the voters while the unweighted ratings chose 13. Things get a little tricky because of the limitations placed on the number of schools that may qualify a 3rd team.
Instead of focusing on each of the differences, I think it will be more useful to look at the instances where there's a noticeably large gap between the ratings and the voters. In nearly each case, the discrepancy can probably be accounted for by the fact that each team had relatively strong performances at smaller regional tournaments that were not included in the data set.
Oklahoma BC is one of the big losers: even though they end up ranked 11th in both sets of ratings, they drop behind other 3rd teams and get eliminated. OU is most harmed by the exclusion of Wichita State, where they closed out finals. Looking at the field, I might have to reconsider whether Wichita State should be included in the ratings, especially because there were at least a few second round applicants in the field.
Cal-Berkeley EM dropped 13 spots, most likely attributable to the exclusion of the Gonzaga and UNLV tournaments, where Cal performed pretty well. On the other hand, the ratings also didn't include Chico, where they had a pretty poor performance.
Gonzaga BJ dropped 10 spots due to their focus on mid-level regional tournaments (Gonzaga, Lewis & Clark, UNLV, Weber, Navy). They only attended 3 tournaments included in the ratings (UMKC, Fullerton, Texas), where their performances were not strong.
One team that significantly benefited was Baylor BE, who jumped 10 spots. They also had a few of their regular season tournaments not counted (UNLV, UCO, WSU). However, my guess is that they benefited most from 2 things. First, the ratings gave them a substantial amount of credit for their performance at Texas. That single tournament, where they went 5-3 and failed to break, jumped them from 24th to 16th. The reason for the jump is that they were significant underdogs in all 3 of their losses (Towson JR, West Georgia AM, Oklahoma BC), which would have heavily mitigated any ratings losses. They also had a pretty big win against Liberty CE. The second factor is more speculative, but I suspect that the exclusion of districts results from the ratings probably benefited Baylor, whose quite poor performance may have influenced the voters.
In the end, I think that the results indicate that the specific needs of second round bids (who tend to travel to more regional-level tournaments) might require further consideration of which tournaments get included in the data set. Some of this may have already been remedied for the 2014-15 ratings through the inclusion of the UNLV and Weber State tournaments. I'll have to take a second look at Wichita State. In the future, I may rerun the ratings with a more inclusive tournament schedule to see how it affects the results. One possible middle ground could be to include elimination round results but not prelims from smaller tournaments. This might give teams additional credit for strong performances but avoid the potential distorting effects of an isolated pool of debaters.
The addition of Wichita State tournament results bumps Oklahoma up to the 4-5 range, much more consistent with their rank by the voters.
In earlier posts, I discussed the final end of year ratings from the 2013-14 data and the selection of the first round at-large bids for the NDT. Another way to evaluate the system is to see how well the ratings at the end of the regular season are able to predict round results at nationals. Because of the gradual accumulation of data over the course of the year, this should be the point at which the ratings can claim their strongest accuracy.
Prior to each matchup, a prediction is made on the probable outcome based on each team's rating. Without going into too much detail concerning the formula, which can be found here, a ratings difference of 100 suggests a 64% chance of winning for the favorite. A difference of 200 translates into about 76%, and a 400 point difference is about 91%. In addition, the ratings deviation is factored into the prediction. The results of the round are then scored (1 for a win, 0 for a loss, the fraction of the ballot count for a panel), and the difference between the predicted outcome and the actual outcome becomes the basis for the updated post-matchup ratings.
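Those percentages are consistent with the standard logistic expected-score curve. As a sketch (deliberately ignoring the deviation adjustment mentioned above):

```python
def expected_score(r_a, r_b):
    """Win probability for side A, ignoring ratings deviation."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

for diff in (100, 200, 400):
    print(f"{diff}-point favorite: {expected_score(1500 + diff, 1500):.0%}")
# 100-point favorite: 64%
# 200-point favorite: 76%
# 400-point favorite: 91%
```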
For example, say the system gives Team A a 73% chance of winning against Team B. If Team A wins on a 3-0, then they exceeded the prediction (100% > 73%), and Team A's rating will rise marginally. However, if Team A only wins by a ballot count of 2-1, they actually fell short of expectations (67% < 73%), and their rating will actually go down a hair. If, on the other hand, Team A loses all 3 ballots, then the ratings loss will be relatively larger than their gain for a win would have been (because the difference between 0% and 73% is greater than the difference between 100% and 73%).
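The updates in that example follow a simple proportional rule: the rating change is a step size times (actual minus expected). The real system scales the step by each team's deviation; the K = 32 constant below is purely illustrative.

```python
K = 32  # illustrative step size; the actual system scales this by deviance

def rating_change(expected, actual, k=K):
    """Rating delta after a round: proportional to (actual - expected)."""
    return k * (actual - expected)

expected = 0.73                        # Team A as a 73% favorite
print(rating_change(expected, 1.0))    # 3-0 win: modest gain
print(rating_change(expected, 2 / 3))  # 2-1 win: slight drop, despite the win
print(rating_change(expected, 0.0))    # 0-3 loss: larger drop than the win's gain
```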
So one thing that we want to do is minimize the error in the round predictions. Obviously error can't be eliminated because upsets do occur (especially among those with fairly close ratings). Even if a favorite is 73% likely to win, that still means that the system thinks that they should lose 27% of the time. This effect is magnified the smaller the ratings difference. A 45% underdog should still win nearly half their debates against a 55% favorite.
Nevertheless, we should be able to gauge how well the ratings are working based on how large the difference is between their predictions and the actual results. Looking at the NDT gives some advantages because it's the one tournament where the results are not merely binary (1 or 0) because ballot counts create more of a spectrum (0, .33, .67, 1).
The mean absolute error for the predictions at the NDT was .253 for the weighted ratings and .257 for the unweighted ratings. That means that the predictions were on average within a ballot of the actual results. Below is a histogram showing the absolute values of the errors for all rounds with the weighted ratings. Lower is more accurate, higher is less accurate.
Since we're dealing with judge panels at the NDT, we can be a bit more precise than what's allowed by the binary win/loss structure of most tournaments. It's possible to crudely translate the degree of error into more concrete terms as ballot counts. In that case, we get these rates of error:
0 - 0.5 ballots: 40.1%
0.5 - 1 ballots: 29.3%
1 - 1.5 ballots: 19.9%
1.5 - 2 ballots: 8.7%
2 - 2.5 ballots: 2.2%
2.5 - 3 ballots: 0%
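The translation from score error to ballots is straightforward: with a 3-judge panel, each ballot is worth a third of a point, so multiplying the absolute score error by three expresses it in ballots. A quick sketch using made-up round values:

```python
def error_in_ballots(predicted, ballots_won, panel_size=3):
    """Absolute prediction error, expressed as a ballot count."""
    actual = ballots_won / panel_size
    return abs(predicted - actual) * panel_size

# A 73% favorite who wins 2-1 misses the prediction by about 0.19 ballots;
# the same favorite losing 0-3 misses by 2.19 ballots.
print(round(error_in_ballots(0.73, 2), 2))  # 0.19
print(round(error_in_ballots(0.73, 0), 2))  # 2.19
```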
I feel that these are pretty good numbers. They mean that the ratings picked roughly 90% of rounds "correctly" (an error under 0.500 in score terms, or less than 1.5 ballots), with 70% being less than a ballot away. Additionally, as I look at the data, the vast majority of the instances of 1.5 - 2 ballot error were occasions where the ratings predicted fairly evenly matched opponents but the slight underdog ended up winning on a 3-0.
It's hard to know exactly how "good" these predictions are in some abstract sense. Since we (wisely) don't gamble on debate rounds, we don't really have a way to measure what the community consensus on round odds would be. As a result, there's no comparison point against which to evaluate the predictions. I'm actually somewhat surprised at how good the predictions ended up, especially considering how power-matching is intended to pair teams with like ability.
It bears repeating that no prediction system can be 100% accurate (or even close) because there is variation in debate results with various degrees of upsets. Indeed, it would be somewhat sad if we could develop a prediction system that were too accurate because it would imply that round results are predestined. Furthermore, even a "correct" pick will be calculated as having some degree of error. Hypothetically, the ratings could pick a team as a 95% favorite, but when they win there will still be a 5% "error." In fact, it is this error that makes the ratings run because the degree of error is what determines a team's ratings change after a victory or loss.
One of my hopes when putting together these ratings is that they could be a help in the selection of various awards or recognitions, in particular the selection of at-large bids for the NDT. Assessing the validity of the ratings presents an interesting dilemma because the only real external source of validation is the intersubjective consensus of the community. While it may be possible that the ballots of the voters in the at-large process may not be identical to the community consensus, they are nevertheless likely indicative of the consensus of those who hold a certain amount of institutional and social power.
The table below shows what the "ballot" produced by the weighted and unweighted ratings would have been for the 2013-14 First Round at-large bids. Again, it's important to note that no effort was made to "fit" the results to match the coaches' ballots. To the extent that any fitting has been done, it was exclusively to optimize how well the ratings predicted actual round results.
It's interesting to note that both ratings systems would have produced almost exactly the same set of bids as the actual voters did. The only point of disagreement is that the ratings didn't like Kansas BC quite as much as the voters. This is pretty remarkable, especially considering that Kansas was the 16th and final bid.
The ratings would have preferred a couple of teams before Kansas, including Harvard HX (who was ineligible), Oklahoma LM, and Minnesota CE. However, it should be noted that in the raw rating score Kansas, Oklahoma, and Minnesota were virtually identical, with only a couple of points separating them (OU and UM are even tied in one). It would be interesting to go back and examine each team's results more closely. In broad strokes, I can see why KU and OU would be so close to one another. There are few major differences in their performances. KU made it to finals of UMKC whereas OU attended the Kentucky RR. OU didn't break at Harvard, but KU didn't break at Wake. KU made it a little further at Fullerton, but OU made it a little further at Texas. The bigger surprise is the presence of Minnesota, who regularly struggled in early elims. However, they did break at every tournament. The biggest piece in their favor though is probably their performance at the Pittsburgh RR, where they substantially outshone KU and OU. Without going back to dig into the data, I suspect that it was at Pittsburgh that Minnesota got boosted back into the conversation.
Out of curiosity, I compared the ratings to each voter's ballot. The voter whose ballot most resembled the weighted ratings was Will Repko, the difference between them only being on average (mean) 1.25 spots. The voter who most resembled the unweighted ratings was Dallas Perkins, with an average difference of 1.75 spots. To put those numbers into a little bit of perspective, Dallas's average difference from Repko was 2.33 spots.
Also, the weighted ratings were slightly more aligned with the overall preferences of the voters. The average deviation of the weighted ratings from voter preferences was 2.1 spots, whereas the average deviation of the unweighted ratings was 2.3 spots. I suspect that a big part of that difference can be found in the significant difference between how the two ratings evaluated Wake MQ. The unweighted version was not very friendly to them (a huge factor being the difference in how the weighted ratings evaluated the quality of their opponents at the Kentucky tournaments).
The ratings of previous seasons can be a useful heuristic to see how the ratings play themselves out over the course of the year. I produce two sets of ratings: one "weighted" and one "unweighted."
The only difference between these two systems is the starting rating of teams at the beginning of the season. In the unweighted ratings, all teams start at the same rating (1500) and can go up or down from there. One possible weakness with this approach is that it doesn't adequately account for the quality of the opposition in the early stages of the year. An alternative is to use some prior assumption to assign preseason ratings to teams, which would presumably increase the accuracy of the predictions. The disadvantage is that it could conceivably give an advantage or disadvantage to a team based on their previous year's level of success, which doesn't seem "fair" given the way that we tend to think about each season as discrete from the next. For a bit more on this dilemma and how each of the systems tries to minimize the dangers, read the section in the FAQ on the difference between the weighted and unweighted ratings.
Below are the final ratings from the 2013-14 season. One thing that's noticeable is that once all the ballots are in, there doesn't end up being much of a difference between how the two systems rank teams. There are a few exceptions, but for the most part teams tend to be in very similar locations in the order. The real difference is in the ratings spread. The weighted ratings produce a bit more differentiation between teams, and thus seem to be statistically a bit more accurate in their predictions. This could ultimately use more analysis.
There was no attempt to "fit" the data to the general community consensus about who the "good teams" are. The only fitting done was to optimize the values of the calculation's variables to reduce the error between its predictions and the actual outcomes of future rounds. This optimization is important because the accuracy of the predictions is the basis of the entire system.
One obvious oddity in the ratings is the placement of Houston Lanning & Bockmon, whose rating is far too high for their performance (2 tournaments with doubles losses). The placement is due to a current limitation of the system in how it calculates mixed partnerships. Lanning continued to accrue points over the course of the year even after he stopped debating with Bockmon. Since a team's rating is the mean of each individual's rating (and not exclusively a reflection of how they have debated together), Houston BL gradually got boosted by the results of Lanning debating with Rajwani. This problem is discussed somewhat in the FAQ, and it is one that bears further examination for a better solution.
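The averaging behavior behind that oddity can be illustrated with hypothetical numbers (the ratings below are made up for the example, not the actual values):

```python
def team_rating(member_ratings):
    """A pair's listed rating: the mean of its members' individual ratings."""
    return sum(member_ratings) / len(member_ratings)

# Hypothetical: one partner's rating freezes when he stops debating, while
# the other's keeps climbing with a new partner, pulling the pair's mean up.
print(team_rating([1450, 1450]))  # 1450.0 early in the season
print(team_rating([1450, 1650]))  # 1550.0 after the later results
```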
The table only includes the top 100 and is sortable. If you have any questions about how the ratings are calculated, visit the FAQ page.
Final 2013-14 Adjusted Ratings