College Debate Ratings

Ratings as Predictor of Bid Votes

12/6/2018

The computer ratings posted on this site are not intended to be predictors of how directors will vote in the first round and second round at-large bid process. Bid voters take into account a variety of criteria, including overall record, travel schedule, head-to-heads against other bid teams, "elim depth," perceived talent, and perhaps even some non-competitive qualities. In short, bid voters evaluate resumes to select who they perceive to be the most deserving teams.

The ratings, on the other hand, are singular in their purpose, and they are calibrated to maximize their accuracy at answering one question: who would win? A team's rating number is nothing more than an expression of a relative win probability. Take two ratings, feed them into the algorithm, and they will spit out the odds that one team or the other will win a given debate. For example, a team with a rating of 25 is expected to beat teams with ratings of 21 roughly three out of four times. This is pretty much the beginning and end of what the ratings do.
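
To make that concrete, here is a minimal sketch of the conversion, assuming a logistic curve with a scale constant chosen so that a 4-point gap works out to roughly 3:1 odds. The site's actual formula and constant aren't shown here, so treat the numbers as illustrative.

```python
# Assumed scale: about 8.4 rating points per tenfold change in odds, picked so
# that a 4-point advantage comes out near 3:1 (75%). The actual constant used
# by the rating system may differ; this is only an illustration of the idea.
SCALE = 8.4

def win_probability(rating_a, rating_b):
    """Probability that team A beats team B, given the two ratings."""
    diff = rating_a - rating_b
    return 1.0 / (1.0 + 10.0 ** (-diff / SCALE))

# The example from above: a 25-rated team against a 21-rated team.
print(round(win_probability(25, 21), 2))  # ~0.75, i.e. about three out of four
```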

Nevertheless, I think there is value in thinking about how the ratings relate to the bid process. I do hope that the ratings can be a useful tool for voters - one metric among the many that they may consider. Furthermore, even though they aren't in any way calibrated to replicate the bid vote, the bid vote remains something of an external check on their validity. We can think of voters as something like a proxy for the collective opinion of the community (with all of the attendant problems of representation). If the ratings don't tend to correlate with bid outcomes, then there would perhaps be reason to question their usefulness (or, I suppose, the bid process itself). 

Toward that end, this blog post shares some data concerning how well the ratings match up with the bid votes. The short version is that they're not perfect, but they do pretty well. The ratings are well within the range of error that we find among human voters. 

Method

I collected the first and second round bid votes for each season stretching back to 2012-13 (the first year in my ratings data set). For each season, I compared each individual voter's rankings against the aggregate average of all of the voters, giving me the "error" of each voter (using RMSE for those of you who are interested). Then I created hypothetical "ballots" for how the computer ratings would have voted in each race and found their error as well. Next I calculated the average amount of error among voters and how much each voter performed above or below average. Finally, I averaged each voter's performance over the course of the past 6 years, using standard deviation to normalize the data across seasons.
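
For anyone who wants the computation spelled out, here is a rough sketch of that procedure in Python. The data layout (ballots as lists of ranks over the same slate of teams) is invented for illustration; the real inputs are the published bid votes.

```python
import numpy as np

def rmse_against_consensus(human_ballots, computer_ballot):
    """RMSE of each ballot against the aggregate average of the human voters.

    Each ballot is a list of ranks for the same slate of teams.
    """
    humans = np.array(list(human_ballots.values()), dtype=float)
    consensus = humans.mean(axis=0)  # aggregate average ranking of the voters

    def rmse(ballot):
        diffs = np.asarray(ballot, dtype=float) - consensus
        return float(np.sqrt((diffs ** 2).mean()))

    errors = {name: rmse(b) for name, b in human_ballots.items()}
    errors["computer ratings"] = rmse(computer_ballot)  # hypothetical "ballot"
    return errors

def standardized(errors):
    """Each error expressed in standard deviations above/below the voter mean,
    which is how the seasons are normalized before averaging across years."""
    human = [e for name, e in errors.items() if name != "computer ratings"]
    mu, sigma = np.mean(human), np.std(human)
    return {name: (e - mu) / sigma for name, e in errors.items()}
```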

Results

First Round Bids

Across all voters, the mean error for first round ballots was 1.472. Perhaps this is an oversimplification, but one way to think about this is that voters were on average off in each of their rankings by 1.472 slots (weighting to penalize larger misses more). By contrast, the computer ratings had an error of 1.759, meaning that they performed slightly worse than the average voter. However, they were still within the overall range of human error, ranking 17th out of the 21 voters in the data set -- 0.559 standard deviations below average.

Although counting "hits" and "misses" isn't a very good metric for evaluating accuracy, it's still kind of interesting to look at. The ratings have correctly chosen 15 of the 16 first round recipients in each of the last six years, missing one each year. The average among human voters is 15.381 correct picks out of 16.

Second Round Bids

In contrast to the first round data, the computer rating system performed slightly above average in its second round rankings. The mean error among voters was 3.993, while the average error of the ratings was only 3.742. The ratings were the 8th most accurate out of the 21 voters, coming out 0.359 standard deviations better than average.

I didn't calculate hits/misses for second round bids because of the complications introduced by third teams.

Final Thoughts

I went into this assuming that the ratings would do better with first round bids than with second rounds. There's generally more data on first round teams, and there is greater separation between teams at the top. In contrast, teams in the middle of the pack tend to group together without much differentiation. I had assumed that the ratings would struggle more with the small differences found at the peak of the bell-shaped curve.

In a strict sense, the computer ratings were more accurate with first rounds. The error for the ratings in the first round votes was less than half what it was for the second round votes. However, their performance relative to human voters flipped around.

I can only speculate why this might be the case. It's possible that factors that exist outside the strict Ws and Ls of debaters' ballots play more of a role in first round voting ("elim depth" and narrower head-to-head comparisons come to mind as possibilities). Similarly, it's possible that the amount and/or type of data available for the second rounds just doesn't produce as clear a hierarchy for human voters to identify, and so the ability of the ratings to assimilate a large amount of information allows them to gain ground on the humans.

All told, the ratings seem to be a reasonable indicator for bid vote outcomes. They can't be taken as determinative, and there are certainly occasions when they are significantly off about a team (which is also true of human voters). Nevertheless, they have been pretty squarely within the range of error displayed by human voters.

Debate Ratings - Fall 2018

12/2/2018

Ratings for Fall 2018 are posted.

There is one change from last year. Since we are still early in the season, I've cut the list down to 75 teams from 100. At the next update, it will return to 100. The reason for the change is that the actual rating of the 75th team right now much more closely resembles the rating that will be necessary to be in the top 100 later in the season. Because many teams are currently excluded from the list for insufficient data, a top-100 list right now would include teams that will end up *significantly* outside the top 100. For perspective, the team that would be listed at 100 would actually sit at 161 if there were no minimum round threshold.

I can see an argument that I should leave them in, since it gives them some valuable recognition, but I also think there's an argument that it creates unrealistic expectations, misunderstanding, and disappointment at the next update when they almost inevitably fall out of the rankings - even if they improve their performance in the second semester.

Some reminders about the ratings:
  1. If you feel strongly that you belong on this list but don't see your name, it may be that the ratings simply don't have enough data on you.  The deviation for your rating has to be below 2.0 to be listed (which roughly amounts to around 24 rounds).
  2. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  3. The ratings are determined by nothing more than the head-to-head outcomes of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye-test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it (a rough sketch of this kind of update appears at the end of this post).  If you want all the gory details, follow the link given below.
  4. The quality of the ratings is limited by the quantity and quality of the data available.  It is still early in the season and a whole lot of teams haven't seen one another.  The geographic split in tournament travel makes things even more complicated.  Teams listed high or low right now might see considerable changes in their rankings over the course of the season.  It is entirely possible (even certain) that there are teams that have not performed in a way that's consistent with how good they "really are."
  5. For a more detailed description of how the ratings are calculated, there are a number of posts in the archives that explain the process.  In particular, this post will be helpful.
  6. If you are attentive to the rating number - not just the ranking - it will help you to understand that even large differences in ranking might not amount to very much difference between teams.  For example, teams that might be separated by 10 or even 20 ranking spots might only be separated by a point or two in rating.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1
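
To make point 3 a little more concrete, below is a rough Elo-style sketch of that kind of head-to-head update. The step size, the elim multiplier, and the odds scale are all made-up constants for illustration; the actual system also tracks a rating deviation, which isn't modeled here.

```python
SCALE = 8.4            # assumed rating points per tenfold change in odds (see list above)
BASE_K = 1.0           # made-up step size; the real system's constant isn't shown here
ELIM_MULTIPLIER = 1.5  # elim results move ratings more than prelims (point 3 above)

def expected_win(rating_a, rating_b):
    """Pre-round probability that A beats B, implied by the two ratings."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / SCALE))

def update(winner, loser, elim=False):
    """Return the new (winner, loser) ratings after one head-to-head result.

    The less expected the win, the more both ratings move; elim results
    move them further than prelim results.
    """
    k = BASE_K * (ELIM_MULTIPLIER if elim else 1.0)
    surprise = 1.0 - expected_win(winner, loser)
    return winner + k * surprise, loser - k * surprise

# Example: a 21-rated team upsets a 25-rated team in elims.
print(update(21, 25, elim=True))  # roughly (22.1, 23.9) with these made-up constants
```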
