College Debate Ratings

Ratings as Predictor of Bid Votes

12/6/2018

The computer ratings posted on this site are not intended to be predictors of how directors will vote in the first round and second round at-large bid process. Bid voters take into account a variety of criteria, including overall record, travel schedule, head-to-heads against other bid teams, "elim depth," perceived talent, and perhaps even some non-competitive qualities. In short, bid voters evaluate resumes to select who they perceive to be the most deserving teams.

The ratings, on the other hand, are singular in their purpose, and they are calibrated to maximize their accuracy at answering one question: who would win? A team's rating number is nothing more than an expression of a relative win probability. Take two ratings, feed them into the algorithm, and it will spit out the odds that one team or the other will win a given debate. For example, a team with a rating of 25 is expected to beat teams with ratings of 21 roughly three out of four times. This is pretty much the beginning and end of what the ratings do.
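
For the curious, the examples in these posts are consistent with a logistic win-probability curve calibrated so that a 4-point rating gap yields about 3:1 odds. Here is a minimal sketch under that assumption; the function name and scale constant are mine, not necessarily what the site actually uses:

```python
import math

def win_probability(rating_a, rating_b, scale=4 / math.log(3)):
    """Probability that team A beats team B, assuming a logistic curve
    calibrated so that a 4-point rating gap gives roughly 3:1 odds.
    Illustrative only; the site's actual curve may differ."""
    return 1 / (1 + math.exp(-(rating_a - rating_b) / scale))

# The example from the post: a 25 beats a 21 about three times out of four.
print(round(win_probability(25, 21), 2))  # 0.75
```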

Nevertheless, I think there is value in thinking about how the ratings relate to the bid process. I do hope that the ratings can be a useful tool for voters - one metric among the many that they may consider. Furthermore, even though they aren't in any way calibrated to replicate the bid vote, the bid vote remains something of an external check on their validity. We can think of voters as something like a proxy for the collective opinion of the community (with all of the attendant problems of representation). If the ratings don't tend to correlate with bid outcomes, then there would perhaps be reason to question their usefulness (or, I suppose, the bid process itself). 

Toward that end, this blog post shares some data concerning how well the ratings match up with the bid votes. The short version is that they're not perfect, but they do pretty well. The ratings are well within the range of error that we find among human voters. 

Method

I collected the first and second round bid votes for each season stretching back to 2012-13 (the first year in my ratings data set). For each season, I compared each individual voter's rankings against the aggregate average of all of the voters, giving me the "error" of each voter (using RMSE for those of you who are interested). Then I created hypothetical "ballots" for how the computer ratings would have voted in each race and found their error as well. Next I calculated the average amount of error among voters and how much each voter performed above or below average. Finally, I averaged each voter's performance over the course of the past six years, using standard deviation to normalize the data across seasons.
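
As a rough sketch of this method (the function names and data shapes here are my own, assumed for illustration; the actual analysis wasn't published as code):

```python
import numpy as np

def ballot_error(voter_ranks, consensus_ranks):
    """RMSE between one voter's rankings and the aggregate average
    ranking across all voters, per the method described above."""
    d = np.asarray(voter_ranks, dtype=float) - np.asarray(consensus_ranks, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def season_scores(ballots, consensus_ranks):
    """Each voter's error expressed in standard deviations above or
    below the mean voter error, so seasons can be averaged together.
    ballots: mapping of voter name -> list of ranks, team by team."""
    errors = {v: ballot_error(r, consensus_ranks) for v, r in ballots.items()}
    vals = np.array(list(errors.values()))
    mean, sd = vals.mean(), vals.std()
    # Negative scores = less error (more accurate) than the average voter.
    return {v: (e - mean) / sd for v, e in errors.items()}
```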

Results

First Round Bids

Across all voters, the mean error for first round ballots was 1.472. Perhaps this is an oversimplification, but one way to think about this is that voters were on average off in each of their rankings by 1.472 slots (weighting to penalize larger misses more). By contrast, the computer ratings had an error of 1.759, meaning that they performed slightly worse than the average voter. However, they were still within the overall range of human error, ranking 17th out of the 21 voters in the data set -- 0.559 standard deviations below average.

Although counting "hits" and "misses" isn't a very good metric for evaluating accuracy, it's still kind of interesting to look at. The ratings have correctly chosen 15 of the 16 first round recipients in each of the last six years, missing one each year. The average hit-rate among human voters is 15.381.

Second Round Bids

In contrast to the first round data, the computer rating system performed slightly above average in its second round rankings. The mean error among voters was 3.993, while the average error of the ratings was only 3.742. The ratings were the 8th most accurate out of the 21 voters, coming out 0.359 standard deviations better than average.

I didn't calculate hits/misses for second round bids because of the complications introduced by third teams.

Final Thoughts

I went into this assuming that the ratings would do better with first round bids than with second rounds. There's generally more data on first round teams, and there is greater separation between teams at the top. In contrast, teams in the middle of the pack tend to group together without much differentiation. I had assumed that the ratings would struggle more with the small differences found in the peak of the bell-shaped curve.

In a strict sense, the computer ratings were more accurate with first rounds. The error for the ratings in the first round votes was less than half what it was for the second round votes. However, their performance relative to human voters flipped around.

I can only speculate about why this might be the case. It's possible that factors that exist outside the strict Ws and Ls of debaters' ballots play more of a role in first round voting ("elim depth" and narrower head-to-head comparisons come to mind as possibilities). Similarly, it's possible that the amount and/or type of data available for the second rounds just doesn't produce as clear of a hierarchy for human voters to identify, and so the ability of the ratings to assimilate a large amount of information allows them to gain ground on the humans.

All told, the ratings seem to be a reasonable indicator for bid vote outcomes. They can't be taken as determinative, and there are certainly occasions when they are significantly off about a team (which is also true of human voters). Nevertheless, they have been pretty squarely within the range of error displayed by human voters.

Debate Ratings - Fall 2018

12/2/2018

Ratings for Fall 2018 are posted.

There is one change from last year. Since we are still early in the season, I've cut the list down to 75 teams from 100. At the next update, it will return to 100. The reason for the change is that the actual rating of the 75th team right now will much more closely resemble the rating necessary to be in the top 100 later in the season. Because many teams are currently excluded from the list for insufficient data, teams that will be *significantly* outside the top 100 would otherwise end up getting listed. For perspective, the team that would be listed at 100 would actually be at 161 if there were no minimum round threshold.

I can see an argument that I should leave them in since it gives them some valuable recognition, but I also think there's an argument that it creates unrealistic expectations, misunderstanding, and disappointment at the next update, when they almost inevitably fall out of the rankings - especially if they actually improve their performance in the second semester.

Some reminders about the ratings:
  1. If you feel strongly that you belong on this list but don't see your name, it may be that the ratings don't have enough data on you.  The deviation for your rating has to be below 2.0 to be listed (which amounts to roughly 24 rounds).
  2. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  3. The ratings are determined by nothing more than the head-to-head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it (a toy sketch of this style of update appears after the odds list below).  If you want all the gory details, follow the link given below.
  4. The quality of the ratings is limited by the quantity and quality of the data available.  It is still early in the season and a whole lot of teams haven't seen one another.  The geographic split in tournament travel makes things even more complicated.  Teams listed high or low right now might see considerable changes in their rankings over the course of the season.  It is entirely possible (even certain) that there are teams that have not performed in a way that's consistent with how good they "really are."
  5. For a more detailed description of how the ratings are calculated, there are a number of posts in the archives that explain the process.  In particular, this post will be helpful.
  6. If you are attentive to the rating number - not just the ranking - it will help you understand that even large differences in ranking might not amount to much difference between teams.  For example, teams separated by 10 or even 20 ranking spots might only be separated by a point or two in rating.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1
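
The update formula itself isn't spelled out here, but the behavior described in reminder 3 - winner up, loser down, bigger swings for upsets, extra weight for elims - can be illustrated with a generic Elo-style update. This is a sketch of that general technique, not the site's actual algorithm; the curve is the illustrative logistic implied by the odds above, and every constant is invented:

```python
import math

def win_probability(r_a, r_b, scale=4 / math.log(3)):
    """Illustrative logistic curve matching the odds list above:
    a 4-point gap is about 3:1, an 8-point gap about 9:1."""
    return 1 / (1 + math.exp(-(r_a - r_b) / scale))

def update(winner, loser, k=1.0, is_elim=False, elim_weight=1.5):
    """Generic Elo-style update: ratings move in proportion to how
    surprising the result was, with elim wins weighted more heavily.
    k and elim_weight are invented for illustration."""
    surprise = 1 - win_probability(winner, loser)  # small if the win was expected
    delta = k * (elim_weight if is_elim else 1.0) * surprise
    return winner + delta, loser - delta

# An elim upset moves both ratings the most: the 21-rated team beats the 25.
print(update(21, 25, is_elim=True))  # (22.125, 23.875)
```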

Debate Ratings - Winter 2018

2/6/2018

Ratings for the second semester have been updated. They reflect the entire 2017-18 regular season. It has been a really interesting season, and congratulations to everybody!

I will emphasize that to get the most insight from the ratings, one should not dwell too much on the ordinal ranking of teams. Attention to the often minimal differences in team rating will show that teams separated by several ranks are in fact expected to be at functionally 50/50 odds in a head-to-head matchup. Sometimes these very small differences can be as much due to the randomness of their draws or differences in travel schedule as anything else.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1

Reminders about the ratings:
  1. If you feel strongly that you belong on this list but don't see your name, it may be that the ratings don't have enough data on you.  The deviation for your rating has to be below 2.0 to be listed (which amounts to roughly 24 rounds).
  2. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  3. The ratings are determined by nothing more than the head to head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it.  If you want all the gory details, follow the link given below.
  4. For a more detailed description of how the ratings are calculated, there are a number of posts in the archives that explain the process.  In particular, this post will be helpful.

Debate Ratings - Fall 2017

11/14/2017

Team ratings for the fall semester are now up. Congrats to everybody who is out there doing work and making it happen, especially those of you who are less experienced or who may not be listed on these rankings. Debate is hard, and the biggest hurdle is just showing up. I have personally seen a number of excellent debates this season and am looking forward to next semester.

Some reminders about the ratings:
  1. If you feel strongly that you belong on this list but don't see your name, it may be that the ratings don't have enough data on you.  The deviation for your rating has to be below 2.0 to be listed (which amounts to roughly 24 rounds).
  2. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  3. The ratings are determined by nothing more than the head to head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it.  If you want all the gory details, follow the link given below.
  4. The quality of the ratings is limited by the quantity and quality of the data available.  It is still early in the season and a whole lot of teams haven't seen one another.  The geographic split in tournament travel makes things even more complicated.  Teams listed high or low right now might see considerable changes in their rankings over the course of the season.  It is entirely possible (even certain) that there are teams that have not performed in a way that's consistent with how good they "really are."
  5. For a more detailed description of how the ratings are calculated, there are a number of posts in the archives that explain the process.  In particular, this post will be helpful.
  6. If you are attentive to the rating number - not just the ranking - it will help you understand that even large differences in ranking might not amount to much difference between teams.  For example, teams separated by 10 or even 20 ranking spots might only be separated by a point or two in rating.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1



More on Speaker Points

7/2/2017

I have posted more data on how judges in the open division assign speaker points (which can be found under the "Speaker Points" tab above).  Now there are tables going back to the 2014-15 season.

Perhaps more importantly, I have also posted a table of weighted averages that combines the data on each judge over the past three years to show what their average points have been and how they have trended this year (or the most recent year for which we have data on them).  This table progressively weights the data so that more recent seasons count more than previous ones.
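
The exact weights aren't stated in the post; as an illustration of the idea, a progressively weighted average might look like the following (the linear weighting scheme is my assumption):

```python
def weighted_average(season_medians):
    """Progressively weighted average of a judge's points across seasons,
    oldest first, so that more recent seasons count more.  The linear
    1..n weights are an assumption for illustration."""
    weights = range(1, len(season_medians) + 1)
    total = sum(w * p for w, p in zip(weights, season_medians))
    return total / sum(weights)

# A judge whose median points rose over three seasons:
print(round(weighted_average([28.4, 28.5, 28.7]), 2))  # 28.58, tilted recent
```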

For a more detailed description of how to read the tables, refer to my previous post from April.

Judge Data on Speaker Points

4/2/2017


Normalizing Points

I have posted a table breaking down how each judge in the community distributes speaker points under the "Speaker Points" tab.  

My goal is to provide people more information about how speaker points get assigned so that hopefully we can all make more informed decisions.  My hope is that this information is not used to single out any particular judge or judges for criticism.  Instead, it is an attempt to make a relatively opaque process somewhat more transparent.  Assigning speaker points is not an exact science.  Nor is it completely arbitrary or capricious.  Hopefully, judges can use this information to better understand how their points relate to the community at large.

To be clear, I do not believe that there is any such thing as "correct" points.  Similarly, there is no single rubric for what counts as a good speaker.  Every judge values different things about a speaker, and that should be celebrated.  Furthermore, beyond random variance, there may be a good reason for a judge's points to diverge if they value qualities in speakers that are disproportionately undervalued by the rest of the community.  The aim in normalizing point distributions is not to get everybody to agree about what counts as a good speaker.  Rather, it is to get everybody to use a common language in scoring.  We may disagree about what "good" means, but for speaker points to work, we need to know that when I think a speech is good, I'm giving similar points to the ones you give when you think a speech is good.

I apologize that the table is not necessarily presented in a format that is super easy to understand without some basic knowledge of statistics, but there is a glossary that defines each of the categories.  Furthermore, to help clarify, I will work through my own line as an example.

An Example

Here is my breakdown:

[Image: my row from the speaker points table]
I have judged 27 rounds this year that are included in the sample, and in those rounds the median point value that I have assigned is a 28.5.  As a point of reference, 28.5 is also the median point value assigned by judges across all debates, so at first glance my average points seem spot on with the community.

However, we can see that I give slightly below average points by looking at the "Deb Med" and "Med Diff" columns.  "Deb Med" (or Debater Median) is the median points that the debaters I have judged have received in all of their debates over the course of the year -- 28.6, meaning that they were slightly above average speakers.  "Med Diff" (Median Difference) is the typical amount by which my points deviate from the points that those I judge usually receive.  I have a -0.1 median difference, which means that on average, I give a tenth of a point less than what everybody else gives the same debaters.  Median difference is the simplest way to see if your points tend to deviate from average and by how much.

The next two columns ("< Med" and "> Med") go together.  They express the percentage of the time that you give points that are below ("< Med") or above ("> Med") the median points of those you judge.  Ideally, these two numbers would be equal, meaning that you give out below average points as often as you give out above average points.  However, we can see that my split is not even.  I give out below average points 67% of the time and above average points only 17% of the time.  This is consistent with what we would expect from the fact that my Median Difference is also negative.

The final four columns all go together and point to how often the judge gives points that significantly deviate from a debater's average ("SD" meaning Standard Deviation).  To be clear, we should expect this to happen.  Debaters are not robots.  They perform inconsistently, and different judges value different things in a speaker.  However, if there is a large and consistent skew toward the positive or negative, then a judge might consider whether their points are not in tune with community norms for what points generally mean.

Under the "-2 SD" column, I have a 2.3, which means that 2.3% of the time I give points that are more than 2 standard deviations worse than what those debaters usually receive.  10.3% of the time I give points that are between 1 and 2 standard deviations worse, and 2.3% of the time I give points that are between 1 and 2 standard deviations better.  I never gave points that were more than 2 standard deviations better.

To help concretize what this means a bit, points that are outside of 2 standard deviations are at the extremes, basically what we would expect to be the highest or lowest ~2% of points that that debater will receive over the course of the year.  Points that are more than 1 standard deviation out are about the highest/lowest ~16%.
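
For readers who want to see the mechanics, here is a rough sketch of how columns like these could be computed from raw rounds data.  The function names and data shapes are mine, assumed for illustration; the actual pipeline behind the tables isn't published:

```python
import numpy as np

def judge_breakdown(judge_scores, debater_stats):
    """Sketch of the table columns described above.
    judge_scores: list of (debater_id, points) pairs the judge assigned.
    debater_stats: debater_id -> (median, std) over all of that debater's
    rounds this season.  All names here are illustrative."""
    points = np.array([p for _, p in judge_scores], dtype=float)
    medians = np.array([debater_stats[d][0] for d, _ in judge_scores])
    stds = np.array([debater_stats[d][1] for d, _ in judge_scores])

    diffs = points - medians  # the judge's points vs. what debaters usually get
    z = diffs / stds          # the same deviations expressed in SD units

    def pct(mask):
        return float(np.mean(mask)) * 100

    return {
        "Median": float(np.median(points)),
        "Deb Med": float(np.median(medians)),
        "Med Diff": float(np.median(diffs)),
        "< Med": pct(diffs < 0),              # % of scores below debater median
        "> Med": pct(diffs > 0),
        "-2 SD": pct(z < -2),                 # extreme low scores
        "-1 SD": pct((z >= -2) & (z < -1)),
        "+1 SD": pct((z > 1) & (z <= 2)),
        "+2 SD": pct(z > 2),                  # extreme high scores
    }
```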

In sum, I gave out slightly low points, but I should be able to address that with a fairly minor correction.

Debate Ratings - 2016-17 Final

4/1/2017

The final ratings for the 2016-17 season have been posted.

Congratulations to everybody on a great season!  In particular, congrats to Rutgers for their incredible performances at CEDA and the NDT, and also to Harvard for their ridiculous consistency on the way to the Copeland Award.  More than that, however, congratulations to every debater that suited up and made it to a tournament.  I was fortunate enough to catch a bunch of fantastic debates this year.  I have been consistently excited by the quality of important and provocative scholarship that I have had the great fortune to witness being explored by so many of you.

This is the last set of ratings for the season.  At some point during the summer, I will try to take stock of the current state of the ratings.  Feel free to contact me directly if you have questions, concerns, or suggestions.  I try to be as transparent and forthcoming as possible.

As usual, disclaimers:
  1. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  2. The ratings are determined by nothing more than the head to head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1

Debate Ratings - 2016-17 Regular Season

2/7/2017

These are the final ratings for the 2016-17 regular season, with one caveat.  I had to manually enter the results for both the Dartmouth and Pittsburgh round robins.  As far as I know, I did so without error, but the calculation will be rerun when the results from those tournaments are downloadable from tabroom.

Also, I am aware that Kansas's Robinson is listed twice (as are others potentially).  This is because she had enough rounds with two separate partners to be listed, and I don't really want to be in the business of keeping up with all of the partnership changes that happen on every team.

Disclaimers:
  1. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  2. The ratings are determined by nothing more than the head to head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1

Debate Ratings - Post Swing 2017

1/11/2017

Ratings for tournaments through the holiday swing are posted.

Not a lot to say this time other than that there is still a lot of room for movement at the last couple of tournaments of the regular season.  Only about a point and a half separates the teams in the top five.  This is also about the same distance separating #11 from #18.  An extra win or two over quality opponents, particularly in elims, could have a big impact for these teams.

Disclaimers:
  1. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  2. The ratings are determined by nothing more than the head to head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1

Debate Ratings - First Semester 2016-17

12/7/2016

The final ratings for the first semester of the 2016-17 season are posted.

One unforeseen consequence of posting the previous edition of the ratings when there were so many quality teams without the required number of rounds to be listed is that it artificially inflated the rankings of a lot of teams.  As a result, many teams that were previously ranked in the top 50 dropped a number of spots without actually performing any worse.  They were just bumped down as new teams were added to the list.  In the future, I'll have to consider whether it might be better to just wait until the end of the first semester for the first release.

I wanted to wait until the coaches' poll was out to post the new ratings.  I will refrain from commenting in any detail about specific teams, but it is interesting to think about the differences in where some teams are ranked.  I doubt that there is a single factor that can explain all of the instances where there is divergence between the computer rating and the human poll.  However, if I were to make a couple of guesses about what might be at work, I think the following might be relevant:
  1. It is possible that human voters are more likely to think in terms of team performance as a "resume" or "body of work."  Thus, teams that the computer ratings like because of strong head-to-head results might be disadvantaged if they have been to fewer tournaments (or less prestigious tournaments overall).
  2. It may be possible that human voters are more likely to value "elim depth" with less regard for the specific opponents that teams defeated (or lost to).  The computer ratings do give extra weight to elim wins, but what matters is *who* a team competes against in elims rather than which round they made it to.  Thus, the algorithm might be more impressed with a team that took down two highly rated opponents and dropped in quarters than a team that had an easy draw to semis.
  3. For teams with fewer rounds than average, it is possible that there could be a moderately outsized recency effect in their results.  Less data makes a team's rating more volatile, which means that it can move up (or down) more quickly.
  4. It might be the case that in some instances the computer algorithm could be less forgiving of teams with inconsistent results.  While this was only a quick dive into the data (and there are not very many data points to compare), it appeared at first glance that teams that possessed both a high rate of error (performed against expectation more often) and a large number of total rounds (which should tend to reduce error) performed slightly worse in the computer ratings versus the human poll.  Just or Unjust?  You decide.
  5. Finally, UMKC.  Pretty much down the line, the human poll valued success at the UMKC tournament less than the computer did.

I hope to get my hands on the raw data from the coaches' ballots to see how much consensus/dissensus there was among the voters.  It could be useful to evaluate whether the divergence that we see with the computer rankings is within the range of human disagreement internal to the poll itself.

The usual disclaimers:
  1. These are not my personal opinions.  The algorithm is set and runs autonomously from how I may personally feel about teams.  I do not put my finger on the scale.
  2. The ratings are determined by nothing more than the head to head outcome of debate rounds.  No preconceptions about which schools or debaters are good, no weighting for perceived quality of tournaments, no eye test adjustments.  If you beat somebody, your rating goes up and theirs goes down.  If you beat somebody with a much higher rating, it goes up more.  If you beat them in elims, it will go up by more than if you do so in prelims.  That's it.
  3. It is still early in the season, so the ratings are subject to a fair amount of volatility, especially for teams with fewer rounds.  They grow more stable over time.

For a sense of what the ratings number actually means:
  • A 1 point ratings advantage translates roughly into 5:4 expected odds,
  • 2 points is about 3:2
  • 3 points is about 2:1
  • 4 points is about 3:1
  • 5 points is about 4:1
  • 8 points is about 9:1