Speaker Point Inflation: Real or Imagined?

One common request over the past few months has been for an assessment of the effects of the new .1 scale on speaker points in high school debate. With most, if not all, tournaments now using this system (or its 100-point variant), it is possible to look back and analyze how the new scale has impacted speaker point assignment. Four major national circuit tournaments, two from the first semester and two from the second, were included in this study: Greenhill, Glenbrooks, MBA, and Emory. How has the .1 scale affected speaker points at these tournaments? The answer (in graphical form) is below the fold.

[Graphs: Greenhill, Glenbrooks, Montgomery Bell Academy, Emory]

Cumulative Results

The following graph includes speaker point totals from Greenhill, Glenbrooks (normalized to six rounds), MBA, and Emory for the following seasons: 1997-1998, 2001-2002, 2005-2006, 2009-2010, and 2010-2011.

The average total for the top speaker increased minimally from 1997 to 2010 but spiked by nearly a full point in 2010-2011. This trend continued for the 5th, 10th, 15th, and 20th speakers: the totals for each position were higher in 2010-2011 than in any other season. Indeed, the average point total for the 20th speaker in 2010-2011 was the same as the average point total for the 5th speaker in 1997-1998.

To what can we attribute this dramatic increase in speaker points? The clear culprit is the mainstream acceptance of the .1 speaker point scale (instead of the .5 scale that had been the norm since the 1990s). The new scale gives judges more flexibility at the higher end of the spectrum: while few judges in the previous decade were willing to award 29.5 points to even the most exceptional speakers, they have been willing to use the new scale to boost points to 29.1, 29.2, and so on. Whereas a 28.5 used to be the second-best total on most judges' scales, it is now only the sixth or seventh (or even tenth or lower) increment for judges using the .1 scale.
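
To make the increment arithmetic concrete, here is a minimal Python sketch; the 29.1 ceiling is an assumed example of one judge's practical maximum, not a figure taken from the tournament data above.

    # On the old .5 scale, a judge with a practical ceiling of 29 had 28.5 as
    # the second-best value. On the .1 scale, an assumed ceiling of 29.1 puts
    # 28.5 seven increments from the top.
    tenth_scale = [round(29.1 - 0.1 * i, 1) for i in range(7)]
    print(tenth_scale)                  # [29.1, 29.0, ..., 28.6, 28.5]
    print(tenth_scale.index(28.5) + 1)  # 7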

Does it matter? The advent of the .1 system has undoubtedly allowed judges to differentiate their speaker point assignments with greater precision. In the process, however, point values have dramatically inflated. Was this inflation inevitable? Will it continue? Should we care?

Please share your thoughts in the comments.

20 thoughts on “Speaker Point Inflation: Real or Imagined?”

  1. rlevkov

    Obviously these kids are just better.

    Considering the unprecedented lack of flowing that has occurred over the last two years, one must logically conclude: no flowing –> higher points. Or, conversely, flowing –> fewer speaker points.

  2. ellis

    Are there now more increments between TS and others than in the past? If so, it seems like that would resolve the problem of point inflation, since there is wider variation between the top speaker and the 20th speaker in terms of possible totals. In other words, if the difference between 28.5 and 29 on the old scale is one increment, and 28.8 and 28.9 are one increment apart on the new one, then who cares if the scale has shifted?

  3. Jake

    It seems, theoretically, that the .1 scale would allow a kid who would have gotten a 28.5 to get a 28.3 just as easily as the inverse (28.7). It makes sense that points would be higher at the top, because most judges won't give a 29.5, much less a 30, but now they have a greater ability to differentiate, and, since these are top-tier debaters, the most likely movement of the points is upward, not downward. I would be interested to see if the middle of the pool has had an increase in points, or if the majority of kids who were getting 28s are now getting 27.8s instead of 28.2s, etc.

  4. andrew

    I think that the relative shifting of the scale as a whole matters less than the continued (and possibly increased) arbitrary, luck-of-the-draw nature in which points are assigned, as the former would have little effect on seeding or speaker awards. While greater leeway allows judges to be more specific, it also seems to allow the scale to lose consistency from judge to judge. E.g., a speech which three judges would give a 29.5 on the .5 scale could be a 29.3 to one, a 29.6 to another, and a 29.5 to the one who forgets to use the .1 scale at all.

  5. Tucker B.

    If everyone is on the same scale, it shouldn't really affect the individual competition and speakers, should it?

  6. Ross Garrett

    I believe Roy is making fun of the common belief among many debaters that not flowing is a good idea. Many lab leaders instruct students in various methods where you don't flow the whole debate, just parts. However, all the top debaters do indeed flow (and usually very well).

  7. Josh Gonzalez

    New system = better. If speaker points are a statistical measure, then introducing more variance into the system is a good thing. Assuming that 27 was as low as anyone would go under the old system, there were only 7 values that you could theoretically assign to a speaker (27, 27.5, 28, 28.5, 29, 29.5, 30). Even if 28 is the bottom value in the new system (inflation is not nearly that bad, but let's accept it for the sake of argument), you still get 21 possible values you can assign. Allowing finer differentiation means more precision, which is good.
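
    A quick Python check of those counts, as a minimal sketch (the scale endpoints are the ones assumed in the comment above):

        # .5 scale from 27 to 30 vs. .1 scale from 28 to 30
        old_scale = [x / 2 for x in range(54, 61)]     # 27.0, 27.5, ..., 30.0
        new_scale = [x / 10 for x in range(280, 301)]  # 28.0, 28.1, ..., 30.0
        print(len(old_scale), len(new_scale))          # 7 21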

    1. vinay

      Agreed – the new scale ultimately just means that a 27.5 in 2011 does not equal a 27.5 in 2001, which is only really important for drawing comparisons between two debaters in two time periods… but nobody uses speaker points to make those comparisons anyway.

      To be honest, I'm not sure why nobody has advocated a switch to standardized points. Considering that, for college debate at least, all judge and speaker point allocation data is available online and all tournaments are tabulated using fancy computer software anyway, somebody ambitious could write a program that uses debateresults.com to compute, for each debate, how many standard deviations a debater's points are from that particular judge's mean point allocation and assign a z-score. I think this is pretty similar to what the JVAR function in TRPC does, but this would apply it to a much larger sample of each judge's debates; JVAR alone is not super useful if a judge only watches one round per tournament.

      This would make point inflation even less important because you could still make relative comparisons, it would decrease the incentive to pref judges who give great points and vice versa, and it would dramatically improve the accuracy of speaker points, all without forcing judges to switch from the scale they're accustomed to. Just a thought; maybe someone who's better at statistics can back me up/show how horribly, horribly wrong I am.
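
      A minimal Python sketch of that standardization, assuming round-level (judge, debater, points) records like those scraped from debateresults.com; the records here are invented:

          from collections import defaultdict
          from statistics import mean, pstdev

          # Invented round records: (judge, debater, points awarded)
          rounds = [("J1", "A", 28.5), ("J1", "B", 27.0),
                    ("J2", "A", 29.6), ("J2", "B", 29.4)]

          points_by_judge = defaultdict(list)
          for judge, _, pts in rounds:
              points_by_judge[judge].append(pts)

          def z_score(judge, pts):
              # Standard deviations above or below this judge's mean points
              sigma = pstdev(points_by_judge[judge])
              return 0.0 if sigma == 0 else (pts - mean(points_by_judge[judge])) / sigma

          for judge, debater, pts in rounds:
              print(debater, judge, round(z_score(judge, pts), 2))

      Note that J2's compressed points (29.6 vs. 29.4) yield the same z-scores as J1's wide spread, which is exactly the judge-relative comparison being proposed.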

      1. Michael Antonucci

        I've been told Larson has often floated a similar idea. I guess the real barriers are:

        a. necessitates accurate reporting and data sharing – doesn't exist much in high school

        b. might introduce some really weird fluctuations with

        i. judges who are new to the pool – if I judge two rounds and one's a 26 and one's a 29, is that just AUTO-TS for the 29? There's a little bit of random there, and there are a few similar hypothetical examples.

        ii. judges who are jumping circuits and trying to vary their scale by tournament. For example, when I judged more high school, a 29 at the Harvard Round Robin meant more than a 29 at the Bedford Lasagna Luncheon and Citizen Judge Extravaganza. Your system would force me to standardize across tournaments.

        iii. No reset – some judges might just screw it up and go off the scale at some tournament. I definitely did this once when trying to adjust to the 100-point scale. With no take-backs, an early error could distort your business for the whole year.

        None of this is necessarily prohibitive, but maybe merits consideration.

        PS many congrats on Berkeley!!

      2. Ryan Marcus

        Very interesting idea — went ahead and gave it a try on the 09-10 debateresults data.

        These numbers depend on:
        1. The integrity of the debateresults data (not sure how safe an assumption this is)
        2. That judges are consistent from round to round

        Also, I hammered out the code pretty quickly. Here it is: http://pastebin.com/t0TVSEFq

        If I made any obvious errors, feel free to let me know and I'll correct them.

        Here are the results:

        The top 10 speakers by total standardized points
        zak schaller, the sum of whose z-scores is 137.289376022
        josh grace, the sum of whose z-scores is 133.517859625
        william karlson, the sum of whose z-scores is 131.9629748
        evan defilippis, the sum of whose z-scores is 123.665813123
        kenny cauthen, the sum of whose z-scores is 121.394921799
        danny abbas, the sum of whose z-scores is 121.003615396
        drew mcneil, the sum of whose z-scores is 114.477534577
        nicholas rogan, the sum of whose z-scores is 113.822556314
        matt fisher, the sum of whose z-scores is 71.5780030387
        jim schultz, the sum of whose z-scores is 70.4457543523

        The top 10 speakers by total points
        beth mendenhall, the sum of whose points is 8982.0
        emily owens, the sum of whose points is 8976.33333333
        peter sadowski, the sum of whose points is 8659.33333333
        christopher thomas, the sum of whose points is 8532.0
        kristyn russell, the sum of whose points is 8501.66666667
        andy montee, the sum of whose points is 8483.33333333
        john karin, the sum of whose points is 8413.0
        derek ziegler, the sum of whose points is 8375.33333333
        adam james, the sum of whose points is 8335.0
        stefan meneses, the sum of whose points is 8307.0

        The top 10 speakers by average standardized points
        william karlson, whose average is 2.12843507743
        nicholas rogan, whose average is 1.83584768248
        evan defilippis, whose average is 1.69405223456
        kenny cauthen, whose average is 1.59730160262
        zak schaller, whose average is 1.57803880485
        josh grace, whose average is 1.53468804167
        danny abbas, whose average is 1.51254519245
        drew mcneil, whose average is 1.5062833497
        erik johnson, whose average is 1.19846391362
        kathleen nolan, whose average is 1.13029602367

        The top 10 speakers by average points
        chris thomas, whose average is 95.1111111111
        stephen weil, whose average is 95.0048309179
        tony carpentier, whose average is 95.0
        eli jacobs, whose average is 94.9114583333
        james mollison, whose average is 94.6333333333
        jim schultz, whose average is 94.6086956522
        matt fisher, whose average is 94.5043859649
        sal simeone, whose average is 94.4444444444
        molly hart, whose average is 94.4444444444
        kathleen nolan, whose average is 94.4444444444

        Clearly, the method one uses to calculate the rankings can make quite a difference. I realize I'm not doing any comparisons across seasons here; I'm just pointing out that looking at total points or even average points vs. standardized values creates some major differences.

        Outlier reduction is probably also needed. Keep in mind that a debater who gets a 30 (or 100, or 10, whichever scale floats your boat) in front of a judge who normally gives 22s, and then never debates again, is likely to have a very high average in terms of standardized points. Of course, this is true of regular averages as well.

        The Python code should be run with Psyco or PyPy to avoid taking years.

        Input files are Access 2007 exports of the debateresults tables into plain text with comma delimiters.
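
        On the outlier point above, a minimal sketch of one possible fix, a minimum-rounds cutoff applied before ranking; the per-debater z-scores and the cutoff are invented for illustration:

            # Invented per-debater z-scores; in practice these would come from
            # the per-judge standardization described earlier in the thread.
            z_by_debater = {
                "one-round wonder": [3.1],  # a single 30 from a judge who gives 22s
                "regular": [0.9, 1.1, 0.8, 1.2, 1.0, 0.7],
            }

            MIN_ROUNDS = 6  # arbitrary cutoff; drops tiny samples

            for name, zs in sorted(z_by_debater.items()):
                if len(zs) < MIN_ROUNDS:
                    print(name, "- excluded:", len(zs), "round(s)")
                else:
                    print(name, "- average z:", round(sum(zs) / len(zs), 2))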

        1. james_huang

          damn, those results are funky – some of the people on that list only went to one or two tournaments the entire year, and those were just local competitions.
          Why not use all the judging data to calculate the top speakers at just ONE tournament, like the NDT? I think limiting the data set to just the majors would resolve many of these issues as well (fairly constant judge pool, and a fairly constant competitor pool).

    2. Ryan Galloway

      Agree with Josh. While I think people really only use 5-7 Likert-scale increments, people had almost entirely stopped using 27, 29.5, and 30 on the old scale. The new system allows for greater variation, which prevents debaters from getting block 28.5s. It was hard to go back to the .5 scale at the NDT this year, and I felt like more judges were giving too many debaters the exact same speaker points.

      Conflation is bad; inflation is not.

  8. Nathan

    What about a system using the pool vinay described, but only for bid tournaments? That eliminates the Bedford Lasagna issue.

    1. brian rubaie

      This is a smart improvement!

      To perfect it, you could calibrate based on the level of bid: a tournament receiving a finals bid with 30 teams is less difficult than a tournament holding an octos bid with 130 contestants.
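
      One way to sketch that calibration in Python; the bid levels are real TOC categories, but the weights are invented for illustration:

          # Hypothetical difficulty weights by bid level (numbers are invented)
          BID_WEIGHT = {"finals": 0.8, "semis": 0.9, "quarters": 1.0, "octos": 1.1}

          def calibrated_z(z, bid_level):
              # Scale a judge-relative z-score by assumed tournament difficulty
              return z * BID_WEIGHT[bid_level]

          print(round(calibrated_z(1.5, "finals"), 2))  # counts for less
          print(round(calibrated_z(1.5, "octos"), 2))   # counts for more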

    2. Michael Antonucci

      Judges point differently at different bid levels. There's often a big difference between an octos and a finals bid.

      The point system you describe builds TOC-centrism into the structure of competition. Many would object, so I don't think that it would ever realistically be adopted.

      In the abstract, possible flukes are probably a bigger deal than regional/national variation.
