Kathi-Sue Rupp

Judging Noise

As we watch the Tokyo Olympics from afar, there is sure to be discussion about the judging. Every score will be scrutinized and analyzed by officials and fans alike. They will look for judges showing favoritism to gymnasts based on the country they represent (Nationality Bias), their fame as a gymnast (Reputation Bias), or when they compete relative to other gymnasts (Order Biases). But while bias gets a lot of attention, I would like to talk about a lesser-known cause of inaccurate judging: noise.



What is Noise?


Noise is a concept familiar to statisticians. It is a random variation in evaluations by different people, or at different times. Noise is everywhere and just as common as bias. Furthermore, noise can cause inaccuracies as much as biases do. The difference is that biases have a systematic skew in a certain direction, but noise is a random variation. There isn't a strong pattern to noise like there is with a bias.
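To make the distinction concrete, here is a minimal sketch in Python. It assumes we somehow know the "true" score of a routine, which in practice we never do exactly, and it follows one common convention: bias is the average error of the panel, and noise is the spread of the errors around that average. The function name and the sample scores are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev

def bias_and_noise(scores, true_score):
    """Split a panel's errors into a systematic part and a random part.

    Assumption: true_score is known, which is only possible in a thought
    experiment or after careful video review.
    """
    errors = [s - true_score for s in scores]
    bias = mean(errors)     # average error: the systematic skew
    noise = pstdev(errors)  # spread of the errors: the random scatter
    return bias, noise

# Hypothetical panels judging a routine whose "true" score is 8.5.
biased_panel = [8.7, 8.7, 8.7, 8.7]  # everyone is high by the same amount
noisy_panel = [8.3, 8.7, 8.4, 8.6]   # errors scatter in both directions

print(bias_and_noise(biased_panel, 8.5))  # bias about 0.2, noise about 0
print(bias_and_noise(noisy_panel, 8.5))   # bias about 0, noise about 0.16
```

Both panels contain inaccurate individual scores, but only the first one leans in a direction.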


We pay attention to bias because the most egregious biases are conscious and intentional. Some are forms of cheating and bring about unjust results. Even unconscious biases (e.g. memory bias or order bias) can affect the rankings and send the wrong person home with the gold medal. But noise can be equally responsible for inaccurate scores. So let’s take a look at how it affects gymnastics judging.


Noisy People


Human brains have evolved to discern patterns. We even try to find patterns where they do not exist. As such, we are drawn to finding patterns in judges' scores and speculate as to their cause. We like simple, quick rationales for events to fit into patterns we have seen before. Similarly, we try to make even random variations fit in with the patterns we know to exist. But sometimes the random variations are exactly that, random. Those random variations are noise. Noise is all around us and is particularly prominent in predictions and judgments. But because noise is less explainable (and less sensational than its often scandalous cousin, bias), we don't take notice of it, or even realize it occurs.


Judges are human, and humans make mistakes. Just as the gymnasts have errors in their routines, the judges have errors in some of their evaluations of those routines. Hopefully those errors will be few, as the judges at the Olympics are ranked as the best in the world, but judging is a complicated process and there are bound to be occasional errors. That is one reason why there are multiple judges performing each judging function. It reduces the odds of having a random error affect the competition outcome. Even with these measures, though, noisy variations in judgment still occur.


To counter the possibility of judging noise affecting competition results, the highest and lowest execution scores are dropped and only the middle scores are averaged to calculate the gymnast’s final score for a routine. It is likely that individual random errors will be outlier scores and therefore not counted. Eliminating the high and low outliers also helps to eliminate biased scores by individual judges. Of course, judges aim to be among the counting middle scores contributing to the gymnast's final score. Consequently, when a judge notices that their score is off from the panel average, the natural reflex is to make an adjustment on the next routine so that the subsequent score will be more in line with the anticipated panel average. The subsequent bouncing of scores as the judge tries to calibrate to the panel isn’t a bias; the bounce is noise. (This type of score bounce frequently happens with less experienced judges, and still occasionally occurs even at higher levels.) However, we are not just noisy as individuals; we are noisy as groups.
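The drop-the-extremes calculation described above can be sketched in a few lines of Python. This is an illustration only: the function name, the five-judge panel, and the scores are hypothetical, and the real panel size depends on the competition.

```python
def final_execution_score(scores):
    """Average the middle scores after dropping one highest and one lowest.

    Assumption: one score per judge, with at least three judges on the panel.
    """
    if len(scores) < 3:
        raise ValueError("need at least three scores to drop the high and low")
    ordered = sorted(scores)
    counting = ordered[1:-1]  # discard the single lowest and single highest
    return sum(counting) / len(counting)

# Hypothetical panel: four judges agree closely, one has a noisy low score.
panel = [8.4, 8.5, 8.6, 8.5, 7.6]
print(round(final_execution_score(panel), 3))  # the 7.6 and 8.6 do not count
print(round(sum(panel) / len(panel), 3))       # a plain average is pulled down
```

With these numbers the counting scores are 8.4, 8.5, and 8.5, so the single noisy 7.6 has no effect on the result, whereas a plain average of all five scores would be dragged noticeably lower.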


Group Noise


It is well-researched that bias can occur at both the individual and group level. Biases such as reputation bias or within-team order bias play on human nature. From a young age, humans are taught to anticipate future events and outcomes (e.g. if you touch that hot stove, you will get burned). So when a judge sees a well-known gymnast perform, they may have expectations for how well that gymnast will perform, which influences what the judge observes in the performance (i.e., reputation bias). Similarly, if a judge has reason to expect that the gymnasts have been placed in an order from poorest to best, their expectation for superior performance can influence how they score what they observe (i.e., within-team order bias). These types of bias are common and generally unintentional, and while not every judge may fall prey to them, many do. Hence, we sometimes see the score from an entire panel of judges skew towards some of these biases. That is what bias is: a variation which is inclined to go in a particular direction. Because biases play on human nature, they can occur in groups.


Just as bias can occur on an individual and group level, so can noise. Qualifying sessions of large competitions are quite lengthy. As the day goes on the judges get tired, and with fatigue comes a greater possibility for errors. (Certain biases are also more prevalent in fatigued states.) Multiple judges making multiple errors as they all grow weary and lose the ability to concentrate can cause inaccurate scores. The tired judges won't necessarily make the same errors, but the panel becomes noisy with errors as a group.


Another example is how accurately judges can discern the angle of certain gymnastics elements from where they are seated. Each judge has a slightly different point of view. What looks like 29° from one judge's chair might look like 31° from another, causing a difference of 0.2 in the deductions they take for the same error. Then take it one step further: let's say that two judges on one side see an angle as 29° and two judges on the other side see it as 31°, and that this happens twice in the course of the routine. Two of the judges will be 0.4, nearly half a point, off from the other two judges from this noisy error. If these judges then have a noisy bounce on the subsequent routine, like the one described above, each will compensate for having been too high or too low, and the high and low judges trade places on the next routine. Many judges are probably nodding their heads right now, because nearly every judge has seen this happen. These are common judging occurrences. They are reasonable human errors. It isn't bias. It is noise, and it can cause inaccurate scores just as much as bias can.
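The arithmetic behind the viewing-angle example can be sketched as follows. The 30° cutoff and the 0.1 and 0.3 deduction values are assumptions picked only so that the two views differ by 0.2 per element, as in the example above; they are not quoted from any rulebook.

```python
# Hypothetical values: a cutoff angle and the deduction on either side of it.
CUTOFF_DEG = 30
SMALLER_DEDUCTION = 0.1
LARGER_DEDUCTION = 0.3

def angle_deduction(perceived_angle_deg):
    """Deduction a judge takes for one element, based on the angle they see."""
    if perceived_angle_deg < CUTOFF_DEG:
        return SMALLER_DEDUCTION
    return LARGER_DEDUCTION

def total_deduction(perceived_angles):
    """Sum of the angle deductions over all elements in one routine."""
    return sum(angle_deduction(a) for a in perceived_angles)

# The same two elements, seen from opposite sides of the apparatus.
near_side = total_deduction([29, 29])
far_side = total_deduction([31, 31])
print(round(far_side - near_side, 2))  # gap between the two pairs of judges
```

A two-degree difference in what each judge perceives, repeated twice, opens a 0.4 gap between judges who are all applying the same rule honestly.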


Just Noise


As much as bias and noise can lead to incorrect scores, there are also instances on judging panels when a noisy outlier score isn’t an error, but is the score that is the most correct. When bias is occurring on a panel of execution judges at a group level, an individual judge who does not succumb to the bias may indeed have the most accurate score of the group. To illustrate how this can occur, let’s consider conformity bias. Conformity bias is the tendency to align one’s behavior with that of the people around them. In judging we see this on a panel of E judges as judges adapt the score they think is most accurate in order to stay in line with the panel average. Judges are evaluated by how close their scores are to the panel average; therefore, as mentioned above, most judges try to avoid being among the outlier scores, which will be discarded.


In general, for experienced judges, the average score for the average routine is usually reliable and correct. However, when there are outlier routines, those that are exceptionally good (or exceptionally bad), the judges must extend outside the comfort zone of the typical range of scores they normally give. In doing so, they increase their risk of becoming an outlier. A judge will question whether the rest of the panel will be willing to award such a high (or low) score, and many will make an adjustment to award a score closer to what they anticipate the panel average will be. The result is that the very best routines seemingly get nit-picked for deductions that are overlooked in routines with more egregious errors (or a very poor routine is shown mercy). In such instances, the judge who does not conform to the anticipated panel average puts themselves at greater risk of being a discarded outlier. Their score would still appear as noise, but it may indeed be the most objectively accurate score on the panel.


Quieting Noise


Just as there are things judges can do to reduce the likelihood of falling prey to bias, there are things judges can do to quiet their noise. Of course well-prepared, competent judges make fewer errors in general than less experienced judges. However, just being aware that noise exists and knowing some of the ways noise occurs can help judges to be more on guard against it. Since being tired amplifies noise, taking precautions to fend off fatigue can help reduce noise, such as: being well rested, staying hydrated, standing up and being active when possible between rotations or sessions, monitoring caffeine intake to avoid a caffeine crash as the caffeine wears off, and likewise taking precautions to avoid an afternoon blood-sugar imbalance.


Competition organizers can help reduce the possibility of noisy errors by having short sessions with a limited number of gymnasts and not mixing multiple levels of gymnasts in the same session. Having multiple levels in the same session not only increases the likelihood of errors; the additional mental energy a judge must expend each time they switch between different sets of rules also makes fatigue set in sooner, thereby increasing the possibility of more mistakes. And while it is not always practical at smaller competitions, having multiple judges perform the same judging task (a judging pair or panel as opposed to an individual judge) can also help to double-check and catch errors.


Reduction of noise is an ongoing process. Post-competition, perform a noise audit, either individually or for the panel of judges. Look to see where and when the outliers occur and, if possible, review video to check whether an outlier may actually have been the most correct score. This review can help bring greater awareness of exactly how noise has occurred.
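One possible starting point for such an audit is sketched below in Python. It compares each judge's scores with the panel median for every routine. The judge labels, the scores, and the choice of the median as the reference point are all assumptions made for illustration; a real audit would use the competition's own score sheets and, where available, video review.

```python
from statistics import mean, median

def noise_audit(judge_names, panel_scores):
    """Summarize how each judge's scores sit relative to the panel median.

    panel_scores holds one list per routine, with scores in judge_names order.
    mean_deviation far from zero suggests a consistent lean in one direction;
    a large mean_abs_deviation with mean_deviation near zero suggests scatter.
    """
    deviations = {name: [] for name in judge_names}
    for scores in panel_scores:
        reference = median(scores)
        for name, score in zip(judge_names, scores):
            deviations[name].append(score - reference)
    return {
        name: {
            "mean_deviation": mean(devs),
            "mean_abs_deviation": mean([abs(d) for d in devs]),
        }
        for name, devs in deviations.items()
    }

# Hypothetical session: three routines scored by four judges.
judges = ["A", "B", "C", "D"]
session = [
    [8.5, 8.5, 8.8, 8.3],
    [9.0, 9.0, 9.3, 9.2],
    [8.0, 8.0, 8.3, 7.8],
]
for name, stats in noise_audit(judges, session).items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

In this made-up session judge C is above the median on every routine, which looks more like a lean than like noise, while judge D lands on both sides of it. Neither pattern proves an error, since the panel median itself may be off, which is why the video review matters.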


Judging Noise


As we watch the Olympics and other competitions, it is easy to blame seemingly inaccurate scores on bias. Sometimes it might be judging bias, but sometimes it is just—noise.




References:


Boen, F., van Hoye, K., Vanden Auweele, Y., Feys, J., & Smits, T. (2008). Open feedback in gymnastic judging causes conformity bias based on informational influencing. Journal of Sports Sciences, 26(6), 621-628. https://doi.org/10.1080/02640410701670393


Bruine de Bruin, W. (2005). Save the last dance for me: unwanted serial position effects in jury evaluations. Acta Psychologica, 118(3), 245-260. https://doi.org/10.1016/j.actpsy.2004.08.005


Dallas, G., & Kirialanis, P. (2010). Judges' evaluation of routines in men artistic gymnastics. Science of Gymnastics Journal, 2(2), 49-57.


Kahneman, D. (2011). Thinking, fast and slow. Macmillan.


Kahneman, D., Sibony, O., & Sunstein, C. R. (2021). Noise: a flaw in human judgment. Little, Brown.


MacMahon, C., & Plessner, H. (2007). The sport official in research and practice. In Developing sport expertise (pp. 194-214). Routledge.


Plessner, H. (1999). Expectation Biases in Gymnastics Judging. Journal of Sport and Exercise Psychology, 21(2), 131-144. https://doi.org/10.1123/jsep.21.2.131


Plessner, H., & Haar, T. (2006). Sports performance judgments from a social cognitive perspective. Psychology of Sport and Exercise, 7(6), 555-575. https://doi.org/10.1016/j.psychsport.2006.03.007


Plessner, H., & MacMahon, C. (2013). The sport official in research and practice. In Developing Sport Expertise: Researchers and Coaches put Theory into Practice (pp. 71-95). Routledge.


Ste-Marie, D. (2003). Expertise in sport judges and referees: Circumventing information-processing limitations. Expert performance in sport: Advances in research on sport expertise, 169-190.

Ste-Marie, D. M., Valiquette, S. M., & Taylor, G. (2001). Memory-influenced biases in gymnastic judging occur across different prior processing conditions. Research Quarterly for Exercise and Sport, 72(4), 420-426.

