Calibration in Gymnastics Judging
- Kathi-Sue Rupp
- Apr 12
We often hear gymnastics judges talk about the need for calibration when they judge.
But what exactly is calibration, why do judges need to calibrate, and what are effective ways to calibrate?

Calibration refers to developing an internal scale during a series of judgments. It is essential for ensuring uniform judgments throughout the course of a competition and across competitions.
A calibration process occurs as gymnastics judges settle into a competition. As initial decisions are made, judges acclimate to the other judges on the panel and to the level of performance being presented. In short, calibration is how judges adjust to the level of competition and to the panel they are judging with.
Calibration Occurs on Two Levels
For gymnastics judges, calibration must happen on two levels: 1) an internal calibration of oneself for internal consistency (i.e., to ensure that they, as an individual, rate performances on a uniform and consistent scale), and 2) an intra-panel calibration with the other judges on the panel (i.e., to ensure that the judges within a panel give similar ratings for each performance). The internal calibration aligns the judge’s existing knowledge and preparation with the expected level of competition. It allows them to make adjustments for variables particular to that competition (e.g., a competition with many similar high-caliber gymnasts requires precise judging to differentiate between athletes, while a lower-level competition may be judged with more leniency and compassion). The intra-panel calibration brings about a more uniform evaluation across the panel, striving for agreement in the judgments being made.
Fasold et al. (2012) examined the internal calibration aspects of the judging process and determined that in order to calibrate internally, judges tend to avoid extreme judgments at the beginning of a series, because early extreme judgments (both high and low) would limit degrees of freedom for subsequent judgments (e.g., withholding extremely good scores early on to “leave room” for a potentially better performance later on). This process can produce an overall order bias where good performances early in a competition receive harsher evaluations than they would have if they occurred later in the line-up. It also results in a suppression of scores into a narrow range, particularly at the start of a competition, until the extremes have been effectively calibrated.
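One rough way to look for this kind of early score compression in your own scoring is to compare the spread of scores at the start of a session with the spread later on. The sketch below is only an illustration: the scores are invented, and the function names and the `early_count` cutoff are assumptions, not part of the Fasold et al. method. A narrow early spread can also simply mean the early routines were similar in quality, so treat the result as a prompt for review, not as proof of bias.

```python
def score_spread(scores):
    """Return the range (highest minus lowest) of a list of scores."""
    return max(scores) - min(scores)


def early_vs_late_spread(scores_in_order, early_count):
    """Split one judge's scores (in competition order) into an early and a
    later block and return the spread of each block."""
    early = scores_in_order[:early_count]
    late = scores_in_order[early_count:]
    if not early or not late:
        raise ValueError("both the early and the later block need scores")
    return score_spread(early), score_spread(late)


# Hypothetical execution scores from one judge, in the order judged.
scores = [8.4, 8.5, 8.3, 8.6, 9.1, 7.6, 8.9, 7.9]

early, late = early_vs_late_spread(scores, early_count=4)
print(f"early spread: {early:.1f}, later spread: {late:.1f}")
```

In this made-up session the first four scores sit within about three tenths of each other, while the later scores span a point and a half, which is the pattern one would expect if the extremes were being avoided early on.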
On a panel level, judges need to get the feel for the internal calibration of the other judges on the panel and align their own internal scale with the others with whom they will be judging. The calibration process is not actually finished until routines have been evaluated at the extremes and the judges have a feel for how “brave” (or conservative) the panel will be in its recognition of gymnastics excellence. Otherwise, it is common to have the best routines nit-picked for errors that are overlooked in routines with more egregious faults.
The result of the calibration process should be consistent, aligned judgments in which gymnasts are judged on an even scale, not only within a single competition but also throughout the season and the cycle.

How to Calibrate Effectively
“If you fail to plan, you are planning to fail.”
~Benjamin Franklin
One purpose of judges’ meetings before a competition is to help the judges calibrate on both an individual and intra-panel level. However, judges’ meeting materials are rarely optimized for effective calibration. Here are some ways to improve the calibration process:
Key Points
Calibration needs to occur before the competition begins.
Judges need feedback during the calibration process.
Calibration needs to happen with full routines at the extremes (the very best and the very worst), and not just evaluating the performance of individual elements.
Effective calibration requires building trust among the members of a judging panel.
Pre-competition Calibration
Calibration needs to occur before a competition begins. Otherwise, it will occur during the first few routines of the competition, which is unfair to the gymnasts competing early on. At a minimum, allowing judges to view podium training or warm-ups permits the judges to view the potential range of performance. Individual judges can then feel more confident to reward excellence early on. In-competition calibration behaviors (i.e., calibrating during the early part of a competition) arise from an information-poor context as officials adjust for an otherwise unknown level of play.
Provide Feedback During the Calibration Process
Pre-competition calibration can begin long before the competition, or even the season, starts. However, feedback during the calibration process is essential. Without detailed feedback, judges are just blindly conforming to a dictated standard, which is known as normative conformity. Alignment based on normative conformity will produce an inconsistent, temporary alignment at best because it is not rooted in understanding, but is merely a blind adherence to an imposed standard in order to fit in at that moment.
In an ambitious effort to improve the uniformity of NCAA judgments across the USA throughout the entire season, the USA men’s National Gymnastics Judges Association (NGJA) runs an annual national pre-season calibration exercise called the "NCAA Challenge."
An important part of this process is that the judges have detailed feedback on how their scores might differ from what an expert panel has determined to be an accurate evaluation of the exercise. The judges can compare their notes to those of the expert panel and even review the routines again to verify where they differ. This process, which happens outside of the competition arena, is more likely to lead to an informed long-term adjustment, based on learning, as opposed to a normative conformity, which is temporary at best.
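To make the idea of detailed feedback concrete, here is a minimal sketch of a per-routine comparison between one judge's scores and an expert panel's reference scores. It is not how the NCAA Challenge actually reports results; the routine labels, scores, and function name are all hypothetical. The point is that a signed average (am I high or low overall?) and an absolute average (how far off am I on a typical routine?) answer different questions.

```python
def feedback_report(judge_scores, expert_scores):
    """Compare a judge's scores with expert reference scores.

    Returns (differences per routine, mean signed difference,
    mean absolute difference). Differences are judge minus expert,
    so a positive value means the judge scored higher than the experts.
    """
    shared = [r for r in expert_scores if r in judge_scores]
    if not shared:
        raise ValueError("no routines in common to compare")
    diffs = {r: round(judge_scores[r] - expert_scores[r], 3) for r in shared}
    mean_signed = round(sum(diffs.values()) / len(diffs), 3)
    mean_abs = round(sum(abs(d) for d in diffs.values()) / len(diffs), 3)
    return diffs, mean_signed, mean_abs


# Hypothetical scores for three video routines.
judge = {"routine_1": 8.6, "routine_2": 9.0, "routine_3": 7.9}
expert = {"routine_1": 8.4, "routine_2": 9.3, "routine_3": 7.8}

diffs, mean_signed, mean_abs = feedback_report(judge, expert)
for routine, diff in diffs.items():
    print(f"{routine}: {diff:+.1f}")
print(f"mean signed difference: {mean_signed:+.2f}")
print(f"mean absolute difference: {mean_abs:.2f}")
```

In this invented example the judge is neither high nor low on average, yet is two tenths away from the reference on a typical routine, and the largest gap is on the strongest routine. That is exactly the kind of detail a judge would want to review against the expert panel's notes.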
Likewise, the FIG MTC E jury examples (available on STS) are meant to help judges calibrate for the FIG Judges’ exam and judgment throughout the cycle. However, many of those examples lack detail, making it difficult for judges to understand when and how they deviate from the expected norm. As such, any adjustments the judges make are rooted in normative conformity, not an accurate, informed understanding. Changes in judges’ behavior due to normative conformity are temporary, superficial changes for the sake of passing the exam that are not accompanied by an actual change in one’s beliefs and judging habits.
Calibrate to the Extremes, Not an “Average” Performance
Since the calibration effect shows up as an avoidance of extreme judgments, judges cannot effectively calibrate to an average-quality routine. They need to calibrate to routines at the extremes, particularly on the extremely good end of the scale, while also establishing what an appropriate “mercy score” would be for the given competition.
In a subsequent experiment, Fasold et al. (2015) determined that observing examples of potential extreme performances (i.e., the potential best and worst) immediately before a series of evaluations was an effective and efficient method of calibration for handball coaches. Coaches who were shown the best and worst performances from the previous year’s test before evaluating a series felt confident to give more extreme ratings as warranted, even at the start of the series. By contrast, coaches in the control group, who did not complete the pre-evaluation calibration exercise (i.e., did not view the prior year’s best and worst performances), avoided extreme judgments during the early evaluations in the series.
The same type of pre-competition calibration process could likely optimize calibration for gymnastics judges too, particularly for the upper end (best) performances. It is essential to help judges differentiate between very good, excellent, and even perfect performances. I stress the need to focus more on the excellent end of the spectrum because those are the performances that will contend for medals. Ineffective calibration at the upper end can lead to an uneven application of the rules, with a gymnast winning on higher difficulty because the execution panel was not calibrated well enough to reward excellent execution on an even, consistent scale.
Additionally, it is vital in gymnastics that the calibration process includes scoring entire routines, not just individual elements, and calibrating with whole routines at the extremes. The standard current measure of gymnastics judging accuracy is how well a judge’s score for a routine aligns with the scores of the other judges on the panel for a given routine. Judges are not evaluated for their accuracy in judging individual gymnastics elements; they are evaluated by how their scores for entire routines relate to the other judges on the panel. However, at many pre-competition judges’ meetings, the focus is often exclusively on the evaluation of specific individual elements. The panel needs to calibrate to reward gymnastics excellence at the routine level and TRUST that the rest of the panel will do the same. Otherwise, individuals on the panel may still experience a fear of being the only member of the panel who sticks their neck out where it will be chopped off (i.e., their score is discarded) for being at the extreme end of the range relative to the rest of the panel.
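The arithmetic behind that fear can be sketched in a few lines. The example below assumes a panel rule in which the single highest and single lowest scores are dropped and the remaining scores are averaged; the exact rule varies by competition and panel size, so treat that rule, the function names, and the scores as assumptions made for illustration.

```python
def panel_score(scores):
    """Average the panel's scores after dropping one highest and one
    lowest score (an assumed rule, used here only for illustration)."""
    if len(scores) < 3:
        raise ValueError("need at least three scores to drop the extremes")
    counting = sorted(scores)[1:-1]
    return sum(counting) / len(counting)


def deviations_from_panel(scores):
    """Return each judge's score minus the panel score, in panel order."""
    final = panel_score(scores)
    return [round(s - final, 3) for s in scores]


# Hypothetical execution scores from a five-judge panel for one excellent
# routine. The first judge rewarded it boldly; the others stayed cautious.
panel = [9.4, 8.8, 8.9, 8.7, 8.9]

print(f"panel score: {panel_score(panel):.3f}")
for judge_number, dev in enumerate(deviations_from_panel(panel), start=1):
    print(f"judge {judge_number}: {dev:+.3f}")
```

Here the judge who gave 9.4 has that score discarded and ends up more than half a point away from the panel score, even if 9.4 was the most accurate evaluation of the routine. Whole-routine calibration at the extremes is meant to let the rest of the panel move with that judge.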
Build Trust as a Panel in a Psychologically Safe Environment
All of the pre-competition preparation will go out the window if the members of a judging panel do not trust that the other judges will judge the same way on the competition floor.
It is believed that an objectively judged competition is one in which there is high agreement among judges. The judges’ performance is evaluated based on this agreement, introducing pressure on officials to normatively conform. In gymnastics judging, conformity bias manifests itself as an adaptation judges make of their own scores toward those of their judging colleagues. Even when judges are presented with a pre-competition judging standard, if a panel is decided, or drawn, just before a competition begins, there is no opportunity for the panel to build trust, and there is a question of what the panel will actually do on the competition floor. Can each judge trust that the other judges on the panel will adhere to the prescribed standard, particularly when it comes to the extremes?
In recent years, some judging organizations have provided pre-competition calibration meetings with the full judging panels for championship-level competitions. During these preparation meetings, judges are provided with a psychologically safe environment, without fear of repercussions for not aligning with the rest of the panel in this pre-competition setting. It is a time to discuss and resolve judging discrepancies outside of a competition arena. Trust can be built among the judging panel members that gymnastics excellence will be uniformly recognized. That same level of trust cannot be built when a judging panel is drawn just before a competition begins, which is why routines judged at national championships (with panels who have prepared together) often score more to the extremes than comparable routines at international competitions (where no such trust can be built).
Summary: What Can be Done
Assign judging panels early enough so that judges can calibrate in the time leading up to a competition
Make sure the competition schedule permits judges to watch podium training or warm-ups
Head judges need to provide detailed feedback to panel judges in a psychologically safe pre-competition setting
Organize and participate in pre-competition and pre-season calibration exercises (like the NCAA Challenge) that provide detailed feedback
A well-structured, systematic, and optimized approach to calibration is necessary to ensure a consistent, uniform judging standard is applied in a competition setting. Effective calibration will substantially contribute to the accuracy of evaluations and consistency of scores both within individual competitions and across a competition season.
What effective calibration techniques have you encountered?
Share them using the feedback form below!
Resources:
Boen, F., Van Hoye, K., Vanden Auweele, Y., Feys, J., & Smits, T. (2008). Open feedback in gymnastic judging causes conformity bias based on informational influencing. Journal of Sports Sciences, 26(6), 621-628.
Boen, F., Vanden Auweele, Y., Claes, E., Feys, J., & De Cuyper, B. (2006). The impact of open feedback on conformity among judges in rope skipping. Psychology of Sport and Exercise, 7, 577-590.
Deutsch, M., & Gerard, H. B. (1955). A study of normative and informational social influences upon individual judgment. Journal of Abnormal and Social Psychology, 51, 629-636.
Fasold, F., Memmert, D., & Unkelbach, C. (2012). Extreme judgments depend on the expectation of following judgments: A calibration analysis. Psychology of Sport and Exercise, 13, 197-200.
Fasold, F., Memmert, D., & Unkelbach, C. (2015). A theory-based intervention to prevent calibration effects in serial sport performance evaluations. Psychology of Sport and Exercise, 47-52.
Heiniger, S., & Mercier, H. (2021). Judging the judges: evaluating the accuracy and national bias of international gymnastics judges. Journal of Quantitative Analysis in Sports, 17(4), 289-305.
MacMahon, C., & Mildenhall, B. (2012). A practical perspective on decision making influences in sports officiating. International Journal of Sports Science & Coaching, 7(1), 153-165.
Myers, D. G. (2010). Social Psychology. New York: McGraw Hill.
Scheer, J. K., Ansorge, C. J., & Howard, J. (1983). Judging bias by viewing contrived videotapes: A function of selected psychological variables. Journal of Sport Psychology, 5, 427-437.
Vanden Auweele, Y., Boen, F., De Geest, A., & Feys, J. (2004). Judging bias in synchronized swimming: Open feedback leads to non-performance-based conformity. Journal of Sport and Exercise Psychology, 26(4), 561-571.