Revealing Discrimination By Concealing Applicant Names During Evaluations

by Haruka Uchida

Disparities in evaluation outcomes are pervasive, raising concerns that some candidates are systematically disadvantaged due to discrimination. This has fueled debates over whether concealing candidate names during evaluations, also known as “blinding,” is an effective solution. If evaluators use candidate identity information to discriminate, then hiding this information may improve outcomes for traditionally disadvantaged groups. However, blinding may also conceal useful information, leading to the acceptance of lower-quality candidates.

Whether blinding induces a representation-quality tradeoff depends on how evaluators use the information conveyed by names. If evaluators use the information to cater to biased preferences, then blinding can simultaneously increase representation and improve the selection of high-quality candidates. If names instead serve as informative signals of quality, blinding may shift representation at the expense of quality. These competing mechanisms lead to a fundamental question: why do disparities in evaluations exist in the first place?

I answer these questions in my job market paper by implementing two field experiments in the review process of a major international academic conference in the field of computational neuroscience. By combining the experimental variation with a model of how reviewers assign scores, I quantify how different forms of discrimination contribute to disparities.

Hiding author names benefited early-career applicants and applicants from non-top-20 ranked institutions

How does blinding change reviewer decisions? The main challenge in answering this question is that in most settings, reviewers and applicants choose whether to participate in a blind process, so differences between blind and non-blind regimes may reflect differences in who chooses to take part rather than the effect of blinding itself. I overcome this with two stages of randomization. First, each of the 245 reviewers was randomly assigned to either see author lists (“non-blind”) or not (“blind”), which ensures that blind and non-blind reviewers have similar characteristics on average. Second, each of the 657 submitted papers was randomly assigned to two blind and two non-blind reviewers. This allows me to compare how the same paper is judged under blind versus non-blind review.
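For intuition, here is a minimal sketch of how such a two-stage assignment could be simulated in Python. The reviewer and paper counts and the two-blind-plus-two-non-blind split come from the description above; the identifiers and the assignment routine itself are hypothetical simplifications, not the conference's actual procedure (which, for example, would also need to balance reviewer workloads).

```python
import random

random.seed(0)

# Stage 1: randomly assign each reviewer to the blind or non-blind arm.
reviewers = [f"reviewer_{i}" for i in range(245)]   # hypothetical IDs
random.shuffle(reviewers)
blind_pool = reviewers[: len(reviewers) // 2]
nonblind_pool = reviewers[len(reviewers) // 2:]

# Stage 2: assign every submitted paper to two blind and two non-blind
# reviewers, so each paper is scored under both regimes.
papers = [f"paper_{j}" for j in range(657)]          # hypothetical IDs
assignments = {
    p: {
        "blind": random.sample(blind_pool, 2),
        "non_blind": random.sample(nonblind_pool, 2),
    }
    for p in papers
}

print(assignments["paper_0"])
```

Because every paper receives scores from both arms, within-paper comparisons of blind and non-blind scores difference out anything fixed about the paper itself.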

For each assigned submission, reviewers receive the title, a 300-word abstract, and a two-page summary, and assign a score from 1 to 10. I test how gaps in reviewer scores by applicants’ student status (student vs. post-PhD), affiliated institution rank (top-20 vs. non-top-20), and gender (male vs. female) differ across blind and non-blind reviewers. The applicant is the individual who submits the work and would present it if accepted. Because co-authorship is common in computational neuroscience, I also report effects by co-author characteristics in my paper.

Figure 1 shows average scores by applicant traits and by whether the reviewer received author lists. Among non-blind reviewers, applicants who are students and applicants from non-top-20 ranked institutions receive significantly worse scores than their more senior counterparts and counterparts from higher-ranked institutions. When the same submissions are scored by reviewers who do not receive author lists, the score gaps by student status and institution rank shrink. Gender gaps narrow directionally, but the change is not statistically significant. These patterns indicate that reviewers do use author names when the information is provided to them.

Figure 1: Reviewer Scores and Effects of Blinding

(a) By Applicant Student Status
(b) By Applicant Institution Rank
(c) By Applicant Gender

Hiding author names did not significantly change the evaluation’s ability to screen on quality

Does blinding cause evaluators to change who receives favorable evaluations without sacrificing the ability to screen for quality? The main challenge in answering this question is that in most evaluation settings, underlying candidate quality is not observed by the researcher.

Figure 2: Effects of Blinding on Quality

(a) Number of Citations
(b) Journal-Weighted Publication Status

To answer this question, I track each submission for five years after the conference and collect proxy measures of its quality: citation counts and publication status. Figure 2 plots the relationship between a paper’s percentile rank in these quality measures and its percentile rank in blind and non-blind scores. A paper’s blind score is as good a predictor of its citation and publication outcomes as its non-blind score. Papers that would be accepted under blind review have citation and publication outcomes comparable to those of papers that would be accepted under non-blind review.
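As a concrete illustration of this comparison, the sketch below checks whether two scores track a later outcome equally well using Spearman rank correlations on simulated data. The variable names, the simulated data-generating process, and the choice of rank correlations are my own illustrative assumptions, not the paper's estimation procedure.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical data: one row per paper, with its average blind score,
# average non-blind score, and a five-year citation count, all driven
# by an unobserved quality index.
n_papers = 657
quality = rng.normal(size=n_papers)
blind_score = quality + rng.normal(scale=1.0, size=n_papers)
nonblind_score = quality + rng.normal(scale=1.0, size=n_papers)
citations = np.round(np.exp(quality + rng.normal(scale=0.5, size=n_papers)))

# Compare how well each score's rank tracks the citation rank.
rho_blind, _ = spearmanr(blind_score, citations)
rho_nonblind, _ = spearmanr(nonblind_score, citations)
print(f"blind score vs. citations:     rho = {rho_blind:.2f}")
print(f"non-blind score vs. citations: rho = {rho_nonblind:.2f}")
```

If the two correlations are similar, removing author names has not degraded the evaluation's ability to rank papers by later-observed quality, which is the pattern found in the experiment.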

Using blind evaluations to disentangle forms of discrimination

Why does blinding change representation of selected applicants without changing quality? What underlying forces drive gaps in evaluation outcomes in the first place?

To answer these questions, I formulate a model of how non-blind reviewers assign scores to submissions using their content and author traits. I then estimate this model to decompose disparities into four distinct forms of discrimination:

  • accurate statistical discrimination (Phelps, 1972; Arrow, 1973; Aigner and Cain, 1977), where reviewers use author group memberships to accurately update beliefs about underlying paper quality
  • inaccurate statistical discrimination (Bordalo et al., 2019; Bohren et al., 2019; Coffman et al., 2021), where reviewers rely on inaccurate beliefs about paper quality
  • pursuit of alternative objectives beyond paper quality, such as favoring applicants whose acceptance would benefit others the most
  • all other determinants of disparities, including taste-based discrimination and animus (Becker, 1957)

The thought experiment motivating the decomposition is the following. Consider two submissions with identical content but submitted by authors with different traits. To what extent can differences in reviewer scores be attributed to differences in true quality, to differences in reviewers’ misbeliefs about quality, or to reviewers’ beliefs about alternative objectives?
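Purely as shorthand for the four channels listed above, and not as the paper's estimating equation, the non-blind score gap between two such submissions can be written schematically as:

```latex
\Delta \text{score}
  = \underbrace{\Delta_{\text{true quality}}}_{\text{accurate statistical}}
  + \underbrace{\Delta_{\text{misbeliefs}}}_{\text{inaccurate statistical}}
  + \underbrace{\Delta_{\text{alt. objectives}}}_{\text{other goals}}
  + \underbrace{\Delta_{\text{residual}}}_{\text{tastes, animus, other}}
```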

Estimating this model requires overcoming two main challenges. First, distinguishing between actual and perceived quality differences requires observing reviewer beliefs. To address this, I run a second experiment with the same conference, directly eliciting, during the review process, reviewers’ beliefs about submission outcomes (future citation and publication status) and about alternative objectives (e.g., how much the applicant’s acceptance would benefit others).

Second, identifying discrimination requires accounting for group differences in submission content; otherwise, differences in outcomes may reflect variation in content rather than discrimination. This is a longstanding challenge because evaluators often base decisions on aspects of a submission’s content that researchers cannot observe. I address this comparability issue by using blind scores as a proxy for submission content, since blind reviewers assign scores based only on the content, without seeing author names.

Figure 3: Decomposing Non-Blind Score Gaps

Figure 3 presents the decomposition results. I find that the underlying forms of discrimination driving disparities in reviewer scores differ across traits. The score gap by student status can be entirely explained by two channels: reviewers hold overly pessimistic beliefs about the quality of papers submitted by students, and they value alternative objectives such as talk quality, which they believe is worse for students than for more senior applicants. In contrast, the score gap by institution rank is not explained by these channels and is instead consistent with a preference for applicants from top-ranked institutions (or animus against those from non-top-20 ranked institutions).

In sum, the efficacy of blinding depends on why disparities exist in the first place. My experiments show that blinding can shift representation without compromising the ability to screen on quality. My model decomposition helps explain why: the mechanisms that generate changes in representation can offset each other in their effects on quality. More broadly, the decomposition demonstrates how data from blind evaluations can be leveraged to quantify the sources of disparities that exist in the absence of blinding.

These insights, and the methodology developed in my paper, extend beyond academic review processes. Many policy-relevant evaluation settings, such as job hiring, face potential tradeoffs between information, representation, and quality. Understanding the mechanisms driving disparities therefore remains essential for designing fair and effective evaluation systems.

About the Author

Haruka Uchida is a PhD candidate in the Department of Economics at the University of Chicago.

Her research focuses on labor economics, education, and experimental economics. To learn more about her work, visit her website: https://www.harukauchida.com/
