
The Blind Spot in the Machine: What 25,500 LLM Evaluations Reveal About AI Hiring Bias

In a 2023 study published in Scientific Reports, researchers at the University of Deusto in Spain found a troubling pattern in how people work with artificial intelligence. When participants used an AI tool with a built-in error, they continued to make the same biased choices even after the AI was removed entirely. As reported by Scientific American, 80% of these people noticed that the AI was making mistakes, yet they still copied its biased decisions. The bias did not just stay inside the software; it rubbed off on the humans who used it.
This cognitive contagion is exactly why we must look more closely at how artificial intelligence affects decision-making. When we use AI models to generate images or sounds, it is quite easy for a trained person to spot the bias and change the prompt. The cover image of this article shows a good visual example of how Google Gemini fills in the blanks. Every image was generated with the same simple prompt: "Generate me an image of a person at work," with a specific nationality added. I did not fill in any other details, such as location, gender, or profession. Just looking at the resulting images, most people can easily guess the nationality.
This is a fun way to test a model and see how it interacts with real-world stereotypes. However, when these models are used in a professional environment, it is easy to forget about this bias. We often assume that model answers are objective and correct, and we use them to make decisions that affect businesses and real people's lives. Bias is everywhere. It is easy to spot in pictures, but it is much harder to see when we use models to evaluate things that must go through a personal filter. For example, it is impossible to give a perfect score to a resume because there is no single right answer. I noticed how hard it was for a model to evaluate my pictures during my previous project, which is why I wanted to look deeper into resume screening.
For this experiment, I took the actual resume I used to apply for my job at re:cinq. I asked ten different large language models to evaluate how well my resume matched seventeen job descriptions I took from LinkedIn and anonymised. Through this work, I want to show that bias is not just about giving penalties to candidates. It is also about artificially boosting certain profiles. When a model unfairly boosts a candidate, a company risks hiring someone who is not actually a good match for the job.
The danger of this technology is that you never hear from the false negatives. The qualified people whom an AI model rejects do not reapply, do not sue, and do not appear in your HR dashboards. The bias remains invisible because the people it harms are the least visible to you. If you wait for complaints to show you the bias in your system, you will wait forever.
Furthermore, this bias does not stop at your company's front door. The same kinds of language models are increasingly used to write interview preparation guides, promotion recommendations, compensation benchmarks, and performance review summaries. A small, invisible disadvantage at every step of an employee's career will compound over time. You do not need a single decision to be terrible for the final, cumulative outcome to be deeply unfair.
Using these tools is also becoming a major legal risk. The European Union AI Act classifies AI systems used in recruitment and human resources as high-risk, which carries heavy financial penalties for non-compliance. In the United States, New York City already requires annual independent bias audits of automated employment decision tools, and federal regulators such as the Equal Employment Opportunity Commission are applying the same strict rules to algorithmic screening as to traditional discrimination. In May 2025, a federal judge in California granted collective-action status in the case of Mobley v. Workday, allowing a massive lawsuit to proceed on behalf of applicants who argue that automated screening discriminated against them. Saying "we trusted the model" is no longer a valid legal defence for any business.
Hiring is the canary in the coal mine for algorithmic bias. It is the easiest place to measure bias because the input is structured and the output is a simple score. The same language model that changes its mind because of a name on a resume will do the same thing when reading a medical note, a loan application, a code review, or a content moderation case. If we can prove and measure the bias here, we have shown that it exists everywhere else, where it is much harder to test. My goal with this project is to show that this bias exists, demonstrate how easy it is to find, and help you understand that you must consider these errors when you analyse model results.
Explore the full results
All 25,500 evaluations are public. Filter by model, resume variant, and job description to see the bias for yourself.
View the interactive Hiring Bias Web App →
What the Data Lets Us Say
We have completed our data collection. Our dataset contains 25,500 scored evaluations, representing ten models, thirty resume variants, seventeen job descriptions, and five repetitions per test. We also collected nearly 5,000 evaluations from a second-stage AI auditor using gemini-2.5-pro to judge whether the difference in score between a normal resume and a modified resume was justified, mixed, or biased. All of our raw findings are available for public inspection on the Hiring Bias Web App, and the full code is available in our Hiring Bias GitHub Repository.
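The size of the dataset follows directly from the experimental grid described above. The sketch below enumerates that grid using only the four counts given in the article; the function name and the use of integer indices in place of real model, variant, and job identifiers are illustrative assumptions, not the project's actual code.

```python
from itertools import product

# Counts taken from the article; the index ranges are placeholders
# standing in for the real model, variant, and job identifiers.
N_MODELS = 10
N_VARIANTS = 30
N_JOBS = 17
N_RUNS = 5


def build_grid(n_models=N_MODELS, n_variants=N_VARIANTS,
               n_jobs=N_JOBS, n_runs=N_RUNS):
    """Return every (model, variant, job, run) combination as index tuples."""
    return list(product(range(n_models), range(n_variants),
                        range(n_jobs), range(n_runs)))


grid = build_grid()
print(len(grid))  # 25500
```

Each tuple corresponds to one scored evaluation, so 10 × 30 × 17 × 5 gives the 25,500 evaluations reported.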
Headline Findings
Almost half of the score differences we observed were flagged as bias by our independent AI auditor. Across the full audit of 4,930 evaluated pairs, the gemini-2.5-pro judge returned verdicts of 45.0% biased, 53.9% justified, and 1.1% mixed. In our smaller pilot audit using Anthropic claude-opus-4-7, we saw a lower bias rate of around 34%. This suggests that the gemini-2.5-pro judge is stricter or more adept at identifying the specific reasoning patterns that reveal demographic bias, although the pilot also covered a smaller sample. Either way, the main takeaway is that nearly half of the audited score differences between otherwise identical resumes were attributed by the auditor to the demographic change rather than to the actual work experience.
We also discovered that these audit verdicts are highly unstable across different runs. When the auditor was given two different sampled evaluation pairs from the same (variant × model × JD) cell, the final verdict disagreed 46% of the time, nearly half. (Live stat on the methodology page; download the raw audit-verdicts CSV and verify it yourself.) This shows that if you evaluate a model's bias based on a single test run, your conclusion will be brittle and unreliable. This is why our study aggregates five separate runs per cell at the default sampling temperature of 0.7, capturing the natural stochasticity these systems exhibit rather than pretending it doesn't exist.
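One way to measure this kind of instability is to group verdicts by cell and count how often the two verdicts in a cell differ. The sketch below is a minimal illustration under assumed data shapes: the function name, the tuple layout, and the sample verdicts are all made up for the example and are not taken from the project's CSV.

```python
from collections import defaultdict


def disagreement_rate(verdicts):
    """Share of cells whose two audited verdicts differ.

    verdicts: iterable of (cell_key, verdict) tuples, where cell_key is a
    hypothetical (variant, model, job) identifier.
    """
    by_cell = defaultdict(list)
    for cell, verdict in verdicts:
        by_cell[cell].append(verdict)

    # Only cells with exactly two verdicts can agree or disagree.
    paired = [v for v in by_cell.values() if len(v) == 2]
    if not paired:
        raise ValueError("no cell has exactly two verdicts")

    differing = sum(1 for first, second in paired if first != second)
    return differing / len(paired)


# Made-up verdicts for illustration only.
sample = [
    (("school", "model-a", "jd-1"), "biased"),
    (("school", "model-a", "jd-1"), "justified"),
    (("name", "model-b", "jd-2"), "biased"),
    (("name", "model-b", "jd-2"), "biased"),
]
print(disagreement_rate(sample))  # 0.5
```

A rate near 0.5, like the 46% reported above, means a single audited pair tells you little about the cell as a whole.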
A real example from our data shows how this silent bias works in practice. When evaluating a junior full-stack developer role, gemini-2.5-flash dropped its score by an average of 2.8 points across five runs when the applicant's school was changed from a local, lesser-known university to MIT. The baseline resume scored an average of 7.6 out of 10, while the MIT version averaged 4.8. In the most extreme case, run 4, the baseline resume scored 9 while the MIT resume scored 4.
The AI auditor labelled both of these runs as biased with high confidence. When we looked at the explanations written by gemini-2.5-flash, the model never explicitly said that MIT was a bad school. Instead, it subtly rewrote its evaluation. In the baseline version, it praised the candidate's experience with geographic mapping. In the MIT version, it suddenly claimed that this same mapping experience was a concern because it was not directly related to renewable energy. This is a clear example of the silent bias mechanism. The model does not write anything openly offensive. Instead, it invents different justifications to lower the score for the same work history.
This highlights the important distinction between verbal bias and silent bias. Verbal bias occurs when the model explicitly mentions a demographic attribute in its explanation. Silent bias occurs when the model's written explanation appears completely neutral and professional, yet the numerical score still drops. Silent bias is far more dangerous because it is impossible to detect simply by reading the model's output.
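The distinction can be made concrete with a rough check on the model's written explanation. The sketch below assumes the score change has already been judged as biased and only asks whether the explanation names the changed attribute. The function name and the keyword-matching approach are illustrative assumptions; simple keyword matching will miss paraphrases, so treat it as a first filter rather than a detector.

```python
import re


def bias_mechanism(explanation, attribute_terms):
    """Label an already-flagged score change as 'verbal' or 'silent'.

    'verbal' means the explanation names the modified attribute;
    'silent' means the explanation never mentions it.
    """
    text = explanation.lower()
    for term in attribute_terms:
        # Whole-word match so that "MIT" does not match "limit".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, text):
            return "verbal"
    return "silent"


terms = ["MIT", "Massachusetts Institute of Technology"]

print(bias_mechanism(
    "The mapping experience is not directly related to renewable energy.",
    terms,
))  # silent

print(bias_mechanism(
    "A degree from MIT suggests the candidate may be overqualified.",
    terms,
))  # verbal
```

The first explanation paraphrases the pattern described above: the score falls, yet nothing in the text points at the school.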
How the Models Compare
One of the main questions we wanted to answer was which models are the most sensitive to demographic changes. By measuring the mean absolute change in score when we changed a single variable on the resume, we created a clear ranking of the ten models.
| Model | Mean Absolute Score Change | Mean Signed Score Change |
|---|---|---|
| qwen-3-next-80b | 0.405 | −0.396 |
| gemini-2.5-flash | 0.276 | −0.276 |
| gemini-2.5-pro | 0.243 | −0.221 |
| mistral-small-2603 | 0.229 | −0.198 |
| gemini-3.1-pro-preview | 0.110 | −0.063 |
| claude-sonnet-4-6 | 0.101 | −0.032 |
| claude-haiku-4-5-20251001 | 0.101 | +0.014 |
| claude-opus-4-7 | 0.084 | −0.041 |
| mistral-large-2512 | 0.072 | −0.062 |
| llama-4-maverick | 0.068 | +0.016 |
There is a sixfold difference in demographic sensitivity between the most sensitive and least sensitive models in our test. qwen-3-next-80b was the most sensitive to resume modifications, with an average change in score of 0.405. At the other end, llama-4-maverick was the most stable, with an average change of only 0.068. We noticed a very clear cluster of five stable models: all three Claude models, llama-4-maverick, and mistral-large-2512.
Interestingly, a model being a flagship release does not automatically make it fairer. While the Claude models and mistral-large-2512 sit in the stable cluster, the older Google Gemini 2.5 models were highly sensitive. The newer gemini-3.1-pro-preview is much closer to the stable group, which suggests that Google's latest updates have improved stability rather than revealing a persistent brand-level problem.
Additionally, the mean signed score change is negative for almost every model in our tests. This means that, on average, changing a demographic variable on the resume pushed the score down rather than up. The only exceptions were claude-haiku-4-5-20251001 and llama-4-maverick, and their positive changes were extremely small. In our data, then, bias in resume screening acted mainly as a penalty relative to the baseline resume rather than a helpful boost, although individual variants and models can still move scores upward.
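The two columns in the table above answer different questions: the absolute change shows how much scores move, and the signed change shows in which direction. The sketch below computes both for paired scores. It assumes each variant score is paired with a baseline score for the same model and job description, which may differ from how the project pairs its runs, and the sample numbers are made up.

```python
from statistics import mean


def score_changes(baseline_scores, variant_scores):
    """Return (mean absolute change, mean signed change) for paired scores."""
    if len(baseline_scores) != len(variant_scores):
        raise ValueError("baseline and variant scores must be paired")
    deltas = [v - b for b, v in zip(baseline_scores, variant_scores)]
    return mean(abs(d) for d in deltas), mean(deltas)


# Made-up scores for illustration only.
baseline = [7.0, 8.0, 6.0, 7.0]
variant = [6.0, 8.0, 7.0, 6.0]

print(score_changes(baseline, variant))  # (0.75, -0.25)
```

In this example the upward and downward moves partly cancel in the signed mean, which is why a small signed change alone does not show that a model is stable.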
What Triggers the Most Bias?
We also analysed which specific parts of a resume most strongly affect the score. By aggregating our findings across all models and job descriptions, we calculated the average change in score for each modified attribute.
| Modified Resume Attribute | Mean Absolute Score Change | Mean Signed Score Change |
|---|---|---|
| First Name | 0.272 | −0.255 |
| Career Gap | 0.251 | −0.233 |
| Anonymise (Redacted Version) | 0.179 | −0.142 |
| Company Locations | 0.178 | −0.157 |
| Graduation Year | 0.134 | −0.049 |
| Company Names | 0.128 | −0.054 |
| Address Country | 0.127 | −0.071 |
| School | 0.070 | −0.017 |
Swapping the candidate's first name to reflect different ethnic and cultural backgrounds caused the single largest shift in scores, with an average change of 0.272. This is the most damning piece of evidence in our study. A candidate's name contains absolutely zero information about their ability to do the job, yet changing it moved the score more than any other variable did. This is a direct echo of the famous 2004 field study by economists Marianne Bertrand and Sendhil Mullainathan, who showed that resumes with white-sounding names received 50% more callbacks than identical resumes with Black-sounding names.
A career gap was the second most sensitive attribute, with an average change of 0.251. What makes this finding notable is that our resume variant included a clear label explaining that the gap was due to caregiving responsibilities. Even with this explicit context, which should logically explain the time away from work, the models still penalised the candidate heavily.
Company locations were a surprisingly strong driver at 0.178, the fourth-largest effect and almost tied with anonymisation (0.179) in third place. The remaining attributes were much smaller, with the school name, company names, and the country of address ranging from 0.07 to 0.13. In our smaller pilot study, we believed that prestigious schools were major drivers of changes in scores. However, our larger dataset shows that they matter much less to the models than names, career gaps, and company locations do.
Graduation year sat in the middle of the pack, with an average absolute score change of 0.134 and a mean signed change of only −0.049, far smaller than the drops for names, career gaps, anonymisation, or company locations. This is a useful calibration point for our study. Graduation year is a legitimate proxy for years of experience, so some change in score is logically defensible. The fact that the models reacted moderately to this variable suggests that they are not simply responding randomly to every edit.
The Myth of Simple Anonymisation
A common recommendation for reducing hiring bias is to simply remove the candidate's name from the resume. To test this, our experiment included an anonymisation arm with two distinct versions. The first was a name-blinded version where only the gender and ethnicity markers were removed. The second was a fully blinded version that removed names, employer names, schools, locations, and dates.
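The two blinding levels can be pictured as two field sets applied to the same resume. The sketch below is a simplified illustration: the dictionary layout, field names, and placeholder text are hypothetical, and treating the name field as the only gender and ethnicity marker is an assumption made to keep the example short.

```python
from copy import deepcopy

# Hypothetical field sets for the two blinding levels described above.
NAME_BLINDED_FIELDS = {"name"}
FULLY_BLINDED_FIELDS = {"name", "employers", "schools", "locations", "dates"}


def blind(resume, fields, placeholder="[REDACTED]"):
    """Return a copy of the resume with the chosen fields redacted."""
    blinded = deepcopy(resume)
    for key in fields:
        if key not in blinded:
            continue
        value = blinded[key]
        if isinstance(value, list):
            # Keep the list length so the resume structure is preserved.
            blinded[key] = [placeholder] * len(value)
        else:
            blinded[key] = placeholder
    return blinded


# Made-up resume for illustration only.
resume = {
    "name": "Jane Doe",
    "employers": ["Acme Ltd", "Example GmbH"],
    "schools": ["Example University"],
    "skills": ["Python", "Kubernetes"],
}

print(blind(resume, NAME_BLINDED_FIELDS))
print(blind(resume, FULLY_BLINDED_FIELDS))
```

Skills and other job-relevant content stay untouched in both versions, so any score shift between versions has to come from the redacted fields.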
Our results show that blinding the resume shifted the final score by an average of 0.179, making it the third-most sensitive axis in our study. This is an important finding for companies designing hiring policies. It shows that hiding identity signals causes the model to change its score. As noted in the EDPB Bias Evaluation Report published by the European Data Protection Board, simply removing sensitive variables is rarely effective, as language models are highly skilled at identifying proxy variables that still reveal a candidate's background.
Our AI auditor evaluated these blinded runs to determine whether the score changes were driven by the model relying on hidden signals.
| Model | Name-Blinded Bias Rate | Fully-Blinded Bias Rate |
|---|---|---|
| mistral-small-2603 | 70.6% (12/17) | 70.6% (12/17) |
| gemini-2.5-flash | 52.9% (9/17) | 41.2% (7/17) |
| llama-4-maverick | 47.1% (8/17) | 23.5% (4/17) |
| gemini-2.5-pro | 35.3% (6/17) | 41.2% (7/17) |
| claude-opus-4-7 | 29.4% (5/17) | 35.3% (6/17) |
| gemini-3.1-pro-preview | 29.4% (5/17) | 35.3% (6/17) |
| claude-haiku-4-5-20251001 | 23.5% (4/17) | 41.2% (7/17) |
| mistral-large-2512 | 17.6% (3/17) | 35.3% (6/17) |
| qwen-3-next-80b | 17.6% (3/17) | 35.3% (6/17) |
| claude-sonnet-4-6 | 11.8% (2/17) | 23.5% (4/17) |
mistral-small-2603 is an extreme outlier in this test. The auditor flagged its score changes as biased in over seventy per cent of cases (12 of 17) for both blinded versions. The auditor's written reasoning consistently showed that the model had been heavily anchored on the demographic or prestige markers before they were removed.
We also noticed a strange pattern where some models showed higher bias rates under name blinding than under full blinding. For gemini-2.5-flash and llama-4-maverick, hiding only the name produced more flagged score changes than stripping all context. This likely happens because the fully blinded resume removes so much context that the resulting score shift reads as a legitimate reaction to a lack of detail, whereas name-only blinding forces the model to struggle with the missing piece of the identity signal. Some of the models that were stable in our main tests, such as claude-sonnet-4-6 and mistral-large-2512, also had among the lowest name-blinded bias rates. The match is not exact, however: llama-4-maverick, the most stable model in the main test, was flagged in 47.1% of name-blinded cases, while qwen-3-next-80b, the most sensitive, was flagged in only 17.6%.
Is It Systematic Bias or Just Random Error?
When people talk about AI bias, they usually imagine a system that is consistently and intentionally prejudiced against a specific group. However, our data suggest a more complicated reality. Different language models are biased in entirely different directions. One model might penalise a specific region, while another might boost it.
This supports a different framing of the problem. Much of what we call AI bias is actually just statistical noise and random mistakes encoded in the training data, rather than a coherent or unified ideology. The model is simply unpredictable. It makes random mistakes with massive real-world consequences for job seekers.
In 2018, Reuters reported that Amazon quietly abandoned an internal AI recruiting tool after discovering that it was systematically biased against women. The tool, which rated candidates from one to five stars, penalised resumes that contained terms such as "women's chess club captain" and downgraded graduates of two all-women's colleges. It had learned to copy the hiring patterns of the previous ten years, which were heavily male-dominated.
While that was a famous case from a giant technology company, the same thing happens at smaller companies today without ever making the news. One of the founders of our project saw this firsthand at a mid-sized tech company. If you train an automated screening tool on the profiles of people you have already hired, your model will simply learn to copy and reinforce whoever is already in the office. This is further exacerbated when companies demand that candidates write resumes with highly specific culture-fit keywords, which AI models then prioritise as a proxy for talent.
Limitations
To keep our research credible, we must be honest about our limitations. First, this experiment was built using a single baseline resume. This is a proof of concept, not a complete population study. Different resumes, industries, or job roles might show different levels of sensitivity.
Second, our Claude tests were run through the standard subscription interface rather than the official developer API, which means they used different default sampling settings. This difference must be kept in mind when comparing the Claude models to others in our tables.
Third, our AI auditor is itself a language model. The gemini-2.5-pro auditor inherits whatever Google has trained it to consider as bias. A different judge model, such as an OpenAI model, would likely draw different lines. Furthermore, because we are using gemini-2.5-pro to judge other Gemini models, there is a minor risk of self-judging bias, as models have been shown to favour their own family's output style. We chose gemini-2.5-pro because it offered the best balance between reasoning quality and cost, fitting our tight budget of roughly $31 in API charges for the full audit.
Finally, our fully blinded resume variant had to remove years and dates to protect the candidate's age. This naturally means that information about the candidate's total years of experience was also lost, which is an unavoidable confounding factor when evaluating why the scores changed for that specific variant.
Conclusion
As re:cinq co-founder Pini Reznik noted during our team discussions, the central question we must ask ourselves is simple: "It is not about models being biased or not. It is about awareness." We must ask ourselves whether we are truly aware of the bias a model brings to our workflow, and whether that bias is one we are willing to accept.
Large language models are highly complex, expensive products. It is currently impossible for an average company to audit the massive training datasets used by these models, let alone build their own custom foundation models from scratch. While you can fine-tune open-source models with your own data, this still requires significant engineering resources and adds to your business costs.
If you are using an applicant tracking system that includes AI features, you must find out exactly where and how those models are being used. If you can, ask for direct access to the prompts used to evaluate your candidates. If you are writing your own evaluation prompts, test them thoroughly. You must run the same resume through your system multiple times because language models are built on statistical and random processes. A single test is never enough to trust the result.
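A repeated-run check of this kind can be very small. The sketch below scores the same resume several times and reports the mean and the spread. The scoring function is a random stand-in so the example runs without any API access; in practice you would replace it with a call to your own model or vendor, and all names here are hypothetical.

```python
import random
from statistics import mean


def repeated_scores(score_fn, resume_text, job_text, runs=5):
    """Score the same resume against the same job several times."""
    scores = [score_fn(resume_text, job_text) for _ in range(runs)]
    return {
        "scores": scores,
        "mean": mean(scores),
        "spread": max(scores) - min(scores),
    }


def make_fake_scorer(seed=0):
    """Build a stand-in scorer that returns a noisy integer score.

    This is NOT a real model call; it only mimics run-to-run variation.
    """
    rng = random.Random(seed)

    def score(resume_text, job_text):
        return rng.randint(6, 9)

    return score


result = repeated_scores(make_fake_scorer(), "resume text", "job text")
print(result)
```

If the spread across runs is as large as the gap you would act on, a single score cannot support a hiring decision.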
AI is a brilliant tool for parsing natural language, summarising text, and identifying hidden patterns. However, the moment we ask an AI to make open-ended decisions about human capability, it will inject its own training errors and silent biases into the process. We must stop treating AI scores as objective truth and start treating them as highly subjective, unpredictable opinions.