EyeSift
Equity & Research · June 11, 2026 · 15 min read

Stanford HAI AI Detector Bias: 61.22% TOEFL False Positives

Reviewed by Brazora Monk · Source checked June 11, 2026

A Stanford HAI study tested seven major AI detectors on TOEFL essays written by educated non-native English speakers without any AI assistance. On average, the detectors flagged 61.22% of the essays as AI-generated; all seven detectors unanimously flagged 18 of the 91 TOEFL essays, and at least one detector flagged 89 of 91. This is not an edge case; it is the predictable output of a method that mistakes careful second-language writing for machine-like predictability.

Direct answer: AI detectors biased against non-native English writers

Stanford HAI's article “AI-Detectors Biased Against Non-Native English Writers” reports that seven detectors falsely classified an average of 61.22% of TOEFL essays by non-native English writers as AI-generated. The same summary reports that 18 of 91 TOEFL essays were unanimously flagged by all seven detectors and 89 of 91 were flagged by at least one detector.

Use this page for queries about Stanford HAI, non-native English false positives, TOEFL essay detector bias, and why AI detector scores should be treated as triage instead of standalone proof.

Key Takeaways

  • Stanford HAI found that seven major AI detectors falsely flagged an average of 61.22% of TOEFL essays as AI-generated, compared with near-zero false positive rates on essays by U.S.-born eighth-graders.
  • The mechanism is structural, not incidental. Non-native English writers naturally use lower perplexity language — the same statistical property exploited by AI detectors as their primary signal.
  • Do not turn one benchmark into a universal multiplier. Stanford's TOEFL result is severe, Berkeley D-Lab frames the equity risk, and Turnitin says its own current detector did not show statistically significant ELL bias; every detector still needs local validation before use in high-stakes review.
  • Neurodivergent students face similar risks — repetitive phrasing and preference for familiar vocabulary produce the same detection signatures as ESL writing patterns.
  • The EU AI Act still treats education-related AI systems as a high-risk category, and the May 2026 EU agreement set later high-risk application dates: December 2, 2027 for stand-alone high-risk systems and August 2, 2028 for product-embedded high-risk systems.

Primary result

61.22%

Average false-positive rate on TOEFL essays by non-native English writers across seven AI detectors.

Unanimous flags

18 / 91

TOEFL essays were labeled AI-generated by all seven detectors in the Stanford HAI summary.

Any detector

89 / 91

TOEFL essays were flagged by at least one detector, showing why a flag from any single tool does not equal proof.
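The two essay counts above convert to percentages with simple arithmetic. The sketch below uses only the counts reported on this page; the helper name `share` is introduced here for illustration. Note that 61.22% is an average across detectors, so it is not expected to correspond to a whole number of the 91 essays.

```python
# Counts reported in the Stanford HAI summary cited on this page.
TOTAL_ESSAYS = 91
UNANIMOUSLY_FLAGGED = 18   # flagged by all seven detectors
FLAGGED_BY_ANY = 89        # flagged by at least one detector


def share(count, total):
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


print(share(UNANIMOUSLY_FLAGGED, TOTAL_ESSAYS))  # 19.78
print(share(FLAGGED_BY_ANY, TOTAL_ESSAYS))       # 97.8
print(TOTAL_ESSAYS - FLAGGED_BY_ANY)             # 2 essays were flagged by no detector
```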

Source snapshot: Stanford HAI and arXiv

Source checked June 11, 2026. The exact 61.22% statistic comes from Stanford HAI's May 15, 2023 summary of “GPT detectors are biased against non-native English writers,” by Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou. Berkeley D-Lab is useful for equity framing, Turnitin's current instructor guide warns that AI reports should not be the sole basis for adverse action, and the European Commission/Council timeline changed in May 2026.

June 2026 relevance check: a June 10, 2026 New Yorker article about AI-authorship accusations around the Commonwealth Short Story Prize cites detector scores as part of a broader controversy about whether literary style can be reliably classified as human or machine-written. That does not make every current detector equally biased; it reinforces the practical rule on this page: a detector flag can start review, but should not end it.

Consider the following timeline.

July 2023

“GPT Detectors Are Biased Against Non-Native English Writers,” from James Zou's research group at Stanford, appears in the journal Patterns, following Stanford HAI's May 15, 2023 summary of the work. It tests seven popular AI detectors against TOEFL essays: writing by educated non-native speakers without any AI use. Result: a 61.22% average false positive rate. All seven detectors unanimously flag 18 of 91 essays.

2023–2024

Turnitin deploys AI detection to more than 16,000 institutions globally. Universities begin conducting academic integrity investigations based on detection flags. Several documented cases of false accusations against international students emerge in UK, US, and Australian higher education.

Early 2025

Berkeley's D-Lab publishes “The Creation of Bad Students: AI Detection for Non-Native English Speakers,” documenting how detection bias systematically disadvantages international students in American universities. Twelve major universities disable or restrict Turnitin AI detection.

2026

The EU AI Act timeline is updated through the May 2026 simplification agreement. Education remains a high-risk area, but the announced application dates for high-risk rules move to December 2, 2027 for stand-alone systems and August 2, 2028 for product-embedded systems.

The problem has been known, documented, and peer-reviewed since 2023. It has not been fixed. Understanding why requires looking at the technical architecture of AI detection — and at why the fix is not a software patch but a methodological reckoning.

Why Perplexity Analysis Systematically Fails ESL Writers

To understand the bias, you first need to understand what AI detectors actually measure. The dominant approach is perplexity analysis — measuring how “surprised” a language model is by each successive word. Low perplexity means the text was highly predictable: each word was a statistically likely choice given what came before. High perplexity means the writer made unexpected, idiosyncratic choices.
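As a rough illustration of the definition above, perplexity can be computed as the exponential of the average negative log-probability a language model assigns to each token. The probabilities below are made-up numbers, not output from any real detector or model, and the function is a minimal sketch rather than any vendor's implementation.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the tokens)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities from some language model.
predictable = [0.5, 0.5, 0.5, 0.5]    # every word was a likely choice
surprising = [0.5, 0.01, 0.5, 0.01]   # two words were very unexpected

print(perplexity(predictable) < perplexity(surprising))  # True
```

The predictable sequence scores a perplexity of about 2, while the sequence with two unlikely words scores far higher. A detector built on this signal reads the first pattern as more machine-like.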

The core detection hypothesis is: AI text tends toward low perplexity (models select high-probability tokens), while human text tends toward higher perplexity (humans make unpredictable, creative choices). The complementary metric is burstiness — the variance in sentence length and complexity. Human writing alternates between short punchy sentences and long complex ones. AI output flows more uniformly.
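Burstiness has no single standard formula. One simple proxy, assumed here purely for illustration, is the variance of sentence lengths measured in words. The sentence splitter below is deliberately naive, and the sample texts are invented.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on ., ! or ? and count the words in each non-empty sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Population variance of sentence length; 0.0 if fewer than two sentences."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.pvariance(lengths)


uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = ("Stop. The committee deliberated for hours before reaching "
          "a decision nobody liked. Fine.")

print(burstiness(uniform) < burstiness(varied))  # True
```

Evenly sized sentences give a variance of zero, which is the "uniform flow" a burstiness-based signal associates with AI output, whatever the actual author.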

These signals work reasonably well when the human writer is a fluent native English speaker writing in an expressive, variable style. They fail systematically on ESL writers for a straightforward reason: a writer working in their second or third language naturally gravitates toward the same statistical properties as AI output.

When you write in a language you are still acquiring, you:

  • Reach for high-frequency, familiar vocabulary over expressive but risky choices
  • Construct syntactically predictable sentences to avoid grammatical errors
  • Avoid idiomatic expressions, colloquialisms, and creative language play
  • Use formulaic transitions and paragraph structures taught in grammar instruction
  • Repeat useful vocabulary rather than substituting synonyms (which require deeper lexical knowledge)

Every one of these natural ESL writing behaviors reduces perplexity and reduces burstiness. The same statistical signature emerges from completely different causes: in one case, a language model selecting high-probability tokens; in the other, a human writer making linguistically safe choices. A perplexity-based detector cannot distinguish between these two fundamentally different origins.
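The effect of the writing behaviors listed above can be reproduced with a toy model. The sketch below scores two invented sentences against a tiny add-one-smoothed unigram model built from a made-up reference text. Real detectors rely on large neural language models, so this only illustrates the direction of the effect, not its size.

```python
import math
from collections import Counter

# Made-up reference text standing in for "typical" English.
REFERENCE = ("the study is important because the study shows the result "
             "and the result is important for the student").split()


def unigram_model(tokens):
    """Return a word -> probability function with add-one smoothing."""
    counts = Counter(tokens)
    total = len(tokens)
    vocab_size = len(counts) + 1  # one extra slot shared by unseen words

    def prob(word):
        return (counts[word] + 1) / (total + vocab_size)

    return prob


def perplexity(text, prob):
    """exp(mean negative log-probability) over whitespace-separated words."""
    words = text.lower().split()
    mean_nll = -sum(math.log(prob(w)) for w in words) / len(words)
    return math.exp(mean_nll)


prob = unigram_model(REFERENCE)
safe = "the study is important and the result is important"
risky = "this inquiry proves pivotal yet its upshot feels momentous"

print(perplexity(safe, prob) < perplexity(risky, prob))  # True
```

Both sentences are human-written and say roughly the same thing. The one built from familiar, repeated vocabulary scores lower perplexity simply because its words are more common.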

According to the Stanford HAI paper, this explains why the false positive rates on TOEFL essays (61.22%) were so dramatically higher than on essays by U.S.-born eighth-graders (near zero). TOEFL essays are, by definition, careful, clear, vocabulary-controlled English writing from writers deliberately avoiding risky choices. That is precisely what low perplexity looks like from a detector's perspective — regardless of whether a human or a machine produced it.

The Research Landscape: What Studies Have Found

| Study / Source | Key Finding | Population Studied | Implication |
| --- | --- | --- | --- |
| Stanford HAI (Liang et al., 2023) | 61.22% false positive rate on TOEFL essays across 7 detectors | Non-native English graduate writers | Detectors systematically fail ESL populations |
| Berkeley D-Lab (2025) | Frames AI detection as an equity risk for non-native English speakers | International university students (US) | Policies need disclosure, human review, and process evidence |
| Turnitin instructor guidance (2026) | AI reports may misidentify human, AI-generated, or AI-paraphrased text | Turnitin submissions and instructor review workflows | Vendor guidance still requires human scrutiny before adverse action |
| Newer detector methodology claims | Some newer systems claim lower non-native-English bias than early GPT detectors | Vendor and technical-report benchmarks | Still validate on the institution's own writing samples before high-stakes use |
| Turnitin ELL statement | Turnitin says its ELL evaluation did not show statistically significant bias | Turnitin's own detector evaluation | Do not extrapolate Stanford's seven-detector result to every vendor model without testing |

The Turnitin caveat is still significant even though Turnitin disputes a statistically significant ELL-specific bias in its current detector evaluation. Its instructor guide says AI writing reports may misidentify human-written, AI-generated, and AI-paraphrased text and should not be used as the sole basis for adverse action. That is the practical governance rule: treat detector output as a review trigger, not as proof.

Who Is Most Vulnerable: The Five High-Risk Groups

1. International Graduate and Undergraduate Students

This is the group with the most documented risk and the highest institutional stakes. Universities with large international student populations — which includes the overwhelming majority of research universities in the US, UK, Canada, and Australia — may deploy AI detection at scale against writers whose language background changes the detector signal. Berkeley D-Lab frames this as an equity problem because international students can have less institutional familiarity, more severe consequences from academic integrity findings, and less access to writing support resources that document their process.

2. ESL Professionals in HR and Publishing Contexts

AI detection has migrated beyond academic settings. A significant and growing number of publishers, HR departments, and content agencies now screen submitted writing using AI detection tools. For non-native English speakers applying for writing positions, submitting editorial pitches, or providing work samples, the same bias applies. According to a 2025 survey of content agencies by the Content Marketing Institute, 34% now use AI detection as part of their contributor screening process — without specific disclosure to applicants and without protocols for handling elevated ESL false positive rates.

3. Neurodivergent Writers

Neurodivergent writers — including those with dyslexia, ADHD, autism spectrum conditions, and language processing differences — may use writing patterns that overlap with AI detection signatures for reasons unrelated to AI use. Researchers at Berkeley's D-Lab have documented that students with dyslexia often rely on formulaic sentence structures and familiar vocabulary (lower perplexity), while autistic writers may use consistent terminology and prefer explicit, structured expression over varied rhetorical style (lower burstiness). Neither population produces writing that AI systems generated — but both can produce writing that perplexity-based detectors misclassify.

4. Writers in Formal Technical Registers

For writers whose native language is not English, formal technical writing presents compounded risk. Technical registers already produce lower perplexity than expressive writing (standardized vocabulary, formulaic structure, passive voice). When the author is also an ESL writer applying these conventions, the resulting text can produce extremely low perplexity scores — flagging at very high rates even for completely authentic human writing. A Chinese-born medical researcher writing a clinical report in English faces the combined effects of formal register, non-native vocabulary choices, and ESL syntactic patterns.

5. Writers Who Received Legitimate Editing Assistance

Students and professionals who used writing tutors, campus writing centers, or grammar-checking tools before submission may have had their statistical fingerprint altered in ways that reduce perplexity — because good editing resolves idiosyncratic word choices toward cleaner, more standard patterns. For ESL writers who are already close to the detection threshold, editorial polish can push them over it. This creates a perverse incentive: ESL writers who seek legitimate writing assistance may face a higher detection flag rate than those who submit unedited drafts.

The Institutional Response: What Universities Are Doing

The documented evidence has prompted significant institutional rethinking. At least twelve universities — including UCLA, Yale, Vanderbilt, Johns Hopkins, and Northwestern — have either disabled Turnitin's AI detection feature or issued formal policies restricting its use as evidence in misconduct proceedings.

UCLA's Humtech center published a detailed technical analysis of AI detector limitations and recommended against Turnitin AI detection adoption across the UC system. The recommendation specifically cited equity concerns for international and ESL students as the primary driver. Lund University in Sweden has taken a disclosure-based approach instead: rather than attempting to catch AI use, the institution requires students to disclose how AI tools were used in their process, a framework that avoids false positive risk entirely.

The Serials Librarian's peer-reviewed analysis of false positive patterns noted that institutions most likely to perpetuate harm are those that automate detection without human review, treat flagged content as presumptively guilty, and fail to disclose to students that AI detection is being used. The institutions most likely to handle detection responsibly are those that treat detection output as one signal among many, require human review before any adverse action, and actively communicate detection policies and limitations to all students.

The Regulatory Dimension: EU AI Act and ESL Bias

The EU AI Act adds a legal dimension to what has previously been primarily an ethical discussion. Education and vocational training are listed in Annex III as high-risk areas. The timeline has also changed: the European Commission's May 2026 update says rules for systems used in certain high-risk areas, including education and employment, will apply from December 2, 2027, while rules for product-embedded high-risk systems will apply from August 2, 2028. The governance direction is still clear: consequential detector use needs documentation, bias monitoring, disclosure, and human oversight.

  • Accuracy documentation: Vendors must document accuracy across demographic groups — which would require explicit documentation of the ESL bias that currently exists in most commercial detectors
  • Bias monitoring: Systems must be monitored for discriminatory patterns, with documented processes for identifying and mitigating bias
  • Human oversight: Consequential decisions informed by AI systems must involve human review that cannot be delegated away
  • Transparency: Individuals must be informed when AI systems are used to assess them and have meaningful recourse to contest those assessments

For institutions in EU jurisdictions, deploying AI detection against student work without demographic bias documentation, without human oversight protocols, and without transparent disclosure is moving from an ethical risk toward a regulated-system risk. Institutions in non-EU jurisdictions may face similar expectations as the Act shapes global best practices, as GDPR did for data protection. Our AI regulation compliance guide covers how institutions should structure their AI detection governance frameworks.

What Can ESL Writers Do If Flagged?

If you are an ESL writer who has been flagged by an AI detector, the evidence is on your side — but you need to present it effectively. Here is what the research suggests works:

Document your process proactively. Dated drafts, browser history from research sessions, notes, and any communication with tutors or writing centers constitute process evidence that directly contradicts the algorithmic assessment. Keep this documentation from the start of any significant writing project.

Request cross-tool verification. Run your text through multiple independent tools — including free options. Divergent results between detectors (one flags, one clears) are strong grounds for inconclusive classification. EyeSift's free text analyzer can provide an independent assessment. The Stanford HAI research found that different detectors disagree substantially on ESL text — this disagreement is itself evidence of unreliability.
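A cross-tool check can be recorded in a few lines. The detector names below are placeholders, not real products, and the three status labels are assumptions made for this sketch.

```python
def summarize_flags(flags):
    """Summarize per-detector results.

    flags maps a detector name to True (flagged) or False (cleared).
    """
    if not flags:
        raise ValueError("need at least one detector result")
    flagged = sorted(name for name, hit in flags.items() if hit)
    cleared = sorted(name for name, hit in flags.items() if not hit)
    if flagged and cleared:
        status = "inconclusive"   # tools disagree with each other
    elif flagged:
        status = "all_flagged"    # still not proof on its own
    else:
        status = "none_flagged"
    return {"status": status, "flagged": flagged, "cleared": cleared}


# Placeholder detector names for illustration only.
result = summarize_flags({"detector_a": True, "detector_b": False, "detector_c": True})
print(result["status"])  # inconclusive
```

Even an "all_flagged" result deserves caution: in the Stanford HAI summary, all seven detectors unanimously flagged 18 of 91 human-written TOEFL essays.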

Cite the Stanford HAI research directly. The academic misconduct defense community has found that presenting peer-reviewed evidence of systematic ESL bias — specifically the Stanford HAI paper and the Berkeley D-Lab analysis — is an effective framework for contesting detection-based accusations. These are not arguments about the tool being imperfect; they are documented findings about the specific population to which the tool was applied.

Request that the detection report be contextualized. Turnitin's own guidance advises instructors to treat detection scores as indicators for further investigation, not as determinative findings. Major academic misconduct defense attorneys, including Nesenoff & Miltenberg, have documented that detection output alone is legally insufficient as primary evidence in misconduct proceedings.

Our comprehensive guide on AI detection false positives covers the full landscape of false positive risk and institutional response protocols.

What Institutions Should Do: A Framework for Responsible Deployment

The research does not say that AI detection tools should be abandoned wholesale; it says that their deployment must be calibrated to their known limitations. For institutions, four practices constitute the minimum responsible standard:

Require human review before any adverse action. Detection flags should trigger a conversation, a request for process documentation, and contextual review by a human — never an automatic determination. This is not only ethical practice; under the EU AI Act, it will be a legal requirement for consequential applications.
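One way to encode this rule in a review workflow is to make sure no code path can return a sanction. The sketch below is a hypothetical illustration: the `Flag` fields, the step names, and the threshold are all assumptions, and any real threshold would need local validation.

```python
from dataclasses import dataclass


@dataclass
class Flag:
    score: float                  # detector score between 0.0 and 1.0
    has_process_evidence: bool    # drafts, notes, or tutor records on file


def next_step(flag, review_threshold):
    """Return the next procedural step; no outcome is an adverse action."""
    if not 0.0 <= flag.score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if flag.score < review_threshold:
        return "no_action"
    if not flag.has_process_evidence:
        return "request_process_evidence"
    return "human_review"


# Hypothetical threshold chosen only for the example.
print(next_step(Flag(score=0.93, has_process_evidence=False), review_threshold=0.8))
# request_process_evidence
```

The design choice is that the strongest outcome the function can produce is a human review. Any finding comes from people weighing the full context, never from the score.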

Apply heightened caution thresholds for identified high-risk populations. Institutions with significant international student populations should require a higher bar of evidence before acting on a flag, or mandate human review regardless of score, for populations where the documented false positive rate is substantially elevated.

Disclose detection use to students. Students who know their work will be screened can take proactive steps. Transparent disclosure is both ethically appropriate and, in EU jurisdictions, legally required. It also substantially reduces the power imbalance that makes false accusations so harmful: a student who knows the detection is happening and understands its limitations is far better positioned to contest a false positive than one who encounters it for the first time in an accusation meeting.

Evaluate tool performance on your actual student population. A detector's published false positive rate was measured on a controlled benchmark dataset. That dataset does not represent your students. Institutions should pilot-test detection tools on representative samples from their own population — including ESL samples — before deploying at scale in high-stakes contexts.
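A local pilot can be summarized with a per-group false positive rate and a confidence interval. The sketch below assumes every sample is known to be human-written, so any flag counts as a false positive. It uses the Wilson score interval, one common choice for proportions from small samples. The group labels and pilot data are invented for the example.

```python
import math


def wilson_interval(false_positives, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    if n == 0:
        raise ValueError("need at least one sample")
    p = false_positives / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def false_positive_rates(samples):
    """samples: iterable of (group, flagged) pairs for known-human texts."""
    totals = {}
    for group, flagged in samples:
        n, fp = totals.get(group, (0, 0))
        totals[group] = (n + 1, fp + int(flagged))
    return {
        group: {"n": n, "fpr": fp / n, "ci95": wilson_interval(fp, n)}
        for group, (n, fp) in totals.items()
    }


# Invented pilot data: ten known-human essays per group.
pilot = [("esl", True)] * 6 + [("esl", False)] * 4 + [("native", False)] * 10
for group, stats in false_positive_rates(pilot).items():
    print(group, stats["n"], stats["fpr"], stats["ci95"])
```

With only ten samples per group the intervals are wide, which is itself a useful result: a small pilot can reveal a gap between groups but cannot certify that a tool is safe.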

Frequently Asked Questions

Why do AI detectors falsely flag ESL writing?

The core mechanism is perplexity scoring. Non-native English writers naturally use higher-frequency vocabulary, simpler syntactic structures, and formulaic phrasing — all of which produce low perplexity scores. Low perplexity is also the primary statistical signal that AI detectors use to identify machine-generated text. The result is a systematic false positive bias: ESL writing looks statistically similar to AI output on the metrics detectors use, despite being entirely human-authored.

What did Stanford find about AI detectors and non-native English speakers?

Stanford HAI's landmark study, led by James Zou's research group, tested seven major AI detectors on 91 TOEFL essays written by non-native English speakers without AI assistance. The average false positive rate across the seven detectors was 61.22%. All seven detectors unanimously flagged 18 essays. Against native English student essays, the same detectors produced near-zero false positive rates. The study specifically recommended against using AI detectors in educational settings with large non-native English speaker populations.

How many TOEFL essays did the Stanford HAI study test?

The Stanford HAI summary reports 91 TOEFL essays by non-native English students. Seven AI detectors misclassified 61.22% as AI-generated on average, unanimously flagged 18 of the 91 essays, and flagged 89 of 91 essays with at least one detector. Those figures are the reason a detector score should be treated as a review signal, not as standalone evidence.

How much higher are false positive rates for ESL writers vs. native speakers?

The safest answer is source-specific. Stanford HAI found a sharp gap in its seven-detector test: near-perfect performance on U.S.-born eighth-grade essays versus a 61.22% average false-positive rate on TOEFL essays by non-native English writers. Berkeley D-Lab frames this as an equity risk for non-native English speakers. Turnitin, however, says its current detector evaluation did not show statistically significant ELL bias and still tells instructors not to use AI reports as the sole basis for adverse action. Do not apply one multiplier to every detector.

Are neurodivergent students also at higher risk of AI detection false positives?

Yes, for similar structural reasons. Dyslexic writers often rely on formulaic sentence structures and familiar vocabulary — reducing perplexity. Autistic writers may prefer explicit, structured expression and consistent terminology over varied rhetorical style — reducing burstiness. Both patterns overlap with AI output signatures on the metrics most detectors use. Berkeley's D-Lab documented this risk, though systematic demographic data remains limited because vendors have not published neurodivergent-specific false positive rates.

What should I do if I am an ESL student falsely accused based on AI detection?

Present process evidence (dated drafts, research notes, browser history) and the Stanford HAI peer-reviewed research documenting systematic bias against ESL writers. Request cross-tool verification — divergent results between detectors are grounds for inconclusive classification. Cite Turnitin's own guidance that flags are not determinative findings. Academic misconduct defense attorneys confirm that detection output alone is legally insufficient as primary evidence, and institutions have a documented pattern of reversing false positive accusations when properly contested.

Have any universities stopped using AI detection because of the ESL bias?

Yes. UCLA, Yale, Vanderbilt, Johns Hopkins, and Northwestern are among at least twelve universities that have disabled Turnitin's AI detection or issued formal policies restricting its use in misconduct proceedings. UCLA's Humtech center specifically cited equity concerns for ESL and international students as the primary driver. Lund University has adopted a disclosure-based framework instead of detection, completely avoiding false positive risk.

Does the EU AI Act address AI detection bias against ESL writers?

The EU AI Act lists education and vocational training as a high-risk area, and the May 2026 EU timeline update says rules for systems used in certain high-risk areas, including education and employment, will apply from December 2, 2027. Product-embedded high-risk systems move to August 2, 2028. The practical requirement is still the same directionally: document accuracy and bias risk, disclose consequential use, and require human oversight before adverse decisions.

Is the Stanford HAI study still relevant in 2026?

Yes, if it is used precisely. The 61.22% TOEFL result is not a universal false-positive rate for every current detector or every writing sample. It is still relevant because public AI-authorship accusations in 2026 continue to cite detector scores, while source-backed guidance from Stanford HAI and Turnitin points in the same operational direction: do not treat a detector flag as standalone proof without drafts, process evidence, source review, and human judgment.

Are any AI detectors better than others for ESL writing?

Some newer detector methods claim lower false-positive and lower non-native-English bias than early GPT detector systems, but comparative ESL-specific testing across commercial detectors remains limited and vendor-controlled. Institutions should pilot-test any detection tool on a sample representative of their actual student population before deploying at scale in high-stakes contexts.

Check How Your Text Scores Across Multiple Signals

EyeSift's free text analyzer shows you perplexity, burstiness, and AI probability — giving you the full picture before any formal evaluation process.
