EyeSift
Technical Deep Dive · Apr 1, 2026 · 16 min read

How Do AI Detectors Work? 3 Signals and False Positives

Reviewed by Brazora Monk · Last updated April 30, 2026

AI detectors usually score three signal families: statistical predictability, sentence variation, and model-specific fingerprints. Those signals can help, but they also create false positives in short text, formal writing, and non-native English. Here is how the systems work and where they fail.

Key Takeaways

  • AI text detectors work by measuring perplexity (statistical predictability), burstiness (sentence variance), and transformer-level linguistic fingerprints
  • Stanford University (Liang et al., 2023) found that seven detectors falsely flagged an average of 61.22% of TOEFL essays by non-native English speakers as AI, among the highest false positive rates documented for AI text detection
  • OpenAI shut down its own AI Classifier in July 2023: 26% detection rate, 9% false positive rate on human text
  • Watermarking (Google SynthID, Kirchenbauer et al. 2023) is theoretically more robust than statistical detection but is vulnerable to recursive paraphrasing attacks
  • University of Maryland researchers showed theoretically that as LLMs improve, the best achievable detector approaches random guessing, so reliable AI detection may become impossible (arXiv:2303.11156)

Quick Answer

AI detectors do not read text the way humans do. They estimate whether the token sequence looks statistically model-generated. The main signals are perplexity (how predictable the wording is), burstiness (how much sentence complexity varies), and classifier fingerprints learned from labeled human and AI samples. Watermarks work differently: they look for an embedded generation-time signal.

The practical caveat: a detector score is evidence, not proof. Short passages, polished academic prose, non-native English, heavy editing, and unseen model families can all break the signal.

Let us start by debunking the most common misconception about AI detectors: that they work by recognizing specific phrases or stylistic tics that AI models always produce. They do not. Modern AI detection is fundamentally a statistical inference problem — and like all statistical inference, it involves uncertainty, error rates, and failure modes that most users never see.

The stakes of getting this wrong are substantial. Vanderbilt University calculated that Turnitin's claimed 1% false positive rate, applied to their real submission volume of roughly 75,000 papers per year, would wrongly accuse approximately 750 students of AI misuse annually. They disabled the feature campus-wide in August 2023. Multiple UC campuses followed. Understanding the mechanics behind these systems is no longer optional for educators, publishers, and HR professionals deploying them at scale.
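The Vanderbilt figure is plain base-rate arithmetic, and a short sketch makes it easy to rerun with other volumes. The function name is illustrative, and the calculation assumes every submission is human-written, so it gives an upper bound on wrongful flags rather than a forecast.

```python
def expected_false_flags(human_submissions: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers wrongly flagged as AI."""
    if human_submissions < 0:
        raise ValueError("human_submissions must be non-negative")
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return human_submissions * false_positive_rate


# Numbers from the paragraph above: roughly 75,000 papers, 1% claimed false positive rate.
print(round(expected_false_flags(75_000, 0.01)))  # 750
```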

The Foundational Signal: Perplexity

Perplexity is the mathematical backbone of most AI text detectors. Technically, it measures how “surprised” a language model is by a sequence of tokens. The formula derives from cross-entropy: perplexity = 2^H(p), where H(p) is the cross-entropy of the text under the model's probability distribution.

When a large language model generates text, it selects tokens with high probability given the preceding context — words the model considers statistically expected. This produces low-perplexity output. Human writing, by contrast, makes unexpected lexical choices, employs idiomatic expressions, introduces unconventional framings, and deliberately varies phrasing. That produces higher perplexity.
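The formula above can be sketched in a few lines. This is a minimal illustration, assuming you already have the probability a scoring model assigned to each token; the two probability lists are made-up values, not output from any real detector.

```python
import math


def perplexity(token_probs: list[float]) -> float:
    """Perplexity = 2 ** H, where H is the mean negative log2-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    cross_entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy_bits


# Hypothetical per-token probabilities under a scoring model.
predictable = [0.9, 0.8, 0.85, 0.9]  # the model expected nearly every token
surprising = [0.2, 0.05, 0.1, 0.3]   # several unexpected word choices

print(perplexity(predictable) < perplexity(surprising))  # True
```

A real detector would obtain these probabilities from a language model's forward pass; the arithmetic on top of them is the same.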

GPTZero uses a perplexity threshold — scores below roughly 40 are treated as a strong AI signal, when combined with other features. But the signal is imperfect in a specific way: low perplexity is not unique to AI. Any writing that uses constrained, predictable vocabulary produces low perplexity — including formal academic prose, legal writing, and non-native English. The Declaration of Independence notoriously scores as “likely AI-generated” on most perplexity-based detectors.

The Second Signal: Burstiness

Burstiness measures the variance of perplexity across a document — how much sentence-level complexity fluctuates. Human writers naturally oscillate between short, punchy sentences and long, complex constructions. That oscillation is burstiness.

AI models, which favor likely continuations at every step, tend toward uniformly structured sentences at consistent complexity levels. GPTZero considers burstiness scores below 0.30 a strong AI indicator; human writing typically scores 0.65–0.85. The intuition is sound, but the threshold is fragile: academic writing is deliberately low-burstiness. Technical documentation is deliberately low-burstiness. A writer under instruction to “be concise” produces low-burstiness text that is statistically indistinguishable from AI output.
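One way to turn "variance of perplexity across a document" into a number is the coefficient of variation of per-sentence perplexity. That choice is an assumption made for illustration: GPTZero's exact formula is not given here, so the values this sketch produces are not on the same scale as the 0.30 or 0.65–0.85 figures quoted above.

```python
import statistics


def burstiness(sentence_perplexities: list[float]) -> float:
    """Coefficient of variation (population stdev / mean) of per-sentence perplexity."""
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences")
    mean = statistics.fmean(sentence_perplexities)
    if mean <= 0:
        raise ValueError("perplexities must be positive")
    return statistics.pstdev(sentence_perplexities) / mean


# Hypothetical per-sentence perplexities.
uniform_doc = [30.0, 32.0, 31.0, 29.0]   # every sentence about equally predictable
varied_doc = [12.0, 95.0, 20.0, 150.0]   # short plain sentences mixed with unusual ones

print(burstiness(uniform_doc) < burstiness(varied_doc))  # True
```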

The Equity Problem: Who Gets Falsely Accused

The perplexity/burstiness approach has a documented, severe bias that goes beyond theoretical concern. In April 2023, Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou at Stanford University published “GPT detectors are biased against non-native English writers” (arXiv:2304.02819) in Patterns (Cell Press).

They tested seven widely used AI detectors (Originality.ai, Quill.org, Sapling, OpenAI's classifier, Crossplag, GPTZero, and ZeroGPT) against TOEFL essays by non-native English speakers and essays by U.S.-born 8th graders. The results:

  • An average of 61.22% of TOEFL essays (by non-native speakers) were misclassified as AI-generated across the seven detectors
  • 97% of TOEFL essays were flagged by at least one of the seven detectors
  • For U.S.-born 8th graders: detectors were “near-perfect” at correctly identifying human-written text

Root cause: non-native speakers use constrained, predictable vocabulary that produces low perplexity — the exact same statistical signature detectors associate with AI. The Stanford study exposed that AI detection tools are not measuring “AI-ness” — they are measuring linguistic conformity. Non-native speakers, ESL students, and writers with formal styles all get caught in the same net.

Stanford HAI published guidance on this finding. The Markup followed with documented cases of international students falsely accused of cheating at U.S. universities. Vanderbilt University subsequently became one of the first institutions to disable Turnitin AI detection entirely, citing these risks explicitly in their public statement (August 2023).

Beyond Statistics: How Transformer Classifiers Work

More sophisticated detection approaches replace simple perplexity thresholds with fine-tuned transformer classifiers. Models like RoBERTa or DistilBERT are trained on millions of labeled human/AI text pairs. The classifier learns not just surface vocabulary patterns but deep distributional differences — patterns in attention distributions, embedding space clustering, and conditional probability sequences that are characteristic of specific LLM families.

RoBERTa achieved 96.1% accuracy in competitive benchmarks (SemEval-2024 Task 8), outperforming LSTM and GRU architectures. Research published in March 2025 (arXiv:2503.01659, “Detecting Stylistic Fingerprints of Large Language Models”) demonstrated that different LLM families leave distinct fingerprints beyond generic AI-vs-human classification — identifiable by patterns in sentence-initial word distributions, punctuation frequency, subordinate clause usage, and characteristic lexical preferences.

The limitation is model-specificity. Classifiers trained on GPT-3.5 outputs degrade when Claude, Gemini, or newly released models generate text. Each major model release partially invalidates existing classifiers. The RAID benchmark (arXiv:2405.07940), the largest independent evaluation of AI detectors, comprises over 6 million generations across 11 LLMs, 8 domains, 11 adversarial attack methods, and 4 decoding strategies. Its authors found that detectors are easily fooled by adversarial attacks, changes in sampling strategy, repetition penalties, and unseen generative models.

DetectGPT: Probability Curvature Analysis

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn at Stanford University published DetectGPT at ICML 2023 (arXiv:2301.11305). It introduced a method that requires no training on labeled examples.

The key insight: text sampled from an LLM tends to occupy negative curvature regions of the model's log probability function. When you take an AI-generated passage and apply small random perturbations (rewriting a few spans with a T5 mask-filling model), the perturbed versions are consistently less probable than the original under the source model, because the original sits near a local maximum. For human text, perturbations show no systematic directional effect.

DetectGPT improved fake news detection from 0.81 AUROC (the best prior zero-shot method) to 0.95 AUROC for articles generated by the 20B-parameter GPT-NeoX, without any labeled training data. The limitation is computational: it requires running multiple forward passes through a large model for each detection call, making it expensive at scale.
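The perturbation test described above reduces to comparing one log-probability against the average of several perturbed ones. The sketch below is a simplified illustration, not the authors' code: `log_prob` and `perturb` are hypothetical callables standing in for the source language model and the perturbation model, and the normalization by the spread of the perturbed scores is one common variant.

```python
import statistics
from typing import Callable


def perturbation_discrepancy(
    text: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str], str],
    n_perturbations: int = 20,
) -> float:
    """Normalized gap between the original log-probability and its perturbed variants.

    Large positive values suggest the text sits near a local maximum of log_prob,
    the pattern associated with model-generated text.
    """
    if n_perturbations < 1:
        raise ValueError("need at least one perturbation")
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    spread = statistics.pstdev(perturbed) or 1.0  # fall back to 1.0 if all variants tie
    return (original - statistics.fmean(perturbed)) / spread
```

Each call costs one scoring pass for the original plus one per perturbation, which is where the expense noted above comes from.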

Watermarking: The More Reliable Alternative

Watermarking does not detect AI text after the fact; it embeds a signal at the point of generation. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein at the University of Maryland published the foundational paper in January 2023 (arXiv:2301.10226, presented at ICML 2023).

The mechanism:

  1. Before each token is generated, a cryptographic hash of the preceding token produces a pseudorandom seed
  2. That seed splits the full vocabulary into a “green list” (~50% of tokens) and a “red list” (~50%)
  3. A small bias (delta δ) is added to the logits of all green-list tokens, softly promoting their selection
  4. The signal is imperceptible to readers — the shift is statistically subtle
  5. Detection: count the fraction of green tokens in a passage. Watermarked text shows a statistically significant excess above 50% (z-test with interpretable p-values)
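The detection step in the list above can be sketched as a one-proportion z-test. This is a toy illustration under stated assumptions: tokens are plain strings, the key `"demo-key"` is made up, and green-list membership is decided by hashing the (previous token, token) pair directly, which stands in for seeding a random vocabulary split with the previous token.

```python
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary on the green list, as in step 2


def is_green(prev_token: str, token: str, key: str = "demo-key") -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the preceding token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode("utf-8")).digest()
    return digest[0] < 256 * GAMMA


def watermark_z_score(tokens: list[str], key: str = "demo-key") -> float:
    """One-proportion z-score for the number of green tokens in a passage."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed the hash
    if scored < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(prev, tok, key) for prev, tok in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))
```

Unwatermarked text should score near zero on average, while text generated with a green-list bias drifts upward as the passage gets longer. Anyone without the key cannot recompute the green list, which is why detection stays with the party that holds the key.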

The paper demonstrated the watermark on multi-billion parameter models with negligible text quality impact. Detection achieves statistical significance with roughly 200–300 tokens. Scott Aaronson, then at OpenAI, confirmed OpenAI was developing a similar system internally, calling it potentially “virtually undetectable to the human eye.”

Google DeepMind deployed a related generation-time watermark as SynthID Text, now active in Gemini. SynthID uses a pseudorandom g-function keyed to a secret, applied at every token generation step. It is robust to mild paraphrasing, cropping, and word substitution, but degrades significantly under thorough rewriting or translation.

The Accuracy Gap: What Benchmarks Actually Show

The RAID benchmark (Dugan et al., University of Pennsylvania, May 2024; arXiv:2405.07940) is the most rigorous independent evaluation of AI text detectors. It uses TPR@FPR=1% as its primary metric: the true positive rate (AI correctly identified) when the false positive rate is held to just 1%. Testing across 11 LLMs, 8 domains, and 11 adversarial attack methods:

| Detector | Vendor Claim | RAID / Independent Result | Notable Failure Mode |
| --- | --- | --- | --- |
| GPTZero | 99%+ | 95.7% TPR @ 1% FPR (RAID) | Formal writing, non-native English |
| Turnitin | 98% (self-reported) | ~80–84% real-world | ~4% false positive at sentence level |
| Copyleaks | 99.52% | ~79% independent testing | ~5% false positive (1 in 20 docs) |
| Originality.ai | ~90%+ | ~79% (best in one independent test) | Variable across content types |
| OpenAI Classifier | Not disclosed | 26% detection rate | ~9% false positive; shut down July 2023 |

Sources: RAID Benchmark (arXiv:2405.07940); independent testing by walterwrites.ai (2024); OpenAI public statement July 2023; Turnitin methodology documentation
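The TPR@FPR=1% metric is easier to reason about with a concrete calculation. The sketch below is a minimal version, assuming each detector outputs a score where higher means "more likely AI"; the function name and the tie-handling rule (flag only scores strictly above the threshold) are illustrative choices, not RAID's reference implementation.

```python
import math


def tpr_at_fpr(human_scores: list[float], ai_scores: list[float], max_fpr: float = 0.01) -> float:
    """True positive rate at the lowest threshold that keeps the false positive rate <= max_fpr."""
    if not human_scores or not ai_scores:
        raise ValueError("need scores for both human and AI texts")
    if not 0.0 <= max_fpr < 1.0:
        raise ValueError("max_fpr must be in [0, 1)")
    ranked = sorted(human_scores, reverse=True)
    # Number of human texts we are allowed to flag wrongly (epsilon guards float rounding).
    allowed = min(math.floor(max_fpr * len(ranked) + 1e-9), len(ranked) - 1)
    threshold = ranked[allowed]  # flag a text only when its score is strictly above this
    return sum(score > threshold for score in ai_scores) / len(ai_scores)
```

With 100 human samples, a 1% budget permits exactly one wrongful flag, so small evaluation sets make this metric coarse. That is one reason headline numbers from small vendor tests and large benchmarks are hard to compare.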

The most striking data point: in real-world independent testing, commercial tools clustered around 80% overall accuracy (Turnitin's ~80–84% was the top of the range), despite vendor claims of roughly 90% to 99.52%. GPTZero's strong RAID result comes from a formal, controlled benchmark; real-world accuracy on mixed, post-processed, or adversarially altered text is substantially lower. Understanding this gap is critical for anyone deploying these tools in consequential contexts.

For a broader analysis of how these limitations play out specifically in educational settings, see our article on how Turnitin AI detection works and its real-world accuracy for student submissions.

The Arms Race: Why Detection Always Lags Generation

The adversarial dynamic between AI generators and detectors follows a predictable escalation cycle that has now run through four distinct rounds:

Round 1 (2022): Statistical detectors. Simple perplexity thresholds. Defeated almost immediately by temperature-adjusted sampling — increasing the model's generation temperature raises perplexity without changing content quality.
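The temperature effect in Round 1 can be seen directly in the sampling distribution. This sketch uses made-up logits for a four-token vocabulary: dividing logits by a higher temperature flattens the distribution, raises its entropy, and so makes the sampled tokens less predictable on average.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical next-token logits
cold = softmax_with_temperature(logits, 0.7)
hot = softmax_with_temperature(logits, 1.5)
print(entropy_bits(hot) > entropy_bits(cold))  # True
```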

Round 2 (early 2023): Classifier-based detectors. Fine-tuned models trained on GPT-3 outputs. Defeated by paraphrasing tools (Quillbot, etc.) and prompting AI to “write more naturally.” OpenAI's own classifier could not detect ChatGPT output run through basic paraphrasing.

Round 3 (2023): Watermarking. Kirchenbauer-style token biasing. Defeated by recursive paraphrasing attacks, as demonstrated by Sadasivan et al. at the University of Maryland (arXiv:2303.11156). Repeated paraphrasing with a second LLM removes the green/red token statistical pattern while preserving semantic content. The same paper showed a spoofing attack: an attacker can infer the hidden watermark signature and frame human-written text as AI-generated.

Round 4 (2024): Adversarial humanization. arXiv:2404.01907 demonstrated that current detection models can be compromised in under 10 seconds using minor targeted word substitutions — adversarial perturbations that preserve semantic meaning while defeating classification. Paraphrasing in a single pass reduces detector accuracy by over 54 percentage points.

Behind all four rounds is a theoretical limit, formalized by Sadasivan et al. (arXiv:2303.11156): as LLMs improve, the statistical distance between human and AI text distributions approaches zero. This implies that the AUROC of any possible detector approaches 0.5 — random chance — as model quality increases. Reliable AI text detection may be theoretically impossible for sufficiently advanced models generating text in their training distribution.
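The bound from Sadasivan et al. can be written as AUROC ≤ 1/2 + TV − TV²/2, where TV is the total variation distance between the human and AI text distributions. The helper below just evaluates that expression; the function name is illustrative, and the sample TV values are arbitrary points chosen to show the trend.

```python
def auroc_upper_bound(total_variation: float) -> float:
    """Best achievable detector AUROC for a given total variation distance."""
    if not 0.0 <= total_variation <= 1.0:
        raise ValueError("total variation distance must be in [0, 1]")
    return 0.5 + total_variation - total_variation ** 2 / 2


for tv in (1.0, 0.5, 0.1, 0.0):
    print(tv, auroc_upper_bound(tv))
```

At TV = 1 the distributions are fully separable and a perfect detector is possible; at TV = 0 the bound collapses to 0.5, which is random guessing.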

What Detectors Cannot Do

Understanding the hard boundaries matters as much as understanding the methods. Current AI detectors reliably fail in these specific conditions:

  • Short text: Statistical significance requires volume. Most detectors are unreliable below 150–250 words. The OpenAI Classifier explicitly warned users it was inaccurate on short snippets.
  • Mixed/hybrid content: Human-edited AI text, or AI-assisted human text, breaks binary classification. No tool reliably handles the spectrum between fully human and fully AI.
  • New model releases: Each major LLM release (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro) partially invalidates existing classifiers. The RAID benchmark confirmed detectors are “easily fooled by unseen generative models.”
  • Multilingual content: Most detectors were trained primarily on English. Accuracy drops sharply for other languages. Cross-lingual detection is described as “still in its infancy” in recent surveys.
  • Domain shift: Detectors trained on news articles and essays fail on code, poetry, dialogue, and technical documentation.
  • Formal registers: Academic writing, legal prose, and medical documentation are low-perplexity by design — systematic false positives in precisely the contexts where detection is most often deployed.

How to Use AI Detectors Responsibly

Given these documented limitations, responsible deployment means:

For educators: Treat detector output as one data point, not a verdict. A positive signal warrants a conversation with the student — not automatic punishment. Ask students about their writing process. Look at prior work for consistency. Vanderbilt's published rationale for disabling AI detection is worth reading for anyone setting institutional policy.

For publishers and HR: Never reject a submission or candidate based solely on an AI detection score. Use detection as a first-pass filter that flags content for human review — not as a standalone decision. Keep the false positive implications of the Stanford study in mind when screening applicants whose first language is not English.

For anyone: Submit the most complete text available. Short excerpts are statistically unreliable. Run multiple tools and compare — convergence increases confidence, divergence signals ambiguity. Use EyeSift's text analysis tool alongside contextual judgment rather than as a replacement for it.

For the parallel question of how detectors work on images rather than text — frequency fingerprinting, neural classifiers, and C2PA provenance metadata — see our in-depth guide on AI image detectors.

Frequently Asked Questions

What is perplexity in AI detection?

Perplexity measures how statistically predictable a piece of text is. AI text tends to be low-perplexity because language models select high-probability tokens. Human writing is less predictable. GPTZero uses a perplexity threshold (roughly below 40) as one signal among several, but low perplexity is not unique to AI — formal, repetitive human writing produces the same signature.

Are AI detectors accurate?

On controlled benchmarks, GPTZero reaches 95.7% TPR at a 1% false positive rate (RAID Benchmark, 2024). In real-world conditions, independent testing put commercial tools at roughly 79–84% overall accuracy. OpenAI shut down its own classifier in July 2023 after it achieved only a 26% detection rate with a 9% false positive rate on human text.

Why do AI detectors falsely flag non-native English speakers?

Non-native speakers use constrained, predictable vocabulary — the same low-perplexity signature detectors associate with AI. A Stanford University study (Liang et al., 2023) found 61.22% of TOEFL essays were misclassified as AI-generated by seven leading detectors. Vanderbilt University subsequently disabled Turnitin AI detection campus-wide citing these risks.

Can paraphrasing fool AI detectors?

Yes. University of Maryland researchers (Sadasivan et al., 2023) showed recursive paraphrasing with a second LLM defeats both watermarking and statistical detectors while preserving semantic content. One study found paraphrasing reduced detector accuracy by over 54 percentage points. This is the primary bypass technique used in practice.

What is SynthID and how does watermarking work?

Google DeepMind's SynthID Text embeds an invisible signal by biasing token selection during generation using a pseudorandom function keyed to a secret. Detection compares the token sequence against the expected watermarked distribution. SynthID is deployed in Gemini and is robust to mild paraphrasing, but degrades significantly under thorough rewriting or translation to another language.

What is DetectGPT?

DetectGPT (Stanford University, Mitchell et al., ICML 2023) detects AI text by analyzing probability curvature. AI-generated text occupies negative curvature regions — random perturbations consistently decrease its probability. DetectGPT achieved 0.95 AUROC on fake news detection without requiring labeled training examples, using only forward passes through a base language model for inference.

Do AI detectors work on short text?

No — they are unreliable below 150–250 words. Statistical methods need sufficient token count for significance. The now-discontinued OpenAI Classifier explicitly warned users it was inaccurate on short snippets. Most commercial tools recommend submitting at least 250 words for meaningful results. Email-length submissions should not be assessed with these tools.

Run a Free AI Detection Analysis

EyeSift's text analysis uses multiple detection methods — statistical, transformer-based, and contextual — to give you a more complete picture than any single metric alone.
