How Do AI Detectors Work? 5 Signals, Watermarks and Limits

Reviewed by Brazora Monk · Last updated June 11, 2026

AI detectors usually score five evidence families: statistical predictability, sentence variation, classifier fingerprints, watermark or provenance clues, and human context. Those signals can help, but they also create false positives in short text, formal writing, translated drafts, and non-native English. Whether the question is “how do AI detectors work,” “how does AI detection work,” or “do AI detectors work at all,” the useful answer is narrower than most tools admit: they estimate review risk, not authorship proof.

Key Takeaways

  • AI text detectors work by measuring perplexity (statistical predictability), burstiness (sentence variance), and transformer-level linguistic fingerprints
  • Stanford University researchers (Liang et al., 2023) found that 61.22% of TOEFL essays by non-native English speakers were falsely flagged as AI, averaged across seven detectors, one of the highest documented false positive rates for this kind of tool
  • OpenAI shut down its own AI Classifier in July 2023: 26% detection rate, 9% false positive rate on human text
  • Watermarking and provenance systems such as SynthID-style text marks or C2PA-style metadata are stronger when the generator creates the signal and the verifier can read it, but they are still not universal proof of authorship
  • University of Maryland researchers argued that as LLMs improve, reliable AI detection may become theoretically impossible (arXiv:2303.11156)

Quick Answer

AI detectors do not read text the way humans do. They estimate whether the token sequence looks statistically model-generated. The main signals are perplexity (how predictable the wording is), burstiness (how much sentence complexity varies), and classifier fingerprints learned from labeled human and AI samples. Watermarks work differently: they look for an embedded generation-time signal.

The practical caveat: a detector score is evidence, not proof. Short passages, polished academic prose, non-native English, heavy editing, and unseen model families can all break the signal.

AI detector signal map

A strong AI detector should not rely on one shortcut. It should separate text statistics from provenance and human-context evidence, then lower confidence when the sample is short, informal, translated, or outside the detector's training distribution.

| Signal | What it measures | Common failure mode |
| --- | --- | --- |
| Predictability | Whether the next words look unusually expected under a language-model style distribution. | Formulaic human writing, non-native English, legal prose, technical docs, and short templates can look predictable. |
| Variation | Sentence-length spread, rhythm changes, punctuation texture, and whether the sample has natural unevenness. | One-sentence chat, bullet lists, polished essays, and edited corporate copy often have too little variation to score confidently. |
| Classifier fingerprints | Patterns learned from labeled human and AI examples, including phrasing, transitions, structure, and model-family artifacts. | New model releases, paraphrasing, translation, and topic shifts can move text outside the detector training distribution. |
| Watermark or provenance clues | Signals embedded or attached at generation time, such as SynthID-style watermarks or C2PA/content credentials. | They only help when the generator added the signal and the checker can access the matching verification method. |
| Context evidence | Draft history, sources, assignment fit, author voice, edit trail, metadata, and whether the claims can be verified. | A detector that ignores context can overreact to weak text signals and underreact to well-edited AI-assisted drafts. |

Source-Reviewed Update - June 11, 2026

The best current answer to “do AI detectors work?” is: sometimes, when the sample is long, unedited, mostly English, and close to the detector's training distribution. They become unreliable on short snippets, translated text, heavily edited drafts, non-native English, formal prose, and outputs from newer or unseen models.

OpenAI Classifier

Retired in July 2023 because of low accuracy: 26% true positive rate and 9% false positives on OpenAI's own challenge set.

Short Text

Short passages do not give statistical detectors enough tokens. OpenAI explicitly warned its classifier was very unreliable below 1,000 characters.

Stanford Bias Finding

Stanford HAI reported that detectors mislabeled more than half of TOEFL essays by non-native English writers as AI-generated.

Benchmark Robustness

RAID tested more than 6 million generations and found detectors are easy to fool with attacks, sampling changes, and unseen models.

Provenance Layer

NIST AI 600-1 treats provenance tracking, watermarking, metadata, fingerprinting, and synthetic-content detection as transparency aids that should be combined with broader accountability.

For a practical scan, use the AI text analyzer, then compare the result against our guides to perplexity and burstiness, false positives, and the best AI detectors in 2026.

How to read an AI detector result

A detector score is most useful when it changes the review workflow, not when it becomes the verdict. A short informal human message can still produce a mixed score because there may be almost no sentence-length variation, little punctuation, sparse context, and too few words for statistical confidence.

Low score

Usually means the sample has more human-like variation or too little AI evidence.

Still review sources, citations, and context when the decision matters.

Mixed / uncertain

Means the signals are weak, balanced, or unreliable for the sample type.

Do not accuse or approve automatically. Ask for drafts, revision history, sources, and a longer sample.

High score

Means the text matches patterns the detector associates with AI-generated writing.

Use it as a review trigger, then check human evidence before any policy or publishing decision.
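The three bands above can be read as a small triage rule. The sketch below is only an illustration: the `low` and `high` cutoffs are hypothetical values chosen for the example, and the 250-word minimum borrows the short-text guidance from the FAQ. It is not how any particular detector, including EyeSift, decides.

```python
def review_action(score: float, word_count: int,
                  low: float = 0.3, high: float = 0.7,
                  min_words: int = 250) -> str:
    """Map a detector score in [0, 1] to a review step, never to a verdict.

    The thresholds are illustrative assumptions, not published values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if word_count < min_words:
        # Short samples are statistically unreliable whatever the score says.
        return "uncertain"
    if score >= high:
        return "review"      # trigger a human check of drafts and sources
    if score <= low:
        return "low"         # still verify context when the decision matters
    return "uncertain"       # ask for drafts, history, and a longer sample


print(review_action(0.92, 40))   # uncertain
print(review_action(0.92, 900))  # review
print(review_action(0.10, 900))  # low
```

Note that a high score on a short sample still maps to "uncertain": sample length gates the score, not the other way around.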

To test a sample, use the AI text analyzer. For implementation details, cite the EyeSift methodology and this technical guide together; for backend endpoint examples, use the Originality.ai scan/ai API guide, and for paid-tool buying decisions use the AI detector pricing 2026 guide.

Let us start by debunking the most common misconception about AI detectors: that they work by recognizing specific phrases or stylistic tics that AI models always produce. They do not. Modern AI detection is fundamentally a statistical inference problem, and like all statistical inference, it involves uncertainty, error rates, and failure modes that most users never see. That is why the question “how does AI detection work?” needs to be answered together with the harder question: when does it stop working?

The stakes of getting this wrong are substantial. Vanderbilt University calculated that Turnitin's claimed 1% false positive rate, applied to their real submission volume of roughly 75,000 papers per year, would wrongly accuse approximately 750 students of AI misuse annually. They disabled the feature campus-wide in August 2023. Multiple UC campuses followed. Understanding the mechanics behind these systems is no longer optional for educators, publishers, and HR professionals deploying them at scale.
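The Vanderbilt arithmetic is easy to reproduce. A minimal sketch, using only the two figures quoted above (about 75,000 papers per year and a claimed 1% false positive rate); the function name is illustrative, and the calculation treats every submission as human-written, which gives an upper-end estimate.

```python
def expected_false_flags(submissions: int, false_positive_rate: float) -> float:
    """Expected number of human-written papers wrongly flagged as AI."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return submissions * false_positive_rate


# Figures from the text: roughly 75,000 papers per year at a 1% rate.
print(round(expected_false_flags(75_000, 0.01)))  # 750
```

Even a false positive rate that sounds small turns into hundreds of cases once it is multiplied by institutional volume.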

The Foundational Signal: Perplexity

Perplexity is the mathematical backbone of most AI text detectors. Technically, it measures how “surprised” a language model is by a sequence of tokens. The formula derives from cross-entropy: perplexity = 2^H(p), where H(p) is the per-token cross-entropy of the text under the model's probability distribution, measured in bits (base-2 logarithms).

When a large language model generates text, it selects tokens with high probability given the preceding context — words the model considers statistically expected. This produces low-perplexity output. Human writing, by contrast, makes unexpected lexical choices, employs idiomatic expressions, introduces unconventional framings, and deliberately varies phrasing. That produces higher perplexity.
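To make the formula concrete, here is a minimal sketch that computes perplexity from the probabilities a model assigned to each observed token. The probability lists are made-up illustrations, not output from a real model; in practice they would come from a scoring language model.

```python
import math


def perplexity(token_probs):
    """Perplexity = 2 ** H, where H is the average negative log2-probability
    (cross-entropy in bits per token) of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    cross_entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
    return 2 ** cross_entropy_bits


# Hypothetical per-token probabilities (assumed for illustration only).
predictable = [0.5, 0.5, 0.5, 0.5]      # every token was fairly expected
surprising = [0.5, 0.125, 0.5, 0.125]   # two tokens were unexpected

print(perplexity(predictable))  # 2.0
print(perplexity(surprising))   # 4.0
```

The second sequence scores higher only because two of its tokens were less expected, which is the whole signal a perplexity-based detector relies on.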

GPTZero uses a perplexity threshold — scores below roughly 40 are treated as a strong AI signal, when combined with other features. But the signal is imperfect in a specific way: low perplexity is not unique to AI. Any writing that uses constrained, predictable vocabulary produces low perplexity — including formal academic prose, legal writing, and non-native English. The Declaration of Independence notoriously scores as “likely AI-generated” on most perplexity-based detectors.

The Second Signal: Burstiness

Burstiness measures the variance of perplexity across a document — how much sentence-level complexity fluctuates. Human writers naturally oscillate between short, punchy sentences and long, complex constructions. That oscillation is burstiness.

AI models, which favor high-probability continuations, tend toward uniformly structured sentences at consistent complexity levels. GPTZero considers burstiness scores below 0.30 a strong AI indicator; human writing typically scores 0.65–0.85. The intuition is sound, but the threshold is fragile: academic writing and technical documentation are deliberately low-burstiness, and a writer under instruction to “be concise” produces low-burstiness text that is statistically indistinguishable from AI output.
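One way to turn "variance of perplexity across a document" into a number is the coefficient of variation of per-sentence perplexity. This is a sketch under that assumption: the normalization is a choice made for the example, the input values are invented, and the result is not on the same scale as the GPTZero figures quoted above.

```python
import statistics


def burstiness(sentence_perplexities):
    """Coefficient of variation (population std / mean) of sentence perplexity.

    Assumed normalization for illustration; real detectors may differ.
    """
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences to measure variation")
    mean = statistics.fmean(sentence_perplexities)
    return statistics.pstdev(sentence_perplexities) / mean


# Hypothetical per-sentence perplexities.
uniform = [20.0, 21.0, 19.0, 20.0]   # steady complexity
varied = [8.0, 45.0, 12.0, 60.0]     # short and long, plain and dense

print(burstiness(uniform) < burstiness(varied))  # True
```

A one-sentence sample raises an error here on purpose: with a single sentence there is no variation to measure, which mirrors why detectors struggle on short text.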

The Equity Problem: Who Gets Falsely Accused

The perplexity/burstiness approach has a documented, severe bias that goes beyond theoretical concern. In April 2023, Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou at Stanford University posted “GPT detectors are biased against non-native English writers” to arXiv (arXiv:2304.02819); the paper was later published in Patterns (Cell Press).

They tested seven leading AI detectors — including GPTZero, ZeroGPT, Writer, Crossplag, CopyLeaks, Sapling, and Originality.ai — against TOEFL essays by non-native English speakers and essays by U.S.-born 8th graders. The results:

  • 61.22% of TOEFL essays (by non-native speakers) were misclassified as AI-generated
  • 97% of TOEFL essays were flagged by at least one of the seven detectors
  • For U.S.-born 8th graders: detectors were “near-perfect” at correctly identifying human-written text

Root cause: non-native speakers use constrained, predictable vocabulary that produces low perplexity — the exact same statistical signature detectors associate with AI. The Stanford study exposed that AI detection tools are not measuring “AI-ness” — they are measuring linguistic conformity. Non-native speakers, ESL students, and writers with formal styles all get caught in the same net.

Stanford HAI published guidance on this finding. The Markup followed with documented cases of international students falsely accused of cheating at U.S. universities. Vanderbilt University subsequently became one of the first institutions to disable Turnitin AI detection entirely, citing these risks explicitly in their public statement (August 2023).

Beyond Statistics: How Transformer Classifiers Work

More sophisticated detection approaches replace simple perplexity thresholds with fine-tuned transformer classifiers. Models like RoBERTa or DistilBERT are trained on millions of labeled human/AI text pairs. The classifier learns not just surface vocabulary patterns but deep distributional differences — patterns in attention distributions, embedding space clustering, and conditional probability sequences that are characteristic of specific LLM families.

RoBERTa achieved 96.1% accuracy in competitive benchmarks (SemEval-2024 Task 8), outperforming LSTM and GRU architectures. Research published in March 2025 (arXiv:2503.01659, “Detecting Stylistic Fingerprints of Large Language Models”) demonstrated that different LLM families leave distinct fingerprints beyond generic AI-vs-human classification — identifiable by patterns in sentence-initial word distributions, punctuation frequency, subordinate clause usage, and characteristic lexical preferences.

The limitation is model-specificity. Classifiers trained on GPT-3.5 outputs degrade when Claude, Gemini, or newly released models generate text. Each major model release partially invalidates existing classifiers. The RAID benchmark (arXiv:2405.07940), the largest independent evaluation of AI detectors, comprises over 6 million generations across 11 LLMs, 8 domains, and 11 adversarial attack methods. It found that detectors are “easily fooled by adversarial attacks and unseen generative models.”

DetectGPT: Probability Curvature Analysis

Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn at Stanford University published DetectGPT at ICML 2023 (arXiv:2301.11305). It introduced a method that requires no training on labeled examples.

The key insight: text sampled from an LLM tends to occupy negative curvature regions of the model's log probability function. When you take an AI-generated passage and apply small random perturbations (the paper rewrites short spans with a T5 mask-filling model), the perturbed versions are consistently less probable than the original under the source model, because the original sits near a local maximum. For human text, perturbations show no systematic directional effect.
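The core quantity is the gap between the log probability of the original text and the average log probability of its perturbed copies. The sketch below shows that calculation only. `toy_log_prob` and `toy_perturb` are hypothetical stand-ins invented for this example; a real implementation would score text with an actual language model and perturb it with a mask-filling model.

```python
import math
import random
import statistics


def perturbation_discrepancy(text, log_prob, perturb, n_perturbations=20, seed=0):
    """log p(original) minus the mean log p of perturbed copies.

    Clearly positive values suggest the original sits near a local maximum.
    """
    rng = random.Random(seed)
    original = log_prob(text)
    perturbed = [log_prob(perturb(text, rng)) for _ in range(n_perturbations)]
    return original - statistics.fmean(perturbed)


# --- Toy stand-ins (assumptions for illustration, not a real model) ---
COMMON = {"the", "of", "and", "is", "in"}
POOL = ["the", "of", "and", "is", "in", "zebra", "quartz", "fjord", "lynx", "onyx"]


def toy_log_prob(text):
    # Pretend the "model" strongly prefers a handful of common words.
    return sum(math.log(0.2 if w in COMMON else 0.01) for w in text.split())


def toy_perturb(text, rng):
    # Swap one randomly chosen word for a random word from the pool.
    words = text.split()
    words[rng.randrange(len(words))] = rng.choice(POOL)
    return " ".join(words)


model_like = "the of and is in the of and"
print(perturbation_discrepancy(model_like, toy_log_prob, toy_perturb) >= 0)  # True
```

Because the toy text already uses only the words the toy model prefers, no swap can raise its score, so the discrepancy cannot be negative. Each call scores the text `n_perturbations + 1` times, which is the source of the computational cost.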

DetectGPT improved fake news detection from 0.81 AUROC (the best prior zero-shot method) to 0.95 AUROC without any labeled training data. The limitation is computational: it requires running multiple forward passes through a large model for each detection call, making it expensive at scale.

Watermarking: The More Reliable Alternative

Watermarking does not detect AI text after the fact; it embeds a signal at the point of generation. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein at the University of Maryland published the foundational paper in January 2023 (arXiv:2301.10226, presented at ICML 2023).

The mechanism:

  1. Before each token is generated, a cryptographic hash of the preceding token produces a pseudorandom seed
  2. That seed splits the full vocabulary into a “green list” (~50% of tokens) and a “red list” (~50%)
  3. A small bias (delta δ) is added to the logits of all green-list tokens, softly promoting their selection
  4. The signal is imperceptible to readers — the shift is statistically subtle
  5. Detection: count the fraction of green tokens in a passage. Watermarked text shows a statistically significant excess above 50% (z-test with interpretable p-values)
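The detection step (step 5) can be sketched as a one-proportion z-test. This is a simplified illustration, not the paper's code. It hashes each (previous token, token) pair directly instead of seeding a generator that partitions a real vocabulary, it uses no secret key, and the toy generator always picks a green token rather than applying a soft bias δ. All function names are hypothetical.

```python
import hashlib
import math
import string


def is_green(prev_token, token, green_fraction=0.5):
    """Pseudorandomly place `token` on the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    # Map the first 8 bytes to [0, 1) and compare with the green-list share.
    return int.from_bytes(digest[:8], "big") / 2**64 < green_fraction


def watermark_z_score(tokens, green_fraction=0.5):
    """z-score of the green-token count against the no-watermark expectation."""
    n = len(tokens) - 1  # the first token has no predecessor
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(p, t, green_fraction) for p, t in zip(tokens, tokens[1:]))
    expected = green_fraction * n
    std = math.sqrt(n * green_fraction * (1.0 - green_fraction))
    return (green - expected) / std


def toy_watermarked_tokens(vocab, length, start):
    """Toy generator that always picks a green token when one exists."""
    tokens = [start]
    while len(tokens) < length:
        greens = [w for w in vocab if is_green(tokens[-1], w)]
        tokens.append(greens[0] if greens else vocab[0])
    return tokens


vocab = list(string.ascii_lowercase)
marked = toy_watermarked_tokens(vocab, 101, "a")
print(round(watermark_z_score(marked), 2))
```

The z-score grows with the square root of the token count when the green fraction stays elevated, which is why longer passages are easier to verify than short ones.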

The paper demonstrated the watermark on multi-billion parameter models with negligible text quality impact. Detection achieves statistical significance with roughly 200–300 tokens. Scott Aaronson, then at OpenAI, confirmed OpenAI was developing a similar system internally, calling it potentially “virtually undetectable to the human eye.”

Google DeepMind documents this approach as SynthID Text, a logits-processor technique that applies a pseudorandom g-function during generation. Google says the watermark can survive some transformations such as cropping, limited word changes, and mild paraphrasing, but it has limits and should be combined with other provenance and review signals.

The Accuracy Gap: What Benchmarks Actually Show

The RAID benchmark (Dugan et al., University of Pennsylvania, May 2024, arXiv:2405.07940) is the most rigorous independent evaluation of AI text detectors. Its primary metric is TPR@FPR=1%: the true positive rate (AI correctly identified) when the false positive rate is held to just 1%. Testing across 11 LLMs, 8 domains, and 11 adversarial attack methods:

| Detector | Vendor Claim | RAID / Independent Result | Notable Failure Mode |
| --- | --- | --- | --- |
| GPTZero | 99%+ | 95.7% TPR @ 1% FPR (RAID) | Formal writing, non-native English |
| Turnitin | 98% (self-reported) | ~80–84% real-world | ~4% false positive at sentence level |
| Copyleaks | 99.52% | ~79% independent testing | ~5% false positive (1 in 20 docs) |
| Originality.ai | ~90%+ | ~79% (best in one independent test) | Variable across content types |
| OpenAI Classifier | Not disclosed | 26% detection rate | ~9% false positive; shut down July 2023 |

Sources: RAID Benchmark; independent testing by walterwrites.ai (2024); OpenAI public statement; Turnitin methodology documentation

The most striking data point: in the independent tests summarized above, commercial tools clustered at roughly 79–84% accuracy, well below vendor claims of roughly 90% to 99.52%. GPTZero's strong RAID result comes from a formal, controlled benchmark; real-world accuracy on mixed, post-processed, or adversarially altered text is substantially lower. Understanding this gap is critical for anyone deploying these tools in consequential contexts.
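A metric like TPR@FPR=1% can be sketched in a few lines: choose the strictest threshold that flags at most 1% of known human samples, then measure how much AI text that threshold still catches. The scores below are invented for illustration, and this is a generic formulation rather than the RAID benchmark's own evaluation code.

```python
import math


def tpr_at_fpr(human_scores, ai_scores, max_fpr=0.01):
    """Return (TPR, threshold) at the strictest threshold keeping FPR <= max_fpr.

    Scores mean "more AI-like when higher"; a sample is flagged when its
    score is strictly greater than the threshold.
    """
    if not human_scores or not ai_scores:
        raise ValueError("need scores for both classes")
    ranked = sorted(human_scores, reverse=True)
    # Number of human samples we are allowed to flag by mistake.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    tpr = sum(s > threshold for s in ai_scores) / len(ai_scores)
    return tpr, threshold


# Hypothetical detector scores on labeled samples.
human = [float(i) for i in range(100)]   # 0.0 .. 99.0
ai = [50.0, 98.0, 99.0, 100.0]

print(tpr_at_fpr(human, ai))  # (0.5, 98.0)
```

Here only one of the 100 human samples scores above 98, so the false positive rate is 1%, and half of the AI samples clear the same bar. Fixing the false positive rate first is what makes results comparable across detectors.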

For a broader analysis of how these limitations play out specifically in educational settings, see our article on how Turnitin AI detection works and its real-world accuracy for student submissions.

The Arms Race: Why Detection Always Lags Generation

The adversarial dynamic between AI generators and detectors follows a predictable escalation cycle that has now run through four distinct rounds:

Round 1 (2022): Statistical detectors. Simple perplexity thresholds. Defeated almost immediately by temperature-adjusted sampling — increasing the model's generation temperature raises perplexity without changing content quality.

Round 2 (early 2023): Classifier-based detectors. Fine-tuned models trained on GPT-3 outputs. Defeated by paraphrasing tools (QuillBot and similar) and by prompts asking the AI to “write more naturally.” OpenAI's own classifier could not detect ChatGPT output run through basic paraphrasing.

Round 3 (2023): Watermarking. Kirchenbauer-style token biasing. Defeated by recursive paraphrasing attacks, as demonstrated by Sadasivan et al. at the University of Maryland (arXiv:2303.11156). Repeated paraphrasing with a second LLM removes the green/red token statistical pattern while preserving semantic content. The same paper showed a spoofing attack: an attacker can infer the hidden watermark signature and frame human-written text as AI-generated.

Round 4 (2024): Adversarial humanization. arXiv:2404.01907 demonstrated that current detection models can be compromised in under 10 seconds using minor targeted word substitutions — adversarial perturbations that preserve semantic meaning while defeating classification. Paraphrasing in a single pass reduces detector accuracy by over 54 percentage points.
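The Round 1 temperature trick is easy to see numerically. The sketch below applies temperature to a made-up set of next-token logits and reports the entropy of the resulting distribution, with 2 raised to that entropy as a rough proxy for per-token perplexity. The logits are assumed values, and a real detector scores text under its own model rather than the generator's.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_bits(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token logits
for t in (0.7, 1.0, 1.5):
    h = entropy_bits(softmax(logits, t))
    print(f"temperature={t}: entropy={h:.3f} bits, 2**entropy={2 ** h:.3f}")
```

Higher temperature flattens the distribution, so sampled tokens become less predictable and a fixed perplexity threshold stops separating the two classes.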

Behind all four rounds is a theoretical limit, formalized by Sadasivan et al. (arXiv:2303.11156): as LLMs improve, the statistical distance between human and AI text distributions approaches zero. This implies that the AUROC of any possible detector approaches 0.5 — random chance — as model quality increases. Reliable AI text detection may be theoretically impossible for sufficiently advanced models generating text in their training distribution.
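The bound reported in that paper ties the best achievable AUROC to the total variation distance TV between the human and AI text distributions: AUROC ≤ 1/2 + TV − TV²/2. A minimal sketch that evaluates it follows; the function name is illustrative, and the TV values are arbitrary inputs, not measurements.

```python
def auroc_upper_bound(tv: float) -> float:
    """Upper bound on any detector's AUROC given total variation distance `tv`."""
    if not 0.0 <= tv <= 1.0:
        raise ValueError("total variation distance must be between 0 and 1")
    return 0.5 + tv - tv * tv / 2.0


for tv in (1.0, 0.5, 0.1, 0.0):
    print(tv, round(auroc_upper_bound(tv), 3))
# 1.0 1.0
# 0.5 0.875
# 0.1 0.595
# 0.0 0.5
```

When the two distributions are fully separable (TV = 1) a perfect detector is possible; as TV shrinks toward 0, even the best possible detector is bounded near coin-flip performance.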

What Detectors Cannot Do

Understanding the hard boundaries matters as much as understanding the methods. Current AI detectors reliably fail in these specific conditions:

  • Short text: Statistical significance requires volume. Most detectors are unreliable below 150–250 words. The OpenAI Classifier explicitly warned users it was inaccurate on short snippets.
  • Mixed/hybrid content: Human-edited AI text, or AI-assisted human text, breaks binary classification. No tool reliably handles the spectrum between fully human and fully AI.
  • New model releases: Each major LLM release (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro) partially invalidates existing classifiers. The RAID benchmark confirmed detectors are “easily fooled by unseen generative models.”
  • Multilingual content: Most detectors were trained primarily on English. Accuracy drops sharply for other languages. Cross-lingual detection is described as “still in its infancy” in recent surveys.
  • Domain shift: Detectors trained on news articles and essays fail on code, poetry, dialogue, and technical documentation.
  • Formal registers: Academic writing, legal prose, and medical documentation are low-perplexity by design — systematic false positives in precisely the contexts where detection is most often deployed.

How to Use AI Detectors Responsibly

Given these documented limitations, responsible deployment means:

For educators: Treat detector output as one data point, not a verdict. A positive signal warrants a conversation with the student — not automatic punishment. Ask students about their writing process. Look at prior work for consistency. Vanderbilt's published rationale for disabling AI detection is worth reading for anyone setting institutional policy.

For publishers and HR: Never reject a submission or candidate based solely on an AI detection score. Use detection as a first-pass filter that flags content for human review — not as a standalone decision. Keep the false positive implications of the Stanford study in mind when screening applicants whose first language is not English.

For anyone: Submit the most complete text available. Short excerpts are statistically unreliable. Run multiple tools and compare — convergence increases confidence, divergence signals ambiguity. Use EyeSift's text analysis tool alongside contextual judgment rather than as a replacement for it.

For the parallel question of how detectors work on images rather than text — frequency fingerprinting, neural classifiers, C2PA provenance metadata, and watermark signals — see our in-depth guides on AI image detectors, C2PA deepfake detection, and C2PA adoption status in 2026.

Frequently Asked Questions

What is perplexity in AI detection?

Perplexity measures how statistically predictable a piece of text is. AI text tends to be low-perplexity because language models select high-probability tokens. Human writing is less predictable. GPTZero uses a perplexity threshold (roughly below 40) as one signal among several, but low perplexity is not unique to AI — formal, repetitive human writing produces the same signature.

Are AI detectors accurate?

On controlled benchmarks, GPTZero reaches 95.7% TPR at a 1% false positive rate (RAID Benchmark, 2024). In real-world conditions, independent testing put commercial tools at roughly 79–84% overall accuracy. OpenAI shut down its own classifier in July 2023 after it achieved only a 26% detection rate with a 9% false positive rate on human text.

Why do AI detectors falsely flag non-native English speakers?

Non-native speakers use constrained, predictable vocabulary — the same low-perplexity signature detectors associate with AI. A Stanford University study (Liang et al., 2023) found 61.22% of TOEFL essays were misclassified as AI-generated by seven leading detectors. Vanderbilt University subsequently disabled Turnitin AI detection campus-wide citing these risks.

Can paraphrasing fool AI detectors?

Yes. University of Maryland researchers (Sadasivan et al., 2023) showed recursive paraphrasing with a second LLM defeats both watermarking and statistical detectors while preserving semantic content. One study found paraphrasing reduced detector accuracy by over 54 percentage points. This is the primary bypass technique used in practice.

What is SynthID and how does watermarking work?

Google DeepMind's SynthID Text embeds an invisible signal by biasing token selection during generation using a pseudorandom function keyed to a secret. Detection compares the token sequence against the expected watermarked distribution. Google describes SynthID Text as robust to some transformations such as cropping, limited word changes, and mild paraphrasing, but it has limits and is not a universal authorship detector.

What is DetectGPT?

DetectGPT (Stanford University, Mitchell et al., ICML 2023) detects AI text by analyzing probability curvature. AI-generated text occupies negative curvature regions — random perturbations consistently decrease its probability. DetectGPT achieved 0.95 AUROC on fake news detection without requiring labeled training examples, using only forward passes through a base language model for inference.

Do AI detectors work on short text?

No — they are unreliable below 150–250 words. Statistical methods need sufficient token count for significance. The now-discontinued OpenAI Classifier explicitly warned users it was inaccurate on short snippets. Most commercial tools recommend submitting at least 250 words for meaningful results. Email-length submissions should not be assessed with these tools.

Run a Free AI Detection Analysis

EyeSift's text analysis uses multiple detection methods — statistical, transformer-based, and contextual — to give you a more complete picture than any single metric alone.
