AI Text Detection Signals 2026 — How Detectors Actually Work
Short answer: AI text detectors in 2026 use 7 primary signal categories: (1) perplexity (low = AI-like), (2) burstiness (uniform sentence length = AI-like), (3) n-gram repetition (overused AI phrases), (4) statistical watermarks (SynthID, Aaronson, Kirchenbauer schemes), (5) zero-shot likelihood probes (DetectGPT, GPTZero core), (6) supervised classifiers trained on labeled corpora, and (7) stylometric fingerprints (function-word distribution, syntactic patterns). Modern detectors combine multiple signals; no single signal is reliable alone.
The 7 detection signals — full breakdown
| Signal | What it measures | AI tendency | Defeated by |
|---|---|---|---|
| Perplexity | Average token surprise (log-likelihood from reference LLM) | Low (predictable token choices) | Paraphrasers, deliberate "human-like" prompting |
| Burstiness | Sentence-length variance / mean | Uniform medium length | Manual mixing of long + short sentences |
| N-gram repetition | Frequency of "AI tell" phrases vs corpus baseline | Over-use ("delve", "tapestry", "navigate", "in conclusion") | Custom system prompts banning specific phrases |
| Watermarks | Statistical signature embedded at sampling time (SynthID, Kirchenbauer) | Carry signature if model owner cooperates | Paraphrasing, translation, manual edits |
| Zero-shot probes | DetectGPT log-likelihood curvature | Sit on local likelihood maxima | Adversarial perturbation training |
| Supervised classifier | Neural net trained on labeled AI vs human samples | Match training distribution | New models not in training set, distribution shift |
| Stylometry | Function-word freq, syntactic complexity, POS-tag entropy | Lower variance, formal register | Style transfer, prompt engineering for "casual" tone |
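The burstiness row above is the simplest signal to make concrete. Here is a minimal sketch, assuming the common definition used in the table (sentence-length variance divided by mean); the sentence splitter and example strings are illustrative, not from any particular detector:

```python
import re

def burstiness(text: str) -> float:
    """Sentence-length variance divided by mean sentence length.
    Low values (uniform pacing) sit at the 'AI-like' end of the signal."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return variance / mean

uniform = ("This is a simple sentence. Here is another plain one. "
           "This one looks much the same.")
varied = ("Short. But sometimes a writer stretches one thought across a "
          "great many words before finally coming to rest. Then stops.")
print(burstiness(uniform) < burstiness(varied))  # True: varied pacing scores higher
```

Note the limitation flagged in the table: a human who naturally writes even-length sentences (or a formal genre that demands them) scores "AI-like" on this metric, which is one source of false positives.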
200+ documented "AI tell" phrases (n-gram repetition signal)
Words and phrases that occur 3-15× more frequently in GPT-4-class output than in matched human writing (per Zellers et al. 2024, "AI-Generated Text Detection in the Wild"). A representative sample from the documented list:
Noun phrases: tapestry, landscape (of), realm (of), ecosystem (of), paradigm (shift), nuanced perspective, multifaceted approach, intricate balance, holistic view, robust framework, transformative impact, paramount importance
Connectives: in conclusion (overuse), it is important to note that, it is worth noting, in this comprehensive guide, dive deeper, deep dive, in essence, ultimately
Hedging: while it may, however it is, on the other hand (over-frequency), nevertheless, in light of, in the realm of, when it comes to
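A crude version of this n-gram signal is easy to sketch. The mini-lexicon below is a hypothetical subset for illustration; a production detector would score hundreds of phrases against a human-corpus baseline rather than a flat list:

```python
import re

# Hypothetical mini-lexicon for illustration; a real detector scores
# hundreds of phrases against a matched human-written corpus baseline.
TELL_PHRASES = ("tapestry", "delve", "multifaceted", "realm of",
                "in conclusion", "it is important to note", "holistic")

def tell_phrase_rate(text: str) -> float:
    """Tell-phrase hits per 1,000 words (crude n-gram repetition score)."""
    lower = text.lower()
    n_words = len(re.findall(r"[a-z']+", lower)) or 1
    hits = sum(lower.count(phrase) for phrase in TELL_PHRASES)
    return 1000.0 * hits / n_words

sample = ("In conclusion, it is important to note that the rich tapestry "
          "of ideas invites us to delve into a multifaceted realm of nuance.")
print(tell_phrase_rate(sample) > 100)  # True: dense with tell phrases
```

As the table notes, this signal is trivially defeated by a system prompt that bans the listed phrases, which is why it is only ever one input among several.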
Detection accuracy by AI model class (2025-2026 benchmarks)
| Model class | Avg detection accuracy | Hardest case |
|---|---|---|
| GPT-3.5 (raw) | 96-99% | Easy — strong AI tells |
| GPT-4 / GPT-4o (raw) | 88-95% | Better stylistic variance than 3.5 |
| GPT-4 with custom system prompt | 70-85% | Prompt-tuned for "human casual" |
| GPT-4 + paraphraser pass | 55-75% | Quillbot/Undetectable disrupt n-grams |
| Claude 3.5 / 4 Sonnet | 82-90% | Higher burstiness than GPT |
| Gemini 2.5 Pro | 85-92% | Mixed multilingual output edge cases |
| Llama 3.1 / 3.3 (open source) | 80-88% | Many fine-tunes; distribution drift |
| Mixed human + AI editing | 50-70% | Span-level detection required |
Why detection isn't 100% reliable — and never will be
- Theoretical limit (Sadasivan et al. 2024): As LLMs approach human-level distribution, statistical detection approaches a fundamental Bayesian error floor. For text that is genuinely indistinguishable from human writing on token-level statistics, no detector can do better than random.
- Adversarial paraphrasing scales faster than detection: each time a new detection method is published, counter-paraphrasers are trained against it within months. Detection vs evasion is a perpetual cat-and-mouse game, with the cat lagging.
- Non-native English bias (Liang et al. 2023, Stanford): all 7 popular detectors tested flagged 19-97% of essays by non-native writers as AI-generated. This stems from training-data bias and is structural, not an easily patched bug.
- Genre flattening: highly formal genres (legal briefs, academic abstracts, medical reports) have intrinsically low perplexity and low burstiness because the GENRE itself demands predictable, uniform prose. Detectors applied to these genres produce high false-positive rates.
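Watermarking is the one signal that sidesteps the statistical floor above, because it tests for an embedded signature rather than modeling "human-ness". A minimal sketch of a Kirchenbauer-style green-list z-test follows; the hash function here is a simplified stand-in for the paper's keyed scheme, and the vocabulary and sampler are toy assumptions. It also shows why paraphrasing defeats the signal: each replaced token has only a gamma chance of landing back in the green list, dragging z toward zero:

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-random 'green list' membership seeded on the previous token.
    (Simplified stand-in for the keyed hashing in Kirchenbauer et al.)"""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < int(gamma * 256)

def watermark_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """One-proportion z-test: how far the green-token count exceeds the
    gamma baseline expected in unwatermarked text."""
    trials = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - gamma * trials) / math.sqrt(trials * gamma * (1 - gamma))

# Toy 'watermarking sampler' that always prefers a green next token:
vocab = [f"tok{i}" for i in range(50)]
tokens = ["start"]
for _ in range(100):
    tokens.append(next((t for t in vocab if is_green(tokens[-1], t)), vocab[0]))
print(watermark_z_score(tokens) > 4)  # True: strong evidence of a watermark
```

In the real schemes the green list biases sampling probabilities rather than forcing green tokens outright, and the key is secret, so only the model owner (or a cooperating verifier) can run this test.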
Recommended best practices when using detectors
- Never rely on a single detector — use 2-3 in agreement
- Use span-level (sentence) scores rather than document averages
- Be especially cautious with non-native English writers, formal genres, and translated text
- Treat 50-70% confidence as "inconclusive" rather than positive — require 85%+ for action
- Pair detection with process signals (revision history, draft snapshots, viva-voce questioning) for high-stakes decisions
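The multi-detector and threshold rules above can be reduced to a small decision function. Detector names, the 0.85 flag threshold, and the two-detector agreement rule are the illustrative values from this list, not a standard:

```python
def ensemble_verdict(scores: dict[str, float], flag_threshold: float = 0.85,
                     min_agree: int = 2) -> str:
    """Require multiple detectors to agree at a high threshold before
    flagging; treat mid-range scores as inconclusive rather than positive."""
    confident_flags = sum(s >= flag_threshold for s in scores.values())
    if confident_flags >= min_agree:
        return "likely AI"
    if all(s < 0.5 for s in scores.values()):
        return "likely human"
    return "inconclusive"

print(ensemble_verdict({"detA": 0.91, "detB": 0.88, "detC": 0.62}))  # likely AI
print(ensemble_verdict({"detA": 0.64, "detB": 0.58, "detC": 0.71}))  # inconclusive
```

Even a "likely AI" verdict from this kind of ensemble should still be paired with the process signals above before any high-stakes action.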
Related Eyesift resources
- Best AI Detectors 2026 — full comparison
- Complete Guide to AI Detection
- AI Detection Accuracy Benchmarks
- AI Detection False Positives
- Free AI Text Detector (Eyesift)
Sources: Mitchell et al. (2023) DetectGPT (NeurIPS); Kirchenbauer et al. (2023) Watermark for Large Language Models (ICML); Zellers et al. (2024) AI-Generated Text Detection in the Wild; Liang et al. (2023) GPT detectors are biased against non-native English writers (Patterns); Sadasivan et al. (2024) Can AI-Generated Text Be Reliably Detected? (TMLR); SynthID Text technical paper (Google DeepMind 2024). All listed numbers reflect published benchmark ranges; individual detector performance varies with input length, content domain, and model version.