EyeSift

Perplexity & Burstiness — How AI Detectors Score Text (2026 Deep Dive)

The two core metrics most AI detectors rely on: perplexity (how predictable each word is) and burstiness (how much perplexity varies across the document). AI text scores low on both. Benchmarked across 13 text types: human writing, GPT-4o, Claude 3.7, Gemini 2.0, DeepSeek-V3, plus edited and humanized variants.

Updated April 2026 · GPT-2 perplexity reference + EyeSift internal corpus

Perplexity & burstiness across 13 text types

| Source | Perplexity | Burstiness | Sentence length σ (words) | AI likelihood |
|---|---|---|---|---|
| Human academic essay (avg) | 73.4 | 41.2 | 8.6 | 4% |
| Human blog post (avg) | 86.1 | 56.8 | 11.4 | 6% |
| Human news article (avg) | 68.2 | 38.5 | 7.9 | 8% |
| Human personal narrative | 92.7 | 64.3 | 13.2 | 3% |
| GPT-4o (essay prompt) | 28.4 | 14.6 | 3.2 | 91% |
| GPT-4o (blog prompt) | 32.1 | 18.3 | 3.8 | 87% |
| Claude 3.7 Sonnet (essay) | 31.8 | 16.4 | 3.5 | 89% |
| Claude 3.7 Sonnet (creative) | 38.5 | 21.7 | 4.9 | 81% |
| Gemini 2.0 Pro (essay) | 30.2 | 15.8 | 3.4 | 88% |
| DeepSeek-V3 (essay) | 33.5 | 17.9 | 3.7 | 84% |
| GPT-4o + light edit (15%) | 41.8 | 26.4 | 5.1 | 64% |
| GPT-4o + heavy edit (40%) | 58.6 | 38.1 | 7.2 | 32% |
| GPT-4o + humanizer tool | 64.3 | 42.8 | 8.4 | 22% |

Higher perplexity plus higher burstiness reads as more human-like; lower on both reads as more AI-like. AI likelihood is the average AI-probability output across GPTZero, Originality.ai, Copyleaks, and Winston.

FAQ

What is perplexity in AI text detection?

Perplexity measures how predictable the next word is given the preceding context. Mathematically, perplexity = exp(cross-entropy) of the text against a reference language model (typically GPT-2 or similar). Lower perplexity means more predictable text and a higher likelihood of AI generation; higher perplexity means more varied, surprising word choices and a higher likelihood of human authorship.

2026 benchmarks (GPT-2 reference): GPT-4o output ~28-32; Claude 3.7 ~32-39; Gemini 2.0 ~30-35; DeepSeek-V3 ~33-38; human academic writing ~70-95; human casual blogging ~80-110; human creative narrative ~95-130.

Why AI is predictable: large language models maximize likelihood, favoring the most probable next word, which produces statistically smoother text than human prose with its idiosyncrasies, errors, idioms, and personal voice. Note that very low perplexity also occurs in highly formulaic human writing (legal contracts, scientific abstracts, recipes), a real false-positive risk. Critically, perplexity alone is not enough: modern detectors combine it with burstiness, stylometric features, and embedding-based analysis.
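A minimal sketch of the computation under the definition above, using the Hugging Face transformers library and the GPT-2 reference model; the `perplexity` helper is ours, not any detector's API:

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """exp(cross-entropy) of `text` under the GPT-2 reference model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy
        # loss over all predicted tokens.
        out = model(enc.input_ids, labels=enc.input_ids)
    return math.exp(out.loss.item())

print(perplexity("The cat sat on the mat."))  # formulaic text scores low
```

Formulaic strings score low; idiosyncratic, detail-heavy prose scores high, which is exactly the signal detectors threshold on.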

What is burstiness and why does it matter?

Burstiness is the variance of perplexity (or of sentence-level statistics) across a document. High burstiness, a mix of predictable and surprising sentences, is typical of humans; low burstiness, uniformly predictable sentences, is typical of AI. It is computed two ways:

1. Perplexity burstiness: the standard deviation of per-sentence perplexity.
2. Length burstiness: the standard deviation of sentence length in words.

2026 benchmarks: GPT-4o ~14-18 (very low); Claude 3.7 ~16-22; Gemini 2.0 ~15-20; human academic ~38-44; human blog ~52-62; human narrative ~60-70.

Why AI has low burstiness: language models are trained to maximize average likelihood over their training data, so they emit smoothed, uniform sentences without the natural variance humans show (a punchy 4-word sentence followed by a 35-word complex one).

How humanizers raise AI burstiness after the fact: tools like UndetectableAI, StealthGPT, and BypassGPT deliberately inject sentence-length variance, occasional fragments, and perplexity spikes via word substitution. This works: raising burstiness from ~16 to ~40+ passes most detectors, but it introduces unnatural-sounding artifacts on close reading.
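A sketch of both measures under the two definitions above. It assumes a `perplexity(sentence)` scorer like the GPT-2 sketch in the previous answer; the regex sentence splitter is naive and for illustration only:

```python
import re
import statistics

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break on terminal punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def length_burstiness(text: str) -> float:
    """Std-dev of sentence length in words (definition 2)."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    return statistics.pstdev(lengths)

def perplexity_burstiness(text: str, perplexity) -> float:
    """Std-dev of per-sentence perplexity (definition 1), given a scorer
    such as the GPT-2 `perplexity` function sketched earlier."""
    scores = [perplexity(s) for s in split_sentences(text)]
    return statistics.pstdev(scores)
```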

How do GPTZero, Originality.ai, and Copyleaks score perplexity differently?

Per public technical disclosures and reverse engineering as of 2026:

- GPTZero: GPT-2-based perplexity plus burstiness plus a classifier fine-tuned on a human/AI corpus. Outputs sentence-level highlighting, is the most transparent about its methodology, and flags text as AI at roughly 50% probability. Best for educators because of its explainability.
- Originality.ai: a proprietary multi-model ensemble combining perplexity, a RoBERTa classifier, n-gram analysis, and stylometric features. The exact feature set is not public, but it has published high accuracy across benchmarks and flags at ~50%. Being closed-source makes it harder to evade with adversarial techniques.
- Copyleaks: BERT-based deep learning plus an "AI Source" detector and plagiarism cross-referencing. Its multi-language strength comes from BERT's multilingual pretraining; it leans less on perplexity and more on transformer classification.
- Winston AI: perplexity, burstiness, entropy, and Markov-chain analysis (transition probabilities between word pairs; see the sketch below). Lighter compute requirements than competitors.

Per a March 2026 Stanford benchmark: Originality.ai had the highest accuracy on raw GPT-4o output (96.7%), GPTZero the best sentence-level explanations, and Copyleaks the best multilingual coverage.
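For the word-pair transition signal attributed to Winston AI above, here is a generic illustration of the idea using add-one-smoothed bigram probabilities. It is not Winston's actual implementation, and the function names are ours:

```python
import math
from collections import Counter

def train_counts(corpus: str) -> tuple[Counter, Counter]:
    """Bigram and unigram counts from a reference corpus."""
    words = corpus.lower().split()
    return Counter(zip(words, words[1:])), Counter(words)

def transition_logprob(text: str, bigrams: Counter, unigrams: Counter) -> float:
    """Mean log P(w_i | w_{i-1}) with add-one smoothing. A higher (less
    negative) score means the word-pair transitions are more predictable
    under the corpus, i.e. more AI-like under this signal."""
    words = text.lower().split()
    vocab = max(len(unigrams), 1)
    logps = [
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(words, words[1:])
    ]
    return sum(logps) / len(logps) if logps else 0.0
```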

Can I improve a low perplexity score in my writing?

Yes, and intentionally varying perplexity also improves writing quality, independent of AI-detection concerns. Techniques that raise perplexity:

1. Use less common synonyms: "demonstrate" instead of "show", "facilitate" instead of "help" (when contextually appropriate, not when forced).
2. Vary sentence structure: alternate simple, compound, and complex sentences.
3. Include specific details: "the 1973 Buick Riviera" instead of "an old car".
4. Use personal anecdotes: first-person experiences add unpredictable references.
5. Allow deliberate "imperfections": start a sentence with And/But/So, use occasional fragments, and use contractions ("don't", not "do not").
6. Vary opening words: avoid three or more consecutive sentences starting with the same word (the sketch after this list flags these runs).
7. Mix register: formal and casual within the same paragraph.
8. Use domain-specific idioms: every field has insider language ("ship it", "patch it in", "crank the volume").

Do not: make random word substitutions that hurt clarity, deliberately misspell, or force "complex" vocabulary you wouldn't naturally use. Quality writers naturally score higher on perplexity because they make intentional choices. Copy their patterns; don't game the metric.
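A quick self-audit sketch covering techniques (2) and (6): it reports your sentence-length spread against the benchmark ranges in the table above and flags runs of three or more sentences opening with the same word. The helper name and regex splitter are ours:

```python
import re
import statistics

def audit(text: str) -> None:
    sents = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    lengths = [len(s.split()) for s in sents]
    sigma = statistics.pstdev(lengths) if len(lengths) > 1 else 0.0
    # Reference ranges from the benchmark table: human ~7.9-13.2, raw AI ~3.2-4.9.
    print(f"sentence length σ: {sigma:.1f} (human ≈ 8-13, raw AI ≈ 3-5)")
    openers = [s.split()[0].lower() for s in sents]
    run = 1
    for i in range(1, len(openers)):
        run = run + 1 if openers[i] == openers[i - 1] else 1
        if run >= 3:
            print(f"3+ sentences in a row open with {openers[i]!r} (sentence {i + 1})")
```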

How do humanizer tools change perplexity and burstiness?

Humanizer tools (UndetectableAI, StealthGPT, BypassGPT, HIX Bypass, GPTinf, AISEO) use several techniques to fool detection metrics:

1. Word substitution via thesaurus: raises perplexity by inserting less-frequent words. Risk: awkward "thesaurus-vocab" prose.
2. Syntactic reshuffling: reorders clauses to break perplexity patterns. Risk: grammatical errors.
3. Anecdote insertion: adds fake personal stories. Risk: fact-check problems.
4. Fragment injection: adds occasional 3-5-word sentences (see the toy demo after this list). Risk: feels stylistically forced.
5. Idiom injection: inserts idioms at random. Risk: tonal mismatch.

Empirical results from our 2,400-sample 2026 benchmark: raw GPT-4o (perplexity 28, burstiness 16) was flagged as AI 91% of the time; GPT-4o + UndetectableAI (perplexity 64, burstiness 42) only 22%, passing most detectors. In short, humanizers raise both metrics by 100-200%, which fools threshold-based detectors.

The 2026 arms race: detectors increasingly add embedding-based classifiers and watermark detection that humanizers cannot bypass through metric manipulation alone. New 2026 detection signals include token-frequency anomalies (humanizers favor specific synonyms), drops in inter-sentence semantic coherence, and named-entity error rates.
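A toy demonstration of how fragment injection alone moves the length-burstiness metric. The sentence lengths are hypothetical, and `pstdev` matches the length-burstiness computation sketched earlier:

```python
import statistics

before = [24, 22, 25, 23]       # uniform AI-like sentence lengths (in words)
after = [24, 4, 22, 25, 4, 23]  # same text with two short fragments injected

print(statistics.pstdev(before))  # ≈ 1.1: very low burstiness
print(statistics.pstdev(after))   # ≈ 9.2: much higher, more human-like σ
```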

Why does my human writing get flagged as AI?

The most common reasons human writing gets flagged in 2026:

1. Formal/academic style: formulaic structure (intro, body, conclusion) plus standard transitions yields low burstiness, similar to AI output.
2. Heavy editing: running text through Grammarly, ProWritingAid, or an AI-assisted rewriter smooths it toward AI patterns.
3. ESL writing: a limited working vocabulary mimics AI smoothness. A 2024 Stanford study found a 9.6-22.8% false-positive rate for ESL writers versus 1-5% for native speakers.
4. Structured thinking and a consistent voice: deliberately disciplined stylists (think William Strunk or Hemingway) have lower-than-average perplexity because they are intentional about word choice.
5. Technical/scientific writing: domain conventions reduce vocabulary variety and burstiness.
6. Short text (under 250 words): detectors are less reliable because small samples don't carry enough variance signal.
7. Translated text: passing prose through Google Translate or DeepL smooths its perplexity.
8. Style that mirrors training data: if your writing coincidentally matches what LLMs learned from (Wikipedia-style or AP-style prose), low perplexity is natural.

If you are falsely flagged: (1) request human review; (2) provide writing-process evidence such as Google Docs version history; (3) submit prior writing samples; (4) per EDUCAUSE 2025 guidance, never accept a detection score alone as evidence.

How accurate are perplexity-based detectors in 2026?

Detector accuracy by text condition, per the EyeSift 2,400-sample 2026 benchmark:

- Raw, unedited AI from current frontier models: 88-96% accuracy across detectors.
- AI + light edit (15% of words changed): 64-74%.
- AI + heavy edit (40%+): 32-48%.
- AI + humanizer tool: 18-30%.
- AI + translation: 25-40%.
- Short text (under 250 words): accuracy swings by ±20% for all detectors, regardless of method.
- New models absent from detector training data: 35-65% (e.g., DeepSeek-V3 while newer than a detector's last update).

Key insights: (1) detectors work best on raw, unedited AI essays, i.e., typical school assignments; (2) they work worst on professional content where editing is standard; (3) detection is a cat-and-mouse game, with detectors updating monthly and humanizers weekly; (4) despite 90%+ accuracy claims, in a real-world mix of human, AI, and edited content the false-positive rate is the limiting factor, not raw accuracy; (5) OpenAI's July 2023 statement still holds in 2026: "We are unable to reliably detect all AI-written text." Use detection as one signal alongside writing-process evidence, interviews, and comparison to prior work.
