AI Text Detection Signals 2026 — How Detectors Actually Work
Short answer: AI text detectors in 2026 use 7 primary signal categories: (1) perplexity (low = AI-like), (2) burstiness (uniform sentence length = AI-like), (3) n-gram repetition (overused AI phrases), (4) statistical watermarks (SynthID, Aaronson, Kirchenbauer schemes), (5) zero-shot likelihood probes (DetectGPT, GPTZero core), (6) supervised classifiers trained on labeled corpora, and (7) stylometric fingerprints (function-word distribution, syntactic patterns). Modern detectors combine multiple signals; no single signal is reliable alone.
The 7 detection signals — full breakdown
| Signal | What it measures | AI tendency | Defeated by |
|---|---|---|---|
| Perplexity | Average token surprise (log-likelihood from reference LLM) | Low (predictable token choices) | Paraphrasers, deliberate "human-like" prompting |
| Burstiness | Sentence-length variance / mean | Uniform medium length | Manual mixing of long + short sentences |
| N-gram repetition | Frequency of "AI tell" phrases vs corpus baseline | Over-use ("delve", "tapestry", "navigate", "in conclusion") | Custom system prompts banning specific phrases |
| Watermarks | Statistical signature embedded at sampling time (SynthID, Kirchenbauer) | Carry signature if model owner cooperates | Paraphrasing, translation, manual edits |
| Zero-shot probes | DetectGPT log-likelihood curvature | Sit on local likelihood maxima | Adversarial perturbation training |
| Supervised classifier | Neural net trained on labeled AI vs human samples | Match training distribution | New models not in training set, distribution shift |
| Stylometry | Function-word freq, syntactic complexity, POS-tag entropy | Lower variance, formal register | Style transfer, prompt engineering for "casual" tone |
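The burstiness row above is the simplest signal to make concrete. Here is a minimal sketch, assuming the common definition used in the table (sentence-length variance divided by mean); the sentence splitter and example strings are illustrative, not from any particular detector:

```python
import re

def burstiness(text: str) -> float:
    """Sentence-length variance divided by mean sentence length.
    Low values (uniform pacing) sit at the 'AI-like' end of the signal."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return variance / mean

uniform = ("This is a simple sentence. Here is another plain one. "
           "This one looks much the same.")
varied = ("Short. But sometimes a writer stretches one thought across a "
          "great many words before finally coming to rest. Then stops.")
print(burstiness(uniform) < burstiness(varied))  # True: varied pacing scores higher
```

Note the limitation flagged in the table: a human who naturally writes even-length sentences (or a formal genre that demands them) scores "AI-like" on this metric, which is one source of false positives.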
200+ documented "AI tell" phrases (n-gram repetition signal)
Words and phrases that occur 3-15× more frequently in GPT-4-class output than in matched human writing (per Zellers et al. 2024, "AI-Generated Text Detection in the Wild"). A representative sample from the documented list:
Noun phrases: tapestry, landscape (of), realm (of), ecosystem (of), paradigm (shift), nuanced perspective, multifaceted approach, intricate balance, holistic view, robust framework, transformative impact, paramount importance
Connectives: in conclusion (overuse), it is important to note that, it is worth noting, in this comprehensive guide, dive deeper, deep dive, in essence, ultimately
Hedging: while it may, however it is, on the other hand (over-frequency), nevertheless, in light of, in the realm of, when it comes to
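A crude version of this n-gram signal is easy to sketch. The mini-lexicon below is a hypothetical subset for illustration; a production detector would score hundreds of phrases against a human-corpus baseline rather than a flat list:

```python
import re

# Hypothetical mini-lexicon for illustration; a real detector scores
# hundreds of phrases against a matched human-written corpus baseline.
TELL_PHRASES = ("tapestry", "delve", "multifaceted", "realm of",
                "in conclusion", "it is important to note", "holistic")

def tell_phrase_rate(text: str) -> float:
    """Tell-phrase hits per 1,000 words (crude n-gram repetition score)."""
    lower = text.lower()
    n_words = len(re.findall(r"[a-z']+", lower)) or 1
    hits = sum(lower.count(phrase) for phrase in TELL_PHRASES)
    return 1000.0 * hits / n_words

sample = ("In conclusion, it is important to note that the rich tapestry "
          "of ideas invites us to delve into a multifaceted realm of nuance.")
print(tell_phrase_rate(sample) > 100)  # True: dense with tell phrases
```

As the table notes, this signal is trivially defeated by a system prompt that bans the listed phrases, which is why it is only ever one input among several.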
Detection accuracy by AI model class (2025-2026 benchmarks)
| Model class | Avg detection accuracy | Hardest case |
|---|---|---|
| GPT-3.5 (raw) | 96-99% | Easy — strong AI tells |
| GPT-4 / GPT-4o (raw) | 88-95% | Better stylistic variance than 3.5 |
| GPT-4 with custom system prompt | 70-85% | Prompt-tuned for "human casual" |
| GPT-4 + paraphraser pass | 55-75% | Quillbot/Undetectable disrupt n-grams |
| Claude 3.5 / 4 Sonnet | 82-90% | Higher burstiness than GPT |
| Gemini 2.5 Pro | 85-92% | Mixed multilingual output edge cases |
| Llama 3.1 / 3.3 (open source) | 80-88% | Many fine-tunes; distribution drift |
| Mixed human + AI editing | 50-70% | Span-level detection required |
Why detection isn't 100% reliable — and never will be
- Theoretical limit (Sadasivan et al. 2024): As LLMs approach human-level distribution, statistical detection approaches a fundamental Bayesian error floor. For text that is genuinely indistinguishable from human writing on token-level statistics, no detector can do better than random.
- Adversarial paraphrasing scales faster than detection: each time a new detection method is published, counter-paraphrasers are trained against it within months. Detection vs evasion is a perpetual cat-and-mouse game, with the cat lagging.
- Non-native English bias (Liang et al. 2023, Stanford): all 7 popular detectors tested flagged 19-97% of essays by non-native writers as AI-generated. This stems from training-data bias and is structural, not an easily patched bug.
- Genre flattening: highly formal genres (legal briefs, academic abstracts, medical reports) have intrinsically low perplexity and low burstiness because the GENRE itself demands predictable, uniform prose. Detectors applied to these genres produce high false-positive rates.
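Watermarking is the one signal that sidesteps the statistical floor above, because it tests for an embedded signature rather than modeling "human-ness". A minimal sketch of a Kirchenbauer-style green-list z-test follows; the hash function here is a simplified stand-in for the paper's keyed scheme, and the vocabulary and sampler are toy assumptions. It also shows why paraphrasing defeats the signal: each replaced token has only a gamma chance of landing back in the green list, dragging z toward zero:

```python
import hashlib
import math

def is_green(prev_token: str, token: str, gamma: float = 0.5) -> bool:
    """Pseudo-random 'green list' membership seeded on the previous token.
    (Simplified stand-in for the keyed hashing in Kirchenbauer et al.)"""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < int(gamma * 256)

def watermark_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """One-proportion z-test: how far the green-token count exceeds the
    gamma baseline expected in unwatermarked text."""
    trials = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - gamma * trials) / math.sqrt(trials * gamma * (1 - gamma))

# Toy 'watermarking sampler' that always prefers a green next token:
vocab = [f"tok{i}" for i in range(50)]
tokens = ["start"]
for _ in range(100):
    tokens.append(next((t for t in vocab if is_green(tokens[-1], t)), vocab[0]))
print(watermark_z_score(tokens) > 4)  # True: strong evidence of a watermark
```

In the real schemes the green list biases sampling probabilities rather than forcing green tokens outright, and the key is secret, so only the model owner (or a cooperating verifier) can run this test.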
Recommended best practices when using detectors
- Never rely on a single detector — use 2-3 in agreement
- Use span-level (sentence) scores rather than document averages
- Be especially cautious with non-native English writers, formal genres, and translated text
- Treat 50-70% confidence as "inconclusive" rather than positive — require 85%+ for action
- Pair detection with process signals (revision history, draft snapshots, viva-voce questioning) for high-stakes decisions
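The multi-detector and threshold rules above can be reduced to a small decision function. Detector names, the 0.85 flag threshold, and the two-detector agreement rule are the illustrative values from this list, not a standard:

```python
def ensemble_verdict(scores: dict[str, float], flag_threshold: float = 0.85,
                     min_agree: int = 2) -> str:
    """Require multiple detectors to agree at a high threshold before
    flagging; treat mid-range scores as inconclusive rather than positive."""
    confident_flags = sum(s >= flag_threshold for s in scores.values())
    if confident_flags >= min_agree:
        return "likely AI"
    if all(s < 0.5 for s in scores.values()):
        return "likely human"
    return "inconclusive"

print(ensemble_verdict({"detA": 0.91, "detB": 0.88, "detC": 0.62}))  # likely AI
print(ensemble_verdict({"detA": 0.64, "detB": 0.58, "detC": 0.71}))  # inconclusive
```

Even a "likely AI" verdict from this kind of ensemble should still be paired with the process signals above before any high-stakes action.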
Related Eyesift resources
- Best AI Detectors 2026 — full comparison
- Complete Guide to AI Detection
- AI Detection Accuracy Benchmarks
- AI Detection False Positives
- Free AI Text Detector (Eyesift)
Sources: Mitchell et al. (2023) DetectGPT (NeurIPS); Kirchenbauer et al. (2023) Watermark for Large Language Models (ICML); Zellers et al. (2024) AI-Generated Text Detection in the Wild; Liang et al. (2023) GPT detectors are biased against non-native English writers (Patterns); Sadasivan et al. (2024) Can AI-Generated Text Be Reliably Detected? (TMLR); SynthID Text technical paper (Google DeepMind 2024). All listed numbers reflect published benchmark ranges; individual detector performance varies with input length, content domain, and model version.