EyeSift
Research Analysis · May 5, 2026 · 18 min read

How to Check If Text Is AI-Generated

Reviewed by Brazora Monk · Last updated May 5, 2026

A research analyst's breakdown of detection methods, tool accuracy benchmarks, and the manual signals that even automated detectors miss — for educators, publishers, and HR professionals who need reliable answers.

Key Takeaways

  • No single detector is definitive. The best tools (Originality.ai, GPTZero) achieve 94–99% accuracy on unmodified AI text, but accuracy drops to 2–8% once text has been processed through a humanizer. Use detectors as a signal, not a verdict.
  • Two statistical signals drive all automated detection: perplexity (word predictability) and burstiness (sentence length variation). Understanding both lets you interpret detector output and supplement it with manual analysis.
  • False positives are a documented problem. Stanford HAI (Liang et al., 2023) found 61.22% of non-native English essays misclassified as AI. Turnitin documents a 4% sentence-level false positive rate. Treat any positive detection result as a hypothesis to investigate, not a conclusion.
  • Manual detection cues remain valuable for catching AI text that has been processed through humanizer tools — because manual analysis targets content-level signals (unverifiable citations, absence of personal detail, argument over-resolution) that statistical tools cannot measure.
  • The most reliable workflow combines automated detection with targeted follow-up questions — asking the author to explain their argument, discuss their sources, or describe their process. Human authors can do this; AI-generated content without human engagement cannot survive it.

Start with the data point that reframes everything: according to independent benchmarking by AIDetectors.io (2026), when AI text is processed through a quality humanizer tool before submission, detection rates drop to 2–8% across all major platforms — including Originality.ai, which achieves 94% accuracy on unmodified AI text. That gap defines the challenge. Automated detection works well under one condition (unmodified AI output) and poorly under another (post-processing). Understanding this limitation shapes every practical decision about how to check for AI-generated text.

This guide covers what automated detectors actually measure, how to read and interpret their outputs, which tools perform best under which conditions, the manual signals that automation misses, and the workflow that produces reliable conclusions rather than false confidence in either direction.

What Automated Detectors Actually Measure

Every AI text detector — regardless of brand or price point — builds on the same two foundational signals. The marketing differs; the mechanism does not.

Perplexity: How Predictable Are the Word Choices?

Language models generate text by selecting, at each position, the statistically most probable next token given everything that preceded it. This produces text that is statistically expected — low-perplexity in formal terms. Every word in an AI-generated sentence is, by design, close to what any reader familiar with the training corpus would predict.

Human writing is different. People reach for less expected synonyms, shift registers mid-sentence, make choices influenced by personal experience that a language model has not modeled. A human expert might write “the data pointed somewhere uncomfortable” where a model would write “the data indicated significant findings.” The human sentence is higher-perplexity — less statistically predictable — and detectors flag low-perplexity text as a potential AI signal.

Per GPTZero's public methodology documentation, perplexity is one of their two primary signals. It is measured using a reference language model — typically one trained on a broad text corpus — to score how probable each word sequence is. The lower the average perplexity across a document, the higher the AI probability score.
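To see the mechanism concretely, here is a minimal sketch that scores a passage's perplexity with GPT-2 via the Hugging Face transformers library. GPTZero's actual reference model, preprocessing, and thresholds are not public, so this only illustrates the principle: the more statistically predictable the text, the lower the number.

```python
# Minimal illustration of perplexity scoring with a small reference model (GPT-2).
# This is not any vendor's production scorer -- it only shows the mechanism:
# lower perplexity = more statistically predictable text.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average perplexity of `text` under the reference model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean cross-entropy loss
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

human = "The data pointed somewhere uncomfortable, and nobody in the room wanted to say so."
ai_like = "The data indicated significant findings that are worth noting in this context."
print(f"human-style: {perplexity(human):.1f}")
print(f"ai-style:    {perplexity(ai_like):.1f}")
```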

Burstiness: Does the Sentence Structure Vary?

Burstiness measures variance in sentence length and structural complexity across a document. Human writing is highly bursty: writers naturally alternate between short, punchy sentences and long, complex ones. A paragraph written by a person often contains a three-word sentence followed by a forty-word sentence with three subordinate clauses. The variation is rhythmically natural and statistically distinctive.

AI-generated text is suspiciously uniform. A language model operating at a consistent temperature produces sentences of similar length, similar syntactic complexity, and similar information density throughout. Every paragraph feels equally weighted. Detectors operationalize this observation as a burstiness score — documents with high variance in sentence length read as human; documents with low variance read as AI.
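A rough burstiness proxy is easy to compute by hand. Commercial detectors do not publish their exact formulas, so the sketch below uses an assumed stand-in: the coefficient of variation of sentence word counts.

```python
# Rough burstiness proxy: variation in sentence length across a document.
# The coefficient of variation (std / mean) of sentence word counts is an
# illustrative stand-in, not any detector's actual formula.
import re
import statistics

def burstiness(text: str) -> float:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)

# Higher values = more human-like variation; near-zero = uniform, AI-like rhythm.
print(burstiness("Short one. Then a much longer sentence that wanders through "
                 "several clauses before finally arriving at its point. Done."))
```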

Advanced classifiers — particularly Turnitin's AIW-2 model and Originality.ai's ensemble — also analyze discourse marker patterns (how transitions and connective language are used), argument structure coherence, and paragraph-level organization. These secondary signals extend detection beyond the sentence level to the document level.

AI Detector Accuracy Benchmarks: 2026

Tool | Accuracy (Unmodified AI) | False Positive Rate | Accuracy (Humanized) | Best Use Case
GPTZero | ~99% | ~15% (university essays) | ~18% | Educational / academic
Originality.ai | 94% | 6% | 7.8% | Publisher / content marketing
Winston AI | ~95% | ~8% | ~12% | Formal documents, reports
Turnitin (AIW-2) | High (undisclosed) | <1% (document-level); 4% (sentence-level) | Moderate | Academic integrity
ZeroGPT | ~85% | 16.9% (RAID benchmark, MIT CSAIL) | ~5% | Quick free checks
EyeSift | High | Low | Moderate | Multi-signal, sentence-level

Sources: AIDetectors.io 2026 benchmarks; GPTZero public methodology; Turnitin model architecture documentation; MIT CSAIL RAID benchmark study (2024); independent accuracy audits. Humanized accuracy = detection rate after processing through leading humanizer tools.

The numbers reveal two things. First, top-tier tools are genuinely effective against unmodified AI text. Second, the entire accuracy landscape collapses against humanized content. Originality.ai — the strongest performer in most benchmarks — catches only 7.8% of AI text that has been through a humanizer. GPTZero's 18% is better, but still means roughly four out of five humanized AI documents pass undetected.

This is not a flaw in the tools. It is a structural limitation: when AI output is edited to have higher perplexity and burstiness, it genuinely looks more human on the signals detectors measure. The solution is not a better detector — it is a more comprehensive detection approach that combines statistical tools with manual analysis and human verification.

Step-by-Step: How to Check Text for AI Generation

Step 1: Run a Multi-Tool Statistical Check

Never rely on a single detector. Different tools weight signals differently and use different training data. A text that scores 85% AI on GPTZero may score 40% on ZeroGPT because their classifiers are built on different corpora. Running the same text through two or three detectors and looking for concordance — multiple tools flagging the same passages — is more reliable than trusting any single result.

Specifically: run the text through EyeSift's sentence-level detector to identify which specific passages drive the overall score. Sentence-level granularity tells you where the AI pattern is concentrated — which is actionable information whether you are editing the text or investigating the authorship.
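As a sketch of the concordance idea: once you have per-passage scores from two or three detectors (pulled from their reports or exports — the data structure below is hypothetical), flag only the passages where multiple tools agree.

```python
# Illustrative concordance check. Per-passage scores would come from each
# detector's own report or export (not shown here); the tool names, threshold,
# and structure below are hypothetical. The idea: trust passages flagged by
# several tools, not one.
from typing import Dict, List

FLAG_THRESHOLD = 0.7   # per-tool "likely AI" cutoff (illustrative)
MIN_AGREEMENT = 2      # how many tools must agree before a passage is flagged

def concordant_passages(scores_by_tool: Dict[str, List[float]]) -> List[int]:
    """Return indices of passages flagged by at least MIN_AGREEMENT tools."""
    n = len(next(iter(scores_by_tool.values())))
    flagged = []
    for i in range(n):
        votes = sum(1 for scores in scores_by_tool.values() if scores[i] >= FLAG_THRESHOLD)
        if votes >= MIN_AGREEMENT:
            flagged.append(i)
    return flagged

example = {
    "tool_a": [0.92, 0.40, 0.81],
    "tool_b": [0.88, 0.35, 0.45],
    "tool_c": [0.90, 0.20, 0.76],
}
print(concordant_passages(example))  # -> [0, 2]: two or more tools agree on passages 0 and 2
```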

Step 2: Check the False Positive Context

Before treating a positive detection result as evidence of AI authorship, consider the demographic and stylistic context. The Stanford HAI study (Liang et al., 2023) is the definitive research here: non-native English speakers' essays were misclassified as AI at a 61.22% rate across seven detectors, while native English students in the same study faced false positive rates under 10%.

Formal academic writing, technical documentation, and legal prose from experienced human writers also systematically trigger false positives — because expertise in formal writing produces low perplexity and low burstiness. If the text is from someone who writes formally by training or culture, a positive detection result requires additional investigation before any conclusion.

Step 3: Apply Manual Detection Cues

Manual detection targets content-level signals that statistical classifiers cannot measure. These cues survive humanizer processing because they reflect what the text says, not how it is structured.

Unverifiable or suspicious citations. AI models hallucinate. They generate plausible-sounding citations — journal names, author names, study titles — that do not exist. Look up every cited source. A paper that cannot be found in Google Scholar, PubMed, or JSTOR is a strong AI signal. This is the most reliable manual detection method because it identifies something a human author would not do: fabricate their own bibliography.
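One practical way to triage a bibliography is to query Crossref's public works API and see whether each citation resolves to a real record. A miss is a prompt for manual follow-up in Google Scholar or PubMed, not proof of fabrication; the sketch below is illustrative only.

```python
# Rough citation triage against Crossref's public works API. A miss here is a
# signal to investigate manually, not proof of a fabricated source: Crossref
# coverage is broad but not universal, and paraphrased titles may still
# partially match real records.
import requests

def crossref_candidates(citation: str, rows: int = 3):
    """Return (title, DOI) pairs for the closest bibliographic matches."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": citation, "rows": rows},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return [((item.get("title") or ["<no title>"])[0], item.get("DOI")) for item in items]

# Example: the Stanford HAI paper cited above should resolve to a real record.
for title, doi in crossref_candidates(
    "GPT detectors are biased against non-native English writers"
):
    print(f"{title}  (doi: {doi})")
```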

Discourse marker density. Words like “furthermore,” “moreover,” “it is worth noting,” “delve,” “leverage,” “paramount,” and “seamlessly” are dramatically overrepresented in AI-generated text relative to authentic human prose. Per Pangram Labs' 2025 analysis of detection failure modes, the co-occurrence pattern of these markers in a single document is a strong AI signal even after humanizer processing, because humanizer tools do not consistently remove all discourse markers.
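A simple density count makes this cue measurable. The marker list below is drawn from the terms named above and is purely illustrative — it is not Pangram Labs' lexicon or weighting scheme.

```python
# Quick discourse-marker density check. The marker list is illustrative,
# not any vendor's actual lexicon or weighting.
import re

MARKERS = [
    "furthermore", "moreover", "it is worth noting", "delve",
    "leverage", "paramount", "seamlessly",
]

def marker_density(text: str) -> tuple[int, float]:
    """Return (total marker hits, hits per 100 words)."""
    lowered = text.lower()
    words = len(text.split())
    hits = sum(len(re.findall(r"\b" + re.escape(m) + r"\b", lowered)) for m in MARKERS)
    return hits, (100 * hits / words) if words else 0.0

sample = ("Furthermore, it is worth noting that organizations can leverage "
          "these paramount insights to seamlessly delve into the data.")
print(marker_density(sample))  # several distinct markers co-occurring in one sentence
```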

Over-resolution of arguments. AI writing almost never leaves a genuine loose end. Every paragraph begins with a topic sentence, makes a claim, provides evidence, and resolves the point before moving on. Human writing — especially expert human writing — is structurally messier. It revisits ideas, leaves qualifications unresolved, or acknowledges complexity it cannot fully address. Text that is suspiciously clean in its argumentation structure is worth flagging for further investigation.

Absence of specific personal or professional experience. Human writers, particularly in professional and academic contexts, tend to anchor claims in specific experiences: a particular project, a named colleague, a remembered case. AI models produce universalizing, impersonal prose: “organizations typically find that...” “researchers generally agree...” “most professionals report...” The absence of any specific, personal, unverifiable-but-plausible detail is a content-level AI signal.

Step 4: Use Targeted Follow-Up Questions

For educational and professional contexts, the most reliable verification method is direct: ask the author questions about their work. Not questions that can be answered by rereading the text, but questions that require genuine understanding:

  • Can you explain the methodology of the study you cited in section two?
  • What made you structure the argument this particular way — what alternatives did you consider?
  • How did you find the data in the third paragraph, and what was the original source?
  • What is your opinion on the counterargument — why did you address it the way you did?

Human authors who genuinely engaged with their work can answer these questions in their own voice, with personal reflection. AI-generated content submitted without meaningful human engagement cannot produce answers to questions requiring actual knowledge of process, source, and reasoning. This is why the “oral defense” model — asking students or applicants to discuss their submitted work — is increasingly considered the gold standard for AI authorship verification.

Model-Specific Detection Rates: ChatGPT vs. Claude vs. Gemini

Detection rates vary meaningfully by the underlying model that generated the text, which matters for educators and publishers targeting specific tools.

Per 2026 benchmarking, ChatGPT (GPT-4o and successors) generates the most detectable output — with detection rates of 92–96% on leading detectors. ChatGPT's training optimizes for coherent, expected prose at default temperature settings, which produces reliably low perplexity. Claude models generate text that is somewhat harder to detect (75–85% detection rate), partly because Anthropic's Constitutional AI training produces stylistically different output from RLHF-trained OpenAI models. Gemini shows the most variability, with detection rates ranging from 70–82% depending on model version and prompt structure.

These detection rate differences matter less than they might appear, because they all collapse against humanizer processing. The practical implication is that model-specific detection is more valuable as a baseline than as an ongoing strategy — once humanization is involved, model attribution becomes speculative.

When Detection Results Are and Are Not Actionable

The most important contextual judgment in AI detection is determining what a result can and cannot support. Being clear about this protects both the people investigating and the people being investigated.

Detection results can appropriately support: initiating a follow-up conversation, requesting additional documentation, applying heightened editorial scrutiny, requesting an oral explanation of submitted work, or flagging text for secondary human review.

Detection results cannot appropriately support: determining guilt or misconduct as a standalone finding, rejecting job applications without human review, failing students without additional evidence, or making legal claims about authorship. Turnitin's documentation explicitly states that AI detection scores should not be used as sole evidence in academic integrity proceedings. The same principle applies in all contexts.

This is not a limitation of the technology that will be solved by a better tool. It is an inherent property of probabilistic classification: a detector with a 6% false positive rate will flag roughly 6 of every 100 genuinely human documents, and in a context where most authors are legitimate, those false flags accumulate quickly. The false positive problem is well-documented and should be part of every detection workflow's design.
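The arithmetic makes the point concrete. In the sketch below, the 94% detection rate and 6% false positive rate come from the benchmark table above; the assumption that 10% of submissions are AI-generated is purely for illustration.

```python
# Base-rate illustration of why false positives matter. Detection and false
# positive rates are taken from the benchmark table above; the 10% AI
# prevalence is an assumed figure for illustration only.
def flags_breakdown(total_docs: int, ai_share: float, detection_rate: float, fp_rate: float):
    ai_docs = total_docs * ai_share
    human_docs = total_docs - ai_docs
    true_flags = ai_docs * detection_rate        # AI documents correctly flagged
    false_flags = human_docs * fp_rate           # human documents wrongly flagged
    precision = true_flags / (true_flags + false_flags)
    return true_flags, false_flags, precision

tp, fp, precision = flags_breakdown(total_docs=1000, ai_share=0.10,
                                    detection_rate=0.94, fp_rate=0.06)
print(f"true flags: {tp:.0f}, false flags: {fp:.0f}, "
      f"share of flags that are correct: {precision:.0%}")
# -> true flags: 94, false flags: 54, share of flags that are correct: 64%
```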

Practical Tool Selection by Context

For educators and academic institutions: Turnitin remains the standard for formal academic integrity processes because of its institutional integration, its well-documented model architecture, and its explicit guidance against using scores as standalone evidence. GPTZero is a strong supplementary tool for sentence-level granularity. Neither should be used without a human review process for contested results.

For publishers and content teams: Originality.ai delivers the best accuracy-to-cost ratio for bulk content review, with multi-document scanning and plagiarism integration. It is less suitable in ESL author contexts due to known false positive bias. The full publisher guide to AI detection tools covers workflow integration in more detail.

For HR and recruiting: AI detection of application materials should be treated as a flag for follow-up, not a disqualification criterion. The SHRM State of AI in HR 2026 report found AI adoption in HR functions at 43% — the irony being that HR departments increasingly use AI to process applications while also questioning AI use in applications. A candidate who used AI to polish a cover letter that accurately represents their experience is different from a candidate who fabricated qualifications. That distinction requires human judgment to make.

Frequently Asked Questions

What is the most accurate free AI detector?

GPTZero offers a free tier with strong accuracy (~99% on unmodified AI text) and sentence-level highlighting. ZeroGPT is fully free but has a higher false positive rate (16.9% in MIT CSAIL's RAID benchmark). EyeSift provides free sentence-level AI detection with multi-signal analysis. For paid tools, Originality.ai at $0.01/credit is the accuracy leader at 94% with only 6% false positives in independent 2026 benchmarks.

How do I detect AI text that has been paraphrased or humanized?

Statistical tools are largely ineffective against well-humanized AI text — accuracy drops to 2–8% in independent benchmarks. The reliable methods are content-level: check citations (AI hallucinations survive humanization), look for discourse marker clusters (furthermore, moreover, leverage, delve), and apply the follow-up question test — ask the author to explain their argument and sources in their own words. Humanized text can fool perplexity detectors but cannot produce genuine engagement with specific source material that was never consulted.

Can I tell if an email was written by AI?

Email is particularly hard to detect algorithmically because it is often short (under the minimum word counts detectors require for reliability) and AI-assisted email editing is extremely common. Manual signals apply: overly formal transitions in casual context, generic closings like “I hope this email finds you well,” and unusually comprehensive coverage of a topic in a context where brevity would be expected. Statistical tools work better on emails over 200 words; below that threshold, manual judgment is more reliable than any detector.

Does Turnitin detect Claude, Gemini, and other non-ChatGPT models?

Yes. Turnitin's AIW-2 model was trained on output from multiple AI systems, not just ChatGPT. Detection rates vary by model: ChatGPT output is most detectable (92–96%); Claude output shows 75–85% detection rates; Gemini ranges from 70–82%. Turnitin updates its model periodically as new AI systems are released, though the update cadence typically lags the deployment of new models by several months — which creates a window where very new model output may be harder to detect.

What should I do if someone is falsely accused of AI writing?

First, obtain the specific detection result — which tool, what score, which passages were flagged. Then request that the institution or reviewer follow their own policy: Turnitin explicitly states scores are not standalone evidence. Provide supporting documentation: prior drafts, research notes, browser history, and version history showing the revision process. Request an oral review — the ability to discuss sources and argument in detail is the most defensible evidence of genuine authorship, and a false positive accusation collapses when authorship can be demonstrated through substantive engagement with the work.

How many words does text need to be for AI detection to work?

Most detectors require a minimum of 150–300 words to produce reliable results. Turnitin requires 300 words and explicitly notes that shorter documents produce less reliable scores. GPTZero's accuracy improves significantly above 250 words. Below these thresholds, the statistical sample is insufficient to establish a reliable pattern — which is why email and social media posts are poorly served by automated detection tools.

Will AI watermarking solve the detection problem?

Potentially — but only for output from models that embed watermarks. A 2024 University of Maryland study found cryptographic watermarks survived 50% token substitution at 95% detection accuracy, suggesting watermarks are more robust than statistical signals. OpenAI, Google, and others are implementing watermarking, but adoption is uneven and detection requires access to the watermark key. Watermarking also does nothing to address false positives on genuinely human text — it solves a different problem from statistical-signal-based detection.

The Practical Conclusion

Checking whether text is AI-generated is a workflow problem, not a tool problem. The tools are useful — genuinely so, for unmodified AI text — but they are neither sufficient alone nor reliable in all contexts. The workflow that produces defensible conclusions combines automated detection (for baseline statistical signals), citation verification (for the content-level signals automation misses), and human conversation (for the authorship verification that no algorithm can replicate).

Treat detection results as the start of an investigation, not the end. Use the sentence-level highlights to focus your manual analysis. And apply the false-positive context check — particularly for non-native speakers, formal writers, and technical documents — before drawing conclusions from any single score.

Check Any Text for AI — Free, No Signup

EyeSift's AI detector provides sentence-level scoring so you can see exactly which passages read as AI-generated — not just an overall percentage.

Run AI Detection Check