We Tested Our AI Content Against Every Major Detection Tool. Here’s What Actually Works.


The AI content detection arms race is heating up. As more businesses adopt AI-powered content creation, detection tools have become increasingly sophisticated and, frankly, increasingly paranoid. We spent the past few days running our content through multiple detection platforms, testing different AI models, and measuring exactly how effective our humanisation pipeline really is. We plan to do a lot more of this.

The results surprised us. Here’s the full breakdown.

The Problem Every AI Content Creator Faces

If you’re using AI to generate content at scale, you’ve probably experienced the anxiety of wondering whether your articles will get flagged. Google hasn’t officially penalised AI content, but the reputational and editorial risks remain real. Clients want assurance. Editors want peace of mind. And detection tools are everywhere.

The challenge is that these tools don’t agree with each other. An article that sails through ZeroGPT might get hammered by GPTZero.me. Content that Quillbot barely notices could trigger alarm bells elsewhere. This inconsistency makes it genuinely difficult to know whether your content is “safe” or not.

We decided to stop guessing and start measuring.

Our Testing Methodology

We selected eight articles across English and German, generated using two different large language models: Claude Opus 4.5 and GPT 5.2. Each article was run through three major detection platforms (ZeroGPT, Quillbot’s AI detector, and GPTZero.me) at three different stages of our pipeline.

The first measurement captured the raw AI output, straight from the model with no modifications. The second measurement came after running the content through our humanisation process. The third measurement tested the same humanised content with all markdown formatting stripped out, since we’d noticed that some detection tools seem to weight formatting patterns in their analysis.

This gave us a matrix of 72 data points to analyse: eight articles, three tools, three pipeline stages.
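For anyone who wants to replicate this kind of audit, the sketch below shows the shape of the experiment. The helper functions (detection_score, stage_text) are hypothetical stand-ins for whichever detector APIs and pipeline steps you actually use; the point is simply the eight-by-three-by-three grid.

```python
from itertools import product

# Illustrative reconstruction of the test matrix; the helpers below are
# hypothetical placeholders, not our actual pipeline or detector clients.
ARTICLES = [f"article_{i}" for i in range(1, 9)]           # eight articles (EN + DE)
TOOLS = ["ZeroGPT", "Quillbot", "GPTZero.me"]              # three detection platforms
STAGES = ["raw", "humanised", "humanised_no_markdown"]     # three pipeline stages

def stage_text(article: str, stage: str) -> str:
    # Placeholder: return the article's text at the given pipeline stage.
    return f"{article} at stage {stage}"

def detection_score(tool: str, text: str) -> float:
    # Placeholder: call the detector's API and return the % flagged as AI.
    return 0.0

results = {
    (article, tool, stage): detection_score(tool, stage_text(article, stage))
    for article, tool, stage in product(ARTICLES, TOOLS, STAGES)
}

assert len(results) == 8 * 3 * 3  # 72 data points
```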

What We Learned About Detection Tools

The first thing that became immediately clear is that detection tools have wildly different sensitivities and methodologies.

ZeroGPT proved to be the most lenient of the three. Initial detection scores ranged from as low as 7.51% to as high as 85.72%, with significant variance depending on the source model and content type. This tool seems to weight certain linguistic patterns heavily, which means some AI content passes almost undetected while other pieces get flagged aggressively.

Quillbot’s detector landed in the middle ground. We saw initial scores typically ranging from 14% to 88%, with more consistency than ZeroGPT but less severity than GPTZero.me. Interestingly, Quillbot appeared particularly sensitive to markup and formatting: stripping these elements often produced meaningful score reductions.

GPTZero.me was the strictest by a considerable margin. Every single piece of Opus 4.5 content scored 100% on initial generation. GPT 5.2 fared slightly better, occasionally dipping to 76%, but the baseline assumption from this tool seems to be that any fluent, well-structured content is probably AI-generated. This creates obvious challenges for legitimate use cases.

Opus 4.5 vs GPT 5.2: A Tale of Two Models

Comparing the two models revealed an interesting trade-off between consistency and ceiling performance.

GPT 5.2 produced more consistent initial scores across all detection tools. On ZeroGPT, scores clustered between 24.98% and 67.37%, with no dramatic outliers in either direction. This predictability makes it easier to set expectations and plan content strategies around known thresholds.

Opus 4.5, by contrast, showed much higher variance. Some articles scored remarkably low on ZeroGPT (7.51%), while others hit 85.72%. This unpredictability is a double-edged sword: you might get lucky with content that flies under the radar, or you might end up with pieces that need significant rework.

Where things got interesting was post-humanisation performance. On ZeroGPT and Quillbot, GPT 5.2 maintained its consistency advantage, settling into tight ranges of 13-23% and 4-35% respectively. But on GPTZero.me, the toughest tool, Opus 4.5 actually humanised better, dropping from 100% to as low as 29%, while GPT 5.2 content remained stubbornly elevated at 58-82%.

The takeaway? Model selection should depend on which detection tools your audience or clients prioritise.

The Humanisation Pipeline Works

The most important finding from this entire exercise is that our humanisation process delivers consistent, measurable improvements across every model and detection tool combination we tested.

On ZeroGPT, we saw Opus content drop from averages around 33% to 22% post-humanisation, with GPT 5.2 falling from 52% to under 19%. Quillbot scores improved even more dramatically: Opus content went from an average of 70% detection to around 31%, with some articles reaching single digits. Even the notoriously strict GPTZero.me showed meaningful movement, with Opus content dropping from guaranteed 100% detection to an average of 43%.

We also confirmed that markup matters. Stripping formatting consistently improved scores across all tools, sometimes by 10-15 percentage points. This suggests that detection algorithms are partially keying on structural patterns rather than purely linguistic signals.
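To make “stripping formatting” concrete, here is a rough sketch of the kind of cleanup we mean. It is an illustrative approximation rather than our production cleaner, and it will not catch every markdown construct, but it is enough to test whether a detector is reacting to structure rather than language.

```python
import re

def strip_markdown(text: str) -> str:
    """Rough markdown stripper for the 'no formatting' test stage (illustrative only)."""
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.MULTILINE)   # remove heading markers
    text = re.sub(r"\*\*(.*?)\*\*", r"\1", text)                 # unwrap bold
    text = re.sub(r"\*(.*?)\*", r"\1", text)                     # unwrap italics
    text = re.sub(r"\[(.*?)\]\([^)]*\)", r"\1", text)            # links -> anchor text only
    text = re.sub(r"^[-*+]\s+", "", text, flags=re.MULTILINE)    # drop bullet markers
    return text
```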

The bottom line: the pipeline is doing exactly what it’s supposed to do. Raw AI content is risky. Humanised content performs dramatically better across the board.

Why We’re Building Around Ahrefs, Not Surfer SEO

As we’ve developed SEOZilla’s content automation capabilities (and also our sister site Teralios.de), we’ve had to make strategic choices about which SEO platforms to integrate with most deeply. Both Surfer SEO and Ahrefs are excellent tools with passionate user bases, but we’ve chosen to focus our efforts on Ahrefs integration.

The reasoning comes down to data depth and workflow alignment. Surfer SEO excels at on-page optimisation and content scoring: it tells you how well a specific piece matches search intent and competitor patterns. Ahrefs, on the other hand, provides the comprehensive keyword research, backlink analysis, and competitive intelligence that inform content strategy at a higher level.

For businesses generating content at scale, understanding which topics to pursue matters as much as optimising individual articles. Ahrefs’ database depth, historical tracking, and site audit capabilities make it the stronger foundation for enterprise content operations. When you’re producing dozens or hundreds of articles monthly, you need the strategic layer that Ahrefs provides, not just page-level optimisation scores.

What This Means For Your Content Strategy

If you’re producing AI content at scale, here’s what our testing suggests you should do.

First, don’t rely on a single detection tool for validation. The disagreement between platforms means that passing one test guarantees nothing about the others. Test against multiple tools, or at minimum test against the strictest one (currently GPTZero.me) to establish a realistic baseline; a simple way to wire this up is sketched after this list.

Second, humanisation isn’t optional; it’s essential. Raw AI output, regardless of which model you use, carries significant detection risk. A proper humanisation pipeline reduces that risk by 30-70% depending on the tool and content type.

Third, consider your formatting. If detection scores matter to your workflow, test both with and without markdown or HTML formatting. The differences can be substantial.

Finally, match your model to your priorities. If you need predictable, consistent results, GPT 5.2 is currently the safer choice. If you’re optimising for the toughest detection tools and can tolerate some variance, Opus 4.5 might deliver better ceiling performance.
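If you want to operationalise the “test against multiple tools” advice from the first point, a minimal gate might look like the sketch below. The detector functions and the 40% threshold are assumptions chosen to illustrate the pattern; substitute whichever APIs and risk tolerance apply to your workflow.

```python
THRESHOLD = 40.0  # assumed example threshold (% flagged as AI); tune to your own risk tolerance

def passes_detection_gate(text: str, detectors: dict) -> tuple[bool, dict]:
    """Run every detector and gate on the worst (highest) score."""
    scores = {name: score_fn(text) for name, score_fn in detectors.items()}
    return max(scores.values()) <= THRESHOLD, scores

# Usage (the detector functions are hypothetical wrappers around each tool's API):
# ok, scores = passes_detection_gate(article_text,
#     {"ZeroGPT": zerogpt_score, "Quillbot": quillbot_score, "GPTZero.me": gptzero_score})
```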

The AI content landscape will keep evolving. Detection tools will get smarter. Models will get better at mimicking human writing. What won’t change is the need for rigorous testing, continuous measurement, and pipelines that actually deliver results.

We’ll keep testing. We’ll keep measuring. And we’ll keep sharing what we learn.

SEOZilla

SEOZilla is an AI SEO platform designed to automate and scale search-engine optimization for agencies and businesses. The SEOZilla ecosystem includes autonomous AI agents for content creation, WhiteLabelSEO.ai for agencies needing a complete white label SEO solution, and SEOContentWriters.ai for Human+AI content production. Together, these tools generate optimized content, streamline keyword workflows, and support high-quality editorial output. SEOZilla helps agencies keep clients in-house, increase efficiency, and deliver stronger organic results. With AI-powered automation and expert SEO content writers, the ecosystem provides a scalable, future-ready approach to modern SEO.
