You ran the banned-phrase scan. Clean. You asked Claude to review the draft for voice drift. It came back clean. Published the article. Three days later a practitioner emailed: they clocked it as AI by paragraph two.
They were right. The tells weren’t in any phrase list. They were structural: a mechanical cadence in how examples were introduced, simulated-candor constructions the generating model didn’t register as tells because, from its perspective, they aren’t tells. To the model that produced them, those constructions are normal prose. Baseline, not deviation.
This is the structural flaw in LLM self-review. A model cannot detect its own tells because its tells are its training distribution. The patterns that read as AI-generated to an experienced reader don’t register as deviations inside the model that produced them. You’re asking it to measure the prose with the same ruler it used to write it.
Why the RLHF default register creates a detection blind spot
Post-training via reinforcement learning from human feedback pushes model outputs toward what human raters preferred during evaluation. Those raters scored prose across thousands of examples, and certain patterns consistently scored well: clear topic sentences, forward transitions, balanced section openings, simulated transparency markers. Those patterns became the model’s default register, not a style choice but a learned baseline.
The result: when you ask Claude to identify AI-sounding prose in its own output, it can only catch the tells its training explicitly associated with AI generation. The structural patterns that are AI-typical because they’re also Claude-typical don’t register as abnormal. You’d need a model trained to recognize Claude’s specific defaults as unusual. Claude doesn’t have that frame. It is that frame.
The “just iterate the prompt” advice survives because nobody runs the regression. More specific instructions, clearer persona specs, longer constraint lists: none of these change what the model considers normal output at the generation level. They change what it produces in specific places. The underlying register persists.
What Gemini caught that Sonnet missed
We ran 4 persona-generated article sections through the production review pipeline, then submitted the same 4 outputs to Gemini 1.5 Pro with a single adversarial brief: find evidence this was written by an AI. Don’t rewrite anything. Flag what reads as generated.
Sonnet’s self-review had flagged minor coverage gaps and returned clean results on voice register. Gemini found 4 structural patterns Sonnet had not flagged, across all 4 outputs:
1. Mechanical execution of required rhetorical moves. The persona spec includes a set of rhetorical actions the writer must perform in each section: concrete examples, summary closers. Claude executed all of them. But across all 4 outputs, the execution used near-identical sentence-opening cadences: the same lead-in construction for examples appeared in 3 of 4 sections, and one specific summary-closer construction appeared in every closing sentence (11 of 11). Gemini’s note: “structural rigidity, the pattern repeats with mechanical consistency that no human writer would sustain across this volume.” The requirement was met. The tell was in how uniformly it was met.
2. Simulated-candor hooks. Three of the 4 outputs contained constructions like “I should say upfront” and “to be clear about what we found”, phrases that announce transparency rather than demonstrate it. These aren’t in any standard phrase ban list. They’re a newer generation of the same tell, operating at a higher register. Gemini flagged all 3. Sonnet missed all 3.
3. Cross-persona structural uniformity. All 4 outputs used the same paragraph-length distribution: a medium-length opener, two or three body paragraphs, a closing sentence. Different topics, different personas, same scaffold. Gemini’s assessment: “the structural fingerprint is consistent regardless of voice spec, this suggests a single generator constrained by personas rather than genuinely different voices.”
4. No tangents. Every output stayed precisely on topic. No named detours, no off-thread asides that returned to the main line. The practitioner persona spec explicitly requires a hyper-specific tangent per article, a detour to a named experience before returning to the main line. Gemini observed the absence: the drafts read “like completions of a schema, not documents written by someone who got distracted and caught themselves.”
4 patterns. Sonnet caught 0. This isn’t a capability gap; it’s a structural one. Sonnet cannot evaluate its own defaults as defaults.
Why different training is the mechanism, not a workaround
Gemini and Claude were trained on different text corpora, fine-tuned with different human preference data, and post-trained by different teams with different evaluation rubrics. What registers as “good pedagogical prose” in Gemini’s training can register as “mechanical cadence” in Claude’s output distribution. The blind spots don’t align.
The overlapping coverage area, patterns both models flag, gives you the highest-confidence tells. The non-overlapping area, what one flags and the other misses, is where you learn something new about each model’s defaults. In the 4-output test, every pattern Gemini caught fell into that non-overlapping area. Sonnet wasn’t wrong to miss them. They weren’t deviations from Claude’s perspective. That’s precisely the problem.
One parenthetical worth making explicit: this doesn’t guarantee universally clean output. Gemini has its own tells (flatter sentence rhythm, more explicit logical connectives), and a trained reader can spot them. The goal is Claude-flavored output with Claude’s specific structural defaults removed, not an output that passes every possible detection frame.
This is also where the “just use the best model” recommendation breaks down. The model best at generation is, by definition, the model least suited for self-review. Its strengths are its blind spots. The review model needs to be different, not better. That distinction is the whole point, and it’s missing from nearly every LLM quality workflow we’ve seen documented.
The integration: critic-only, not co-author
The Gemini pass runs as a critic-only step. It does not rewrite anything. Gemini receives the draft and returns a structured list of detected tells: for each one, a type label, the verbatim passage flagged, and the reason it reads as generated prose. That list goes to Claude, which patches each flagged instance using the original draft as the base.
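A minimal sketch of that critic pass, assuming the google-generativeai Python SDK; the prompt wording, JSON field names, and model string are illustrative, not the production pipeline:

```python
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Adversarial brief: accuse, don't ask for a rewrite. Wording is illustrative.
CRITIC_BRIEF = """Assume the following draft was written by Claude.
Find the evidence it was generated. Do not rewrite anything.
Return a JSON array of objects with keys:
  "type"    - short label for the tell (e.g. "simulated-candor hook"),
  "passage" - the verbatim text being flagged,
  "reason"  - why it reads as generated prose.

DRAFT:
"""

def critic_pass(draft: str, model_name: str = "gemini-1.5-pro") -> list[dict]:
    """Run the critic and return structured flags; the draft is never modified here."""
    model = genai.GenerativeModel(
        model_name,
        generation_config={"response_mime_type": "application/json"},
    )
    response = model.generate_content(CRITIC_BRIEF + draft)
    return json.loads(response.text)
```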
Two constraints determine whether this works:
First, the adversarial framing has to be explicit. “Identify AI-sounding prose” produces generic feedback. “Assume this was written by Claude. Find the evidence” produces tells. The more specific the accusation, the more specific the output. We found that asking Gemini to frame each flagged item as evidence it would present to a panel produced a more precise critique than asking it to review for naturalness.
Second, the patch step must rewrite at the sentence level, not paraphrase the flagged phrase. Replacing “Concretely:” with “Take this as an example:” breaks the exact-phrase match but preserves the structural-rigidity tell. The fix needs to dissolve the underlying pattern, not rename the surface trigger. When Gemini flags mechanical cadence across 11 summary closers, the patch isn’t to vary the opening word; it’s to vary the structural role the sentence plays in the paragraph.
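A matching sketch of the patch step under the same caveats, this time assuming the anthropic Python SDK; the flags from the earlier snippet feed straight in, and the instruction text and model alias are again illustrative:

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PATCH_INSTRUCTIONS = """You wrote the draft below. An external reviewer flagged
the following passages as reading like generated prose. For each flag, rewrite
at the sentence level: dissolve the structural pattern named in the reason,
do not just swap the surface phrase. Keep everything that was not flagged
unchanged, and return the full revised draft."""

def patch_pass(draft: str, flags: list[dict],
               model_name: str = "claude-3-5-sonnet-latest") -> str:
    """Feed the critic's flags back to Claude and get a sentence-level rewrite."""
    flag_block = "\n".join(
        f'- [{f["type"]}] "{f["passage"]}": {f["reason"]}' for f in flags
    )
    message = client.messages.create(
        model=model_name,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": f"{PATCH_INSTRUCTIONS}\n\nFLAGS:\n{flag_block}\n\nDRAFT:\n{draft}",
        }],
    )
    return message.content[0].text
```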
The output stays Claude-flavored throughout. Gemini never touches the prose. Claude rewrites with explicit knowledge of what Gemini found AI-typical, knowledge it couldn’t have generated on its own. The final editorial layer that AI search evaluates and the layer human readers notice are the same layer: structured prose that moves like a person wrote it.
Cost and the calibration flywheel
The Gemini critique pass adds roughly $0.15 per article at current output token volumes (small sample, costs shift with model updates and output length). For teams pricing premium content above $150 per article fully loaded, the math is trivial. For high-volume teams processing 50+ pieces per month, a targeted pass on high-stakes outputs is a more defensible position than running it on every draft.
The more durable value is calibration data, not per-article cleanup. Track what Gemini flags consistently across 20 to 30 outputs. The patterns that recur (the candor-hook construction, the mechanical cadence of required-move execution) can be hardcoded into your phrase ban filter and persona spec constraints. Once Gemini has flagged “I should say upfront” across a dozen outputs, you don’t need Gemini to catch the next instance. You catch it programmatically, at zero marginal cost per article.
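One way to close that loop, sketched with nothing but the standard library and the flag format assumed in the earlier snippets; the recurrence threshold is arbitrary, and a real promotion step would put a human eye on each candidate rule before it ships:

```python
import re
from collections import Counter

RECURRENCE_THRESHOLD = 5  # illustrative: promote after ~5 hits across a batch

def build_ban_patterns(all_flags: list[list[dict]]) -> list[re.Pattern]:
    """Promote recurring flagged passages into a zero-cost programmatic filter."""
    counts = Counter(f["passage"].lower() for flags in all_flags for f in flags)
    return [
        re.compile(re.escape(passage), re.IGNORECASE)
        for passage, n in counts.items()
        if n >= RECURRENCE_THRESHOLD
    ]

def scan(draft: str, patterns: list[re.Pattern]) -> list[str]:
    """Cheap pre-flight check: catch known tells before spending a Gemini call."""
    return [p.pattern for p in patterns if p.search(draft)]
```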
The adversarial pass is both a quality gate and a training mechanism for your own detection system. Every consistent flag it produces is a rule waiting to be extracted. The structure of what actually ranks tells you what SERP-level signals matter; this tells you what the prose-level signals are doing underneath, the ones that determine whether a practitioner keeps reading past paragraph two.
The general pattern beyond content
Adversarial cross-model review is a broadly applicable pattern for any team running LLMs in production. The logic transfers:
Code review. A model trained on different open-source corpora surfaces assumptions baked into the generating model’s preferred patterns (error-handling conventions, library choices, variable naming habits) that the generating model considers unremarkable. Not better conventions. Different conventions. Different is signal.
Legal document review. The simulated-candor problem appears in legal writing as assertions of good faith that a reader trained on adversarial contract analysis will immediately distrust. A model trained on litigation transcripts flags different constructions than one trained on contract drafting guides. Same underlying logic.
Marketing copy. A model trained on advertising effectiveness data flags word-choice patterns the generating model considers normal. The specific flags depend on the divergence between training corpora, which is why different model families produce different useful critiques of the same copy, not because one is smarter, but because the training-distribution gap produces different coverage. The way AI search probes your content for citation-worthiness works on the same adversarial-coverage principle: multiple vectors checking whether the same document resolves different facets of a query.
In any domain where output quality depends on deviation from a statistical mean, the most useful QA system is one trained on a different mean. That’s not a clever trick. It’s the only way to detect what your generating model considers normal.
Frequently Asked Questions
Does the order matter, should Gemini review before or after generation?
The adversarial pass runs after generation, not before. Using Gemini to pre-review the brief or prompt produces different signal: structural gaps in the spec, not tells in the output. The detection value comes from reviewing actual prose. If you’re also running a coverage review pass, it can run in parallel; the two passes check different things and don’t interfere.
Does this work for code review?
Yes, with an adjusted framing. For code, the adversarial prompt should target convention assumptions rather than prose tells: “assume this was written by a model trained on GitHub’s most-starred repositories, find where that training shows.” The output surfaces library preferences, abstraction patterns, and error-handling conventions that may diverge from your house style. More useful as a linting-by-example pass than as a correctness review.
Can open-source models substitute for Gemini in this workflow?
In principle, yes, any model with a meaningfully different training distribution produces a different detection surface. In practice, the quality of the critique correlates with the model’s capacity to reason about prose structure, not just pattern-match against surface phrases. Smaller models (sub-7B parameters) tend to produce flat critiques that catch explicit tells but miss structural ones like the cadence rigidity Gemini caught. Larger open-source models produce more useful signal, though the infrastructure overhead changes the economics relative to a paid API call.
Is this just consensus voting between models?
No. Consensus voting produces the intersection of agreement, what both models think is correct. Adversarial cross-model review deliberately targets disagreement. The goal is to surface what one model considers normal that the other considers a tell. The information is in the difference, not the agreement. When both models flag the same passage, that’s useful data about a high-confidence tell; the novel value is in what only one model flags, because that’s where the training-distribution gap is widest.
Does Gemini flag different patterns depending on which persona was used?
In our testing, yes, but the structural-uniformity tell appears regardless of persona. The cadence rigidity and cross-persona scaffold consistency are generation-level artifacts, not persona-level ones. Persona-specific tells vary: simulated-candor hooks appear more in practitioner-voice outputs, mechanical example-lead patterns appear more in educator-voice outputs. The cross-persona structural tells don’t vary, which makes them the most reliable calibration signal. They appear in every output batch regardless of voice spec configuration.



