Every content team I’ve consulted in the last three years has the same document in their Notion. It’s called something like “Brand Voice Guidelines” and it says things like: be approachable, professional, and human. Some add a second bullet: avoid jargon. A few include: inject personality. That document is useless for AI writing. Not useless in general, it describes something real. But “be approachable” gives an LLM nothing it can execute against. There is no programmatic check for approachable.
The problem isn’t the desire to maintain brand voice at scale. That instinct is right. The problem is the tool teams reach for: a traditional style guide written for human writers who can interpret adjectives and apply judgment. AI doesn’t interpret. It predicts the next token based on training distribution. Without executable constraints, it defaults to the average of public-web prose, which is exactly what your brand doesn’t sound like.
Why Adjective-Based Guidelines Break at Scale
I’ve watched this unfold at a fintech startup I worked with in 2022, a founding team with a distinctive, tight-copy voice. Their content lead wrote a solid three-page style guide. When they started using AI for blog output in early 2023, the first six articles were fine. The seventh wasn’t. By the fifteenth, the voice had drifted far enough that a co-founder flagged it in Slack. “This reads like every other SaaS company,” she wrote.
The style guide was intact. They’d even pasted it into the system prompt. It made no difference.
The reason is structural. When a model is trained with reinforcement learning from human feedback, human raters consistently prefer text that sounds clear, professional, and slightly warm, which is exactly what every “approachable” style guide describes. The model already wants to write that way. It doesn’t need permission. Research published at ACM Creativity & Cognition 2024 showed that LLM assistance pulls creative output toward a statistical center: participants working with AI produced work measurably more similar to one another’s than participants in the control group did. Brand voice is a deviation from the mean. Your style guide described that deviation in words. The model couldn’t hold it.
The mistake most operators make is treating voice as a set of adjectives to describe and a list of words to avoid. That’s a starting point, not a spec. A real spec is operational: it gives the writer, human or LLM, concrete actions to take, not a vibe to aim at.
Tone Is Situational. Voice Is Structural.
I’ve seen this conflation cost teams months of rework. The two words get used interchangeably and the distinction matters operationally.
Tone is context-dependent. The same brand might be empathetic in a support interaction, confident in a sales email, and dry in a product changelog. Those are all appropriate, and none of them is “the brand voice.” They’re tonal adaptations within a consistent voice.
Voice is the rhetorical substrate that stays constant across contexts. It’s the sentence length your readers have come to expect. The specific kinds of things you name (tools, years, companies, never vague “some experts”). Whether you open with the subject of the sentence or a transitional adverb. Whether you use fragments. What words your writers reach for versus the ones that feel off even when they’re technically fine. That consistency is voice. Unlike tone, it’s specifiable in terms that can be enforced.
Take two B2B SaaS companies covering the same feature announcement:
Version A: “We’re excited to announce the launch of our new analytics dashboard, which provides a full set of real-time insights designed to help teams make better decisions faster.”
Version B: “New: the analytics dashboard is live. It shows you what’s moving, in real time. No setup required.”
Both are grammatically correct. Both are “professional.” They aren’t the same voice. The writer behind Version B didn’t achieve that by following a rule that said “be concise.” They achieved it because their voice spec includes required moves: product announcements lead with the outcome, not the feature; enthusiasm openers are banned; one-sentence paragraphs are allowed. That’s enforceable. “Be concise” is not.
What a Real Voice Spec Looks Like
The spec I build with teams has four components. Each one produces pass/fail criteria, something a review system can check programmatically, not a vibe a human infers.
Required rhetorical actions. Concrete moves the writer must perform. Not suggestions. Testable, verifiable actions that define how the voice behaves across every section. For the fintech startup I mentioned, the three moves that recovered their founding voice were: (1) every section opens with first-person observation, “I’ve seen this break when…”, not a topic sentence; (2) every section names a specific entity, a real tool, company, or year, never anonymous filler; (3) each article directly names a mistake pattern: “The error most teams make here is…”
Those three moves were recoverable from reading their best existing content. They hadn’t written them down, that’s why the voice evaporated when the team changed and AI tooling arrived.
Banned phrases. Not “avoid jargon.” Specific phrases the brand doesn’t use. The list should be long enough to have teeth and it needs to grow as you observe what AI gravitates toward. A useful list starts with the universal AI-register phrases, the frictionless-everything marketing words and filler intensifiers that signal machine-generated text to any experienced reader, and then adds brand-specific avoidances. For a fintech brand that prizes directness, phrases like “ensure the success of” and “the leading provider of” often make the list. The point isn’t arbitrary vocabulary restriction. Every banned phrase is a guard against the statistical center the model wants to occupy.
Cadence line. One sentence describing the rhythm of the prose. Something like: “Short declaratives, occasional fragment for emphasis, paragraphs open with the subject not a transitional adverb.” This constraint produces recognizable sentence-level patterns across all content the brand ships, the thing readers notice without being able to name.
Worked examples. Three to five sentences that demonstrate the voice in practice, chosen for rhythm rather than topic. Not descriptions of the voice, demonstrations. “I’ve watched six different teams ship the same kind of process. Three failed at the same handoff.” That sentence pair gives a model more useful information about the expected voice than a paragraph describing what the voice should feel like. The model matches rhythm; it doesn’t obey adjectives.
Pull these four components from your best existing content: not from an idealized, aspirational guide, but from the pieces people in your company point to and say “this is us.” That’s your working material. A voice spec extracted from real copy is more accurate than one invented from brand ambitions. See the content brief guide for how this kind of voice context flows into the brief layer before a word is written.
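Here’s what those four components look like captured as data instead of prose, a minimal sketch built from the fintech example above. The keys and entries are illustrative, not a standard schema; the only requirement is that every field is something a review pass can read and check.

```python
# A voice spec as data rather than prose. Keys and example entries are
# illustrative, pulled from the fintech example above.
fintech_spec = {
    "required_moves": [
        "Every section opens with a first-person observation.",
        "Every section names a specific entity: a tool, company, or year.",
        "Each article directly names a mistake pattern.",
    ],
    "banned_phrases": [
        "we're excited to announce",
        "ensure the success of",
        "the leading provider of",
        "seamless",
        "unlock the power of",
    ],
    "cadence": (
        "Short declaratives, occasional fragment for emphasis, "
        "paragraphs open with the subject not a transitional adverb."
    ),
    "worked_examples": [
        "I've watched six different teams ship the same kind of process.",
        "Three failed at the same handoff.",
    ],
}
```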
Where the Spec Lives in Production
Writing the spec is the easy part. I’ve placed specs in different parts of the prompt stack and the position matters as much as the content.
A voice spec injected as a simple preamble to a generation prompt holds for roughly the first 600–800 words of output. Past that, the model’s own recent output dominates the context and it starts reverting. This is consistent with how attention operates in transformer architectures: the constraint was there when generation started, but by section three it is no longer the most recent or most salient signal.
The fix is prompt architecture, not a longer spec: break generation into sections, re-inject the voice constraints at each section boundary, and run a review pass afterward that checks for the required moves and banned phrases. Three separate enforcement points in the same pipeline, not one long system prompt. A single-pass generation with a pasted style guide is how you get the first six articles sounding fine and the fifteenth sounding like everyone else.
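A minimal sketch of that pipeline, assuming the spec is held as a small dict like the one above and that generate() wraps whatever model client you actually use; the call is a placeholder, not any specific vendor’s API. The part that matters is that the constraints travel with every section call, not just the first.

```python
# Section-by-section generation with the voice constraints re-injected at
# every boundary. generate() is a stand-in for your model client.
def build_section_prompt(spec: dict, outline_item: str, prior_text: str) -> str:
    return "\n\n".join([
        "Voice constraints for this section:",
        "Required moves:\n- " + "\n- ".join(spec["required_moves"]),
        "Banned phrases (never use):\n- " + "\n- ".join(spec["banned_phrases"]),
        "Cadence: " + spec["cadence"],
        "Article so far (for continuity):\n" + prior_text[-2000:],
        "Write the next section: " + outline_item,
    ])

def generate_article(spec: dict, outline: list[str], generate) -> str:
    draft = ""
    for item in outline:
        prompt = build_section_prompt(spec, item, draft)
        draft += "\n\n" + generate(prompt)  # one model call per section
    return draft.strip()
```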
The review pass is the piece most teams skip. Without it, you catch banned phrases by eye after the fact, but you can’t reliably audit whether the required rhetorical moves were executed. A pass that checks both, required moves present, banned phrases absent, closes the loop. That check is what separates a voice spec from a style guide. The spec generates measurable pass/fail criteria. A style guide cannot.
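And a sketch of the review pass itself, again assuming the spec fields above. The required-move checks are deliberately crude regex proxies, a first-person opener, a named year, an explicit mistake callout; that’s a floor, not the full rhetorical check, and a real pipeline would pair it with an LLM grader or a human pass for the moves a regex can’t see.

```python
import re

# Banned-phrase seed list; in practice this comes from the spec.
BANNED_PHRASES = [
    "we're excited to announce",
    "ensure the success of",
    "the leading provider of",
]

def review(draft: str) -> dict:
    """Check a draft: banned phrases absent, required moves present."""
    lowered = draft.lower()
    banned_found = [p for p in BANNED_PHRASES if p in lowered]

    sections = [s.strip() for s in draft.split("\n\n") if s.strip()]
    opens_first_person = all(
        re.match(r"(I|I've|My|We've)\b", s) for s in sections
    )
    names_a_year = bool(re.search(r"\b(19|20)\d{2}\b", draft))
    names_a_mistake = bool(re.search(r"\b(mistake|error)\b", lowered))

    return {
        "banned_phrases_found": banned_found,
        "opens_first_person": opens_first_person,
        "names_specific_year": names_a_year,
        "names_mistake_pattern": names_a_mistake,
        "passes": (not banned_found and opens_first_person
                   and names_a_year and names_a_mistake),
    }
```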
The banned-phrase list needs maintenance. AI writing tools evolve; phrase patterns that signaled machine-generated text a few years ago have proliferated enough that they’re now simply average internet writing. Check the list quarterly. The goal isn’t a static document; it’s an accurate map of the statistical center your brand is trying to stay away from.
Why Tone Sliders Don’t Solve This
I’ve tested most of the brand-voice features on the major AI writing platforms. The gap between what they advertise and what they enforce is significant.
Most tools that advertise “brand voice” offer some version of a tone dial. “Formal to casual.” “Technical to accessible.” Sometimes a text box: “paste your style guide here.” Jasper’s brand voice product lets you configure characteristics through 40+ tone attributes. HubSpot’s AI content setup walks you through voice characteristics via an AI-assisted configuration. These are useful onboarding flows. They are not enforcement mechanisms.
A tone attribute like “conversational” produces a statistical shift toward the casual end of the training distribution. That’s not your brand’s specific voice. It’s a direction in a shared space your brand occupies with thousands of other “conversational” brands that all picked the same setting. The output sounds conversational. It doesn’t sound like you.
The gap between a tone slider and a persona spec with required moves and a programmatic review pass is where most content teams are currently losing ground. They got the tool. They didn’t build the constraint system. The output is detectable AI writing in the right general register. That’s not brand voice. That’s commodity content with your label on the jar.
Tools that treat the persona specification as a first-class object, where required moves drive generation, a review pass checks for their presence, and a banned-phrase filter runs programmatically, produce different output than tools that apply a tone dial and ship. The difference isn’t visible in the first paragraph. It accumulates across 15 articles over three months. That’s the test worth running.
The comparison of AI content brief tools covers how different platforms handle voice enforcement at the operational level, which ones treat it as a first-class constraint versus an afterthought setting.
Frequently Asked Questions
How is a voice spec different from a style guide?
A style guide describes how you want your brand to sound using adjectives and general rules. A voice spec specifies concrete rhetorical actions, a banned-phrase list, a rhythm description, and worked examples, all of which can be programmatically checked. The key difference: a style guide produces a document a human interprets. A voice spec produces pass/fail criteria a review system can enforce.
Can we extract a voice spec from existing content?
Yes, and it’s usually more accurate than writing one from scratch. Pull your 5–10 best-performing or most characteristic pieces. Look for the rhetorical moves that recur: do sections open with first-person? Are specific entities named? What phrases appear in your best content but feel wrong in generic AI output? Those observations become the required-moves list. The phrases that don’t appear in your best content but keep showing up in AI output become the banned-phrase seed list.
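One rough way to seed that banned-phrase list: count word trigrams in a batch of unconstrained AI drafts and keep the ones that never appear in your own best content. The sketch below assumes two folders of plain-text files; the folder names and the trigram choice are illustrative, not a prescribed method.

```python
import re
from collections import Counter
from pathlib import Path

def trigrams(text: str) -> Counter:
    # Lowercase word trigrams as a cheap proxy for "phrases."
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(" ".join(words[i:i + 3]) for i in range(len(words) - 2))

def load_corpus(folder: str) -> Counter:
    counts = Counter()
    for path in Path(folder).glob("*.txt"):
        counts += trigrams(path.read_text(encoding="utf-8"))
    return counts

brand = load_corpus("best_content")  # your 5-10 most characteristic pieces
ai = load_corpus("ai_drafts")        # unconstrained drafts on the same topics

# Frequent in the AI drafts, absent from your best work: candidate bans.
candidates = [
    (phrase, count) for phrase, count in ai.most_common(200)
    if count >= 3 and phrase not in brand
]
for phrase, count in candidates[:25]:
    print(f"{count:>3}  {phrase}")
```

The output is a candidate list, not a final one; someone still decides which phrases are genuinely off-brand versus just uncommon.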
Do we need a different voice spec per audience segment?
Voice, the structural substrate, usually stays constant across segments. Tone adapts. A single voice spec can have a cadence line and required moves that work across all segments, while the generation prompt adapts the register per context. If you’re building fundamentally different required-moves lists for different audiences, you may have a positioning problem rather than a voice spec problem.
How often do voice specs need updating?
The required-moves list and worked examples are relatively stable: update them when your brand positioning shifts, or when you notice consistent drift between the spec and your best new content. The banned-phrase list needs quarterly attention. Phrases that were distinctive AI tells a few years ago have proliferated to the point where they’re now just average internet writing. The statistical center you’re steering away from is a moving target.
What’s the minimum viable voice spec for a team just starting?
Two required moves, five banned phrases, one cadence line, and two worked example sentences. That’s enough to produce measurably different output from an unconstrained LLM prompt. You don’t need a complete spec before you start, you need enough constraints to generate a first pass, then review that pass for drift, then update the spec based on what you observe. Voice spec maintenance is iterative, not a one-time document.



