Google just released Gemini 3.1 Flash TTS, and it's a meaningful step forward for developers building voice-enabled applications. The headline feature: audio tags that let you control vocal style, pacing, and delivery using natural language prompts embedded directly in your text.
This isn't just "make it sound excited." We're talking about scene direction, speaker-specific notes, and mid-sentence emotional pivots—all controllable through text prompts.
I've reviewed the announcement and developer documentation. Here's what's new, how the audio tags work, and a practical example you can adapt for your own projects.
What's New in Gemini 3.1 Flash TTS
1. Better Speech Quality (By the Numbers)
Google claims this is their "most natural and expressive model to date." On the Artificial Analysis TTS leaderboard—which aggregates thousands of blind human preference tests—Gemini 3.1 Flash TTS scored an Elo rating of 1,211, placing it in the "most attractive quadrant" for quality-to-price ratio.
In practical terms: the voice sounds less robotic, with better prosody and fewer awkward pauses.
2. Audio Tags: The Real Innovation
This is where things get interesting. Audio tags let you embed directorial instructions directly into your input text. Think of it like stage directions in a script:
[Scene: A cozy coffee shop, soft jazz playing in the background][Narrator: Warm, conversational tone, moderate pace]"Welcome to the show. Today we're talking about..."[Pause: 2 seconds][Narrator: Leaning in, slightly more intense]"...and this is where things get controversial."
The model parses these tags and adjusts the vocal performance accordingly.
3. Three Levels of Control
Google's documentation breaks audio tags into three categories:
Scene Direction – Set the environment and context. This helps characters stay "in-character" across multiple turns of dialogue.
Speaker-Level Specificity – Define Audio Profiles for each character (voice, accent, baseline tone), then use Director's Notes to adjust pace, tone, and accent dynamically.
Inline Tags – Pivot mid-sentence. A character can start calm and become agitated without breaking the flow.
4. Global Scale + Safety
- 70+ languages supported with localized accent and pacing controls
- SynthID watermarking embedded in all output audio to detect AI-generated content
- Available via Google AI Studio, Vertex AI, and Google Vids
A Real-World Example: Building an Interactive Audiobook Player
Let's build something concrete. I'll walk through a Node.js example that generates multi-character audiobook chapters with dynamic voice direction.
Project Setup
bashmkdir gemini-audiobook-democd gemini-audiobook-demonpm init -ynpm install @google/generative-ai dotenv
Create a .env file with your Gemini API key:
GEMINI_API_KEY=your_api_key_here
The Code
Here's a working example that generates a two-character dialogue scene:
javascript// index.jsrequire('dotenv').config();const { GoogleGenerativeAI } = require('@google/generative-ai');const fs = require('fs');const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);async function generateAudiobookScene() {const model = genAI.getGenerativeModel({model: 'gemini-3.1-flash-tts-preview'});const script = `[Scene: A dimly lit detective's office, rain tapping against the window][Detective Morgan: Gravelly voice, slow pace, weary tone]"Look, I've seen this story before. Someone walks in,thinks they can outrun their past..."[Pause: 1.5 seconds][Detective Morgan: Leaning forward, more intense]"...but the past always catches up."[Client: Younger voice, nervous, slightly faster pace]"That's why I'm here. I need your help."`.trim();const result = await model.generateSpeech(script, {voice: 'en-US-Standard-A', // Base voice profileoutputFormat: 'mp3',});// Save the audio fileconst audioBuffer = result.audio;fs.writeFileSync('chapter1-scene1.mp3', audioBuffer);console.log('✓ Audio generated: chapter1-scene1.mp3');}generateAudiobookScene().catch(console.error);
What This Does
- Sets the scene – The
[Scene: ...]tag establishes context for the model - Defines character voices – Each speaker gets a profile (gravelly vs. young, slow vs. fast)
- Adds dramatic pauses –
[Pause: 1.5 seconds]creates natural breathing room - Shifts tone mid-dialogue – The detective goes from weary to intense
Extending the Example
For a full audiobook, you'd:
- Create a JSON file mapping characters to voice profiles
- Parse chapters into scenes with tagged dialogue
- Batch-generate audio files per scene
- Concatenate with
ffmpegfor final chapter output
javascript// characters.json{"detective-morgan": {"voice": "en-US-Standard-B","baseline": "gravelly, slow, weary"},"client": {"voice": "en-US-Standard-C","baseline": "young, nervous, faster"}}
Other Use Cases Worth Exploring
1. Dynamic Customer Service IVR
Replace robotic phone menus with context-aware voice responses. Tag urgent messages with [Tone: Urgent, pace: faster] and friendly confirmations with [Tone: Warm, pace: relaxed].
2. Language Learning Apps
Generate dialogue with adjustable pacing: [Pace: Slow, clear enunciation] for beginners, [Pace: Native, casual] for advanced learners.
3. Accessibility Narration
Create expressive read-aloud experiences for ebooks or articles. Use tags to emphasize key points: [Emphasis: Strong] or [Pause: 3 seconds] before important concepts.
4. Podcast Intros/Outros
Generate consistent branded intros with scene-setting: [Scene: Upbeat tech podcast, energetic host] and export the settings for reuse across episodes.
Pricing and Availability
Gemini 3.1 Flash TTS is available now in preview:
- Google AI Studio: Free tier available, then pay-per-use
- Vertex AI: Enterprise pricing (contact Google Cloud)
- Google Vids: Included for Workspace users
Pricing details weren't announced at launch, but Google positioned the model in the "high quality, low cost" quadrant on the Artificial Analysis benchmark.
Limitations to Keep in Mind
- Preview status – APIs may change before GA
- Audio-only output – The model generates audio files, not real-time streaming (as of this writing)
- Tag parsing is experimental – Complex or conflicting tags may produce unpredictable results
- SynthID watermark – Required for all output; cannot be disabled
Getting Started
- Head to Google AI Studio
- Select
gemini-3.1-flash-tts-previewas your model - Experiment with audio tags in the playground
- Export your settings as API code once you've dialed in the voice
The Bottom Line
Gemini 3.1 Flash TTS is the first text-to-speech model that feels less like "reading text aloud" and more like "directing a voice performance." The audio tags are intuitive, the quality is competitive, and the global language support makes it viable for international projects.
For developers building audiobooks, voice assistants, or accessibility tools, this is worth a serious look. The preview is free to experiment with—spend an afternoon building a prototype and see if the creative control matches your needs.
Have you tested Gemini 3.1 Flash TTS? I'm curious to hear how the audio tags perform in production workloads. Drop a note with your experience.

Comments