
Gemini 3.1 Flash TTS for Next.js: ship voice UX in 15 min (2026)
Google's new gemini-3.1-flash-tts-preview adds 200+ audio tags, 70+ languages and multi-speaker output for $20/M tokens — wire it into a Next.js server action.
What's new this week
On April 15, 2026, Google shipped Gemini 3.1 Flash TTS as a public preview on AI Studio and Vertex AI. The model ID is gemini-3.1-flash-tts-preview, and it introduces 200+ inline audio tags (for example [whispers], [happy], [pause]), 30 prebuilt voices, native multi-speaker dialogue, and coverage across 70+ languages. The free tier is open for prototyping; paid usage is $1 per million text input tokens and $20 per million audio output tokens — roughly an order of magnitude cheaper than ElevenLabs for a comparable volume of generated speech. Output is 24 kHz mono PCM, returned inline as base64, so there is no webhook dance and no separate voice-studio account to manage.
Why it matters for builders
Web engineer. Previously, adding a "listen to this article" button to a Next.js blog meant wiring ElevenLabs or patching tts-1-hd with custom SSML to fix prosody. With Flash TTS, a single generateContent call and one inline [slow] or [excited] tag produces the same emotional pacing — no SSML build step, no separate voice studio, and the audio streams back as base64 PCM you can buffer straight into an <audio> element. The full call lives in a server action, so API keys stay on the server.
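An <audio> element won't play raw PCM directly; it needs a container. A minimal sketch of prepending a 44-byte WAV header to the buffer the server action returns — the 16-bit little-endian sample format is an assumption (the preview docs above only say 24 kHz mono PCM), so verify against your actual output:

```typescript
// Wrap raw 24 kHz mono PCM in a 44-byte WAV header so a browser
// <audio> element (via a Blob URL) can play it directly.
// Assumes 16-bit little-endian samples.
function pcmToWav(
  pcm: Buffer,
  sampleRate = 24000,
  channels = 1,
  bitDepth = 16
): Buffer {
  const byteRate = (sampleRate * channels * bitDepth) / 8;
  const blockAlign = (channels * bitDepth) / 8;
  const header = Buffer.alloc(44);
  header.write("RIFF", 0);
  header.writeUInt32LE(36 + pcm.length, 4); // RIFF chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt subchunk size
  header.writeUInt16LE(1, 20);              // format 1 = PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitDepth, 34);
  header.write("data", 36);
  header.writeUInt32LE(pcm.length, 40);     // data subchunk size
  return Buffer.concat([header, pcm]);
}
```

On the client, `new Blob([wav], { type: "audio/wav" })` plus `URL.createObjectURL` gives you a `src` for the `<audio>` element.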
AI engineer. Building a voice agent that reads CRM notes aloud used to need two inference passes (LLM then TTS) plus manual speaker diarization. Flash TTS accepts multi-speaker transcripts inline: your LLM emits Joe: ... Jane: ... and TTS returns one WAV with two distinct voices. Your agent graph loses a node, latency drops by a full network round-trip, and you skip maintaining a separate speaker-labelling prompt that drifts every model upgrade.
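Mapping each transcript label to a distinct prebuilt voice is a config concern. A sketch of building that config object — the field names (multiSpeakerVoiceConfig, speakerVoiceConfigs, speaker) follow the shape the @google/genai SDK exposes for multi-speaker speech, and "Puck" is used here as an example second voice; verify both against your SDK version:

```typescript
// Map speaker labels emitted by the LLM (e.g. "Joe:", "Jane:") to
// prebuilt voice names, producing the config passed to generateContent.
function buildDialogueConfig(speakers: Record<string, string>) {
  return {
    responseModalities: ["AUDIO"],
    speechConfig: {
      multiSpeakerVoiceConfig: {
        speakerVoiceConfigs: Object.entries(speakers).map(
          ([speaker, voiceName]) => ({
            speaker,
            voiceConfig: { prebuiltVoiceConfig: { voiceName } },
          })
        ),
      },
    },
  };
}

// Usage: pass alongside the transcript your LLM emitted.
const config = buildDialogueConfig({ Joe: "Kore", Jane: "Puck" });
```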
Indie maker. A Duolingo-style pronunciation app priced at $4/month barely broke even on Azure TTS at roughly $0.05 per lesson. At $20 per million output audio tokens, a 30-second lesson now costs about $0.003 — gross margin holds above 90% on the same $4 tier. Voice is no longer the line item that kills your side project's unit economics, which means TTS-heavy features like audiobook summaries, podcast previews, or accessibility narration finally pencil out on a free-tier SaaS.
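The $0.003-per-30-second figure implies roughly 5 audio output tokens per second of speech. That rate is an assumption here (take your real number from the usage dashboard), but it makes the margin math easy to rerun for any clip length:

```typescript
// Back-of-envelope clip cost. tokensPerSecond is an assumption you
// should calibrate from real usage; $20/M output tokens is the
// preview's paid rate quoted above.
function clipCostUSD(
  seconds: number,
  tokensPerSecond: number,
  usdPerMillionTokens = 20
): number {
  return (seconds * tokensPerSecond * usdPerMillionTokens) / 1_000_000;
}

// A 30-second lesson at ~5 tokens/sec:
const lessonCost = clipCostUSD(30, 5); // ≈ $0.003
```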
Hands-on: try it in under 15 minutes
Grab a free API key from aistudio.google.com, store it as GEMINI_API_KEY, then install the SDK:
```bash
npm install @google/genai wav
```

Minimal Node/TypeScript call wrapped as a Next.js 16 server action:
```typescript
"use server";

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY! });

export async function synthesize(text: string, voice = "Kore") {
  const res = await ai.models.generateContent({
    model: "gemini-3.1-flash-tts-preview",
    contents: [{ parts: [{ text }] }],
    config: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName: voice } },
      },
    },
  });
  const b64 = res.candidates![0].content!.parts![0].inlineData!.data!;
  return Buffer.from(b64, "base64"); // 24 kHz mono PCM
}
```

Example invocation:

```typescript
await synthesize(
  "Say warmly: [slow] Welcome back, Alex. [happy] You crushed this week."
);
```

The inline [slow] and [happy] tags steer pacing and emotion mid-sentence — no separate prosody config. Tags must live inside square brackets, separated by text or punctuation: two adjacent tags will error. For a two-person podcast intro via cURL:
```bash
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent" \
  -H "x-goog-api-key: $GEMINI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents":[{"parts":[{"text":"TTS between Joe and Jane: Joe: [excited] The feed dropped. Jane: [amused] Took long enough."}]}],
    "generationConfig":{"responseModalities":["AUDIO"]}
  }' \
| jq -r '.candidates[0].content.parts[0].inlineData.data' \
| base64 --decode > podcast.pcm
```

The REST response is JSON with the audio base64-encoded inline, so the example pipes it through jq and base64 before writing raw PCM to disk. Pipe that through ffmpeg -f s16le -ar 24000 -ac 1 -i podcast.pcm -b:a 64k podcast.mp3 if you need a smaller transfer (16-bit samples assumed). Preview rate limits follow Gemini Flash defaults (10 RPM on free, 1,000 RPM on paid) — fine for prototyping. For production, queue synthesis in a BullMQ worker and cache finished clips in S3 or R2 keyed by a hash of text + voice + tagSet; a cache hit rate above 60% on a changelog feature is common. One caveat: preview model IDs have been renamed twice in the Gemini 3.1 family this quarter, so read the exact ID from an env var rather than hard-coding it.
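The cache key the paragraph above describes can be a plain content hash. A sketch, assuming the tag set is derived from the text itself and the key doubles as the S3/R2 object name:

```typescript
import { createHash } from "node:crypto";

// Deterministic cache key for synthesized clips: identical
// text + voice + tag-set requests hit object storage instead of
// re-billing the TTS API.
function clipCacheKey(text: string, voice: string): string {
  // Extract inline audio tags like [happy] so the key is explicit
  // about which tags shaped the clip.
  const tagSet = [...text.matchAll(/\[([a-z]+)\]/g)].map((m) => m[1]).sort();
  return createHash("sha256")
    .update(JSON.stringify({ text, voice, tagSet }))
    .digest("hex");
}
```

Check the key before synthesizing; on a miss, synthesize, upload, and serve from storage afterwards.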
How it compares to alternatives
| | Gemini 3.1 Flash TTS | OpenAI gpt-4o-mini-tts | ElevenLabs Flash v2.5 |
|---|---|---|---|
| Starts at | Free tier; $1/M text + $20/M audio tokens paid | $0.60 per 1M input chars, no free tier | $5/mo Starter (30k credits) |
| Best for | Multilingual, expressive multi-speaker narration | Low-latency voice replies inside GPT apps | Cloned brand voices, audiobook production |
| Key limit | Preview only — no SLA, model ID may change before GA | ~8 voices, fewer expressive tags | Per-character billing scales fast on free-tier SaaS |
| Integration | @google/genai SDK, Vertex AI, REST/cURL | OpenAI SDK, streaming WebSocket | REST API, WebSocket streaming, native SDK |
Try it this week
Pick one text-heavy screen in your product — an onboarding intro, a weekly changelog entry, a lesson summary — and wire Flash TTS behind a "Play" button. Ship it behind a feature flag so you can A/B the voice UX on 10% of sessions, then compare time-on-page and replay counts; if replay rate clears 15%, keep it and expand to every long-form page. For wider context on where the Gemini stack sits today, read our Gemma 4 review and the Q1 2026 Web+AI recap for the pricing shifts since January.
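The 10% rollout doesn't need a flag service; a hash of a stable session ID buckets deterministically, so a session always sees the same variant. A sketch, assuming your sessions carry such an ID:

```typescript
import { createHash } from "node:crypto";

// Deterministic percentage rollout: the same sessionId always lands
// in the same bucket, with no stored flag state.
function inVoiceExperiment(sessionId: string, percent = 10): boolean {
  const digest = createHash("sha256").update(sessionId).digest();
  // First two bytes give a roughly uniform 0-65535 value; mod 100
  // maps it to a percentage bucket.
  return digest.readUInt16BE(0) % 100 < percent;
}
```

Gate the "Play" button render on this, and log the bucket alongside time-on-page and replay events so the 15% replay-rate threshold is measurable per variant.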