Complete Beginner's Guide to Voice AI
A look at voice AI categories, use cases, and major players.
A prior version of this article first appeared as a guest post on AI Supremacy.
When people mention generative AI, they almost always talk about large language models, followed by image and video AI tools.
Voice AI? Not so much.
But while we’re all distracted by smooth-talking chatbots and flashy visuals, voice has quietly become one of the more practical AI applications.
Every mainstream AI chatbot platform now comes with a voice chat feature, your meetings transcribe themselves thanks to AI notetakers, and you can dub a video into dozens of languages without having to re-record it.
All of that is thanks to voice AI.
We’re on a path toward ambient computing, and voice is critical in making it happen, because it’s a far more natural way to communicate with devices that surround us.
So let’s break down what voice AI is, what it can do, and the main players in the field.
What is voice AI?
While you can find several overlapping definitions, I’ll use “voice AI” to refer to any AI model that can capture, process, and reproduce speech (often all three).
Basic tools can transcribe your voice meetings, while the more advanced ones can hold entire lifelike conversations.
So “voice AI” is less of a single technology and more of an umbrella term for tools that make working with speech possible.
Where do we stand with AI voice?
Most of us likely still associate voice AI with the canned, robotic assistants like Siri et al.
But things have been moving fast over the past couple of years.
For one, every major chatbot now has a voice mode for real-time conversations:
ChatGPT (OpenAI)
Claude (Anthropic)
Gemini (Google)
Grok (xAI)
Le Chat (Mistral)
Microsoft Copilot (Microsoft)
Perplexity (Perplexity)
If you haven’t already, trying a voice chat on your favorite platform is one of the quickest ways to experience what these models are capable of.
In almost every case, you can activate voice mode via either the microphone or the waveform icon to the right of the chatbox.
Some platforms like Claude or Gemini only enable voice mode on the mobile app, while ChatGPT lets you talk on the desktop, too.
If you want to experience the best of the bunch, I’d go with ChatGPT. Most other voice modes chain together several modules behind the scenes:
Speech-to-text to capture your speech and feed it to a large language model (LLM)
The LLM itself to understand the text and formulate a response
Text-to-speech to output the LLM’s response as audio
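If you’re curious what that cascade looks like in code, here’s a minimal sketch in Python, assuming OpenAI’s official SDK and an API key in your environment. The model names (whisper-1, gpt-4o-mini, tts-1) are illustrative, and a real voice mode would stream audio in both directions rather than shuttle whole files around:

```python
# A toy "voice mode" built as a cascade: transcribe -> think -> speak.
# Assumes the OpenAI Python SDK (pip install openai) with OPENAI_API_KEY set;
# model names are illustrative and may change.
from openai import OpenAI

client = OpenAI()

def voice_turn(audio_path: str, reply_path: str = "reply.mp3") -> str:
    # 1. Speech-to-text: turn the user's recording into plain text.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. LLM: read the transcript and formulate a reply.
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = chat.choices[0].message.content

    # 3. Text-to-speech: render the reply as audio for playback.
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    with open(reply_path, "wb") as out:
        out.write(speech.content)

    return reply_text
```

Every hop in that chain adds a little latency, which is part of why cascaded voice modes can feel laggy.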
But ChatGPT’s “advanced voice mode” is natively multimodal, meaning it can understand and generate speech directly. Try some of these:
Start speaking to interrupt it mid-response
Ask it to speak slower or faster
Ask it to mimic an accent
Ask it to interpret your tone of voice and whether you’re loud or quiet
To see how far voice AI has come, check out this demo from ElevenLabs:
You can even test their text-to-speech features for free by visiting elevenlabs.io.
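For the more hands-on among you, the same text-to-speech capability is available over a simple API. Here’s a minimal sketch against ElevenLabs’ REST text-to-speech endpoint; the API key and voice ID are placeholders from your own account, and the model name is illustrative:

```python
# Minimal ElevenLabs text-to-speech call (sketch, not production code).
# Assumes an API key from your elevenlabs.io account; the voice ID is a
# placeholder you can swap for any voice listed under GET /v1/voices.
import requests

API_KEY = "your-xi-api-key"
VOICE_ID = "your-voice-id"

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "We have come a long way from monotone voice assistants.",
        "model_id": "eleven_multilingual_v2",  # illustrative model name
    },
)
response.raise_for_status()

# The endpoint responds with audio bytes (MP3 by default).
with open("sample.mp3", "wb") as f:
    f.write(response.content)
```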
Or how about a free voice chat with Hume’s impressive EVI 3 model that can detect and respond to emotional cues over at demo.hume.ai? (You can design a voice from scratch or pick one of their presets.)
We’ve come a long way from the monotone voice assistants of old, haven’t we?
Top use cases of voice AI
The following voice AI elements are the building blocks of most real-world use cases:
Voice capture (speech-to-text): Converts spoken words into text. Used by transcription tools, voice assistants, and meeting notetakers.
Voice generation (text-to-speech): Speaks typed text out loud. Used for voiceovers, audiobook narration, and accessibility features.
Conversational AI (speech-to-speech): Handles real-time back-and-forth conversations. Used by chatbot voice modes and customer service agents.
Voice cloning and personalization: Creates or modifies custom voices based on real-world input or parameter tweaking. Used for avatars, brand videos, and other personalized content. (I tested 10 voice cloners here.)
Localization (dubbing and translation): Translates or dubs content into other languages, often while preserving tone and delivery. Used for podcasts, training videos, and other content with a worldwide audience.
Post-editing tools: Cleans up or tweaks voice recordings. Stuff like noise reduction, tonal tweaks, or voice enhancement.
In practice, most use cases blend several of these capabilities. For instance, an advanced voice chat may use both speech-to-text and text-to-speech to support real-time conversations, and localization tools might combine voice capture and translation with voice cloning to create dubbed versions that sound like the original speaker.
With that in mind, let’s look at some of the use cases for voice AI.
1. Transcription & meeting notetakers
This is likely the most mainstream use case. There are dozens of tools that can record your meetings, transcribe them, and generate a summary with action items. They typically support all major online meeting platforms like Google Meet, Microsoft Teams, and Zoom. Many of these AI meeting notetakers also come with handy Chrome extensions or smartphone apps.
2. Conversational AI
We already covered this category in the intro, but it’s worth noting as a standalone use case. Voice modes in major chatbots let you have natural conversations with large language models, so you can ask questions, brainstorm ideas, and handle any other LLM-based task via voice instead of text.
For many people, conversational AI is becoming the preferred way to interact with chatbots, especially on the go or on a smartphone, where typing isn’t optimal.
Examples: Voice mode in any chatbot and Hume AI’s EVI
3. Real-time voice agents
AI voice agents are a subset of conversational AI that businesses or individuals can deploy for their customers. These agents can handle customer service calls, accept bookings, field sales queries, and more. They are typically trained on the company’s internal knowledge to better handle context and can escalate to a person when needed.
If you’ve ever called a business and chatted to a bot that seemed to actually understand you, it was likely an AI voice agent.
4. Narration & audiobooks
As you’ve seen with the ElevenLabs demo, text-to-speech models are becoming increasingly lifelike and capable of replicating nuanced emotions, accents, and more. This makes them ideal for narrating audiobooks, creating podcasts, or reading out YouTube voiceovers.
For indie authors and creators, these tools can be a cost-effective way to turn text or scripts into polished narration.
Examples: ElevenLabs and Murf.ai
5. Accessibility (screen reading & voice control)
For vision-impaired users, voice AI can be a true lifeline. AI-powered screen readers can describe images, navigate interfaces, and read web content aloud. At the other end of the spectrum, speech recognition lets users control devices hands-free. Combine these features with the vision capabilities of chatbots like Gemini or ChatGPT, and voice assistants can even help blind users navigate the real world.
Examples: Be My Eyes and Seeing AI
6. Voice cloning & custom voices
Want to create a branded voice for your company or an AI narrator that sounds just like you for video presentations? Voice cloning and custom voice creation tools let you train AI on sample recordings or design new voices from scratch.
These can be used for anything from personal avatars to corporate presentations narrated by an AI version of a company representative.
Examples: ElevenLabs and Resemble AI
7. Dubbing & localization
These days, taking any video or audio content in one language and localizing it for a different audience is becoming trivial. Many tools can dub speech into a different language while retaining the original speaker’s voice, complete with lip-syncing. Such tools aren’t flawless, but they offer an efficient way for solo creators to scale their content globally with minimal effort.
8. Live translation & interpretation
Even more impressively, we’re starting to see tools that can handle such dubbing in real time, which lets them act as AI interpreters for live online meetings. Two participants speaking different languages can hear the other person “speak” in their original voice but in the listener’s language.
These have the potential to truly bridge language barriers and help multilingual teams collaborate across the globe. While they’re still in experimental stages, I expect these tools to rapidly become more polished and usable.
Examples: Google Meet Speech Translation and Microsoft Teams Interpreter
9. Language tutors & learning chatbots
Voice AI has gotten good enough to support a whole range of educational chatbots. Traditional platforms like Duolingo are experimenting with real-time voice features. But that’s just scratching the surface. We are also getting dedicated language tutor apps and sites that help you improve communication skills, all powered by voice AI.
10. Brand & gaming avatars
Voice AI also powers video avatars that brands use for training videos and presentations. These can deliver scripted content with realistic facial movements and expressive voices.
In the gaming world, voice AI lets developers create immersive interactions with non-player characters that have unique personalities and can act in non-scripted ways.
It’s still early days, but things are getting better all the time.
Key players in voice AI
The voice AI space is crowded!
For one, most heavyweights in generative AI have their own in-house voice models:
Alibaba (Qwen3 ASR)
Amazon (Polly and Transcribe)
Google (Text-to-Speech / Speech-to-Text)
Microsoft (Azure AI Speech)
NVIDIA (Riva)
OpenAI (Whisper)
But for our list, let’s look at the more specialized players that focus on or incorporate voice AI:
Agents & conversational AI
These let you build AI assistants that can hold natural conversations like customer calls, in-game dialogue, and more.
Convai: Plug-and-play interactive characters for video games and virtual worlds.
Hume AI: Emotion-aware voice models for expressive, empathic conversational agents.
PolyAI: Enterprise-grade voice agents that handle customer calls with natural speech.
Retell AI: Tools to build, test, and deploy production-ready AI voice agents.
Synthflow: No-code deployment of enterprise-ready voice agents for automated phone calls.
Vapi: Developer platform to build, run, and monitor real-time voice agents at scale.
Voiceflow: Workspace to design and deploy chat and voice assistants for product teams.
Dubbing & avatars
These tools handle dubbing, lip-syncing, and avatar generation for training videos, marketing content, and more.
Deepdub: AI dubbing and voiceovers with studio tools in a unified platform.
HeyGen: Custom video avatars with lip-sync and multi-language dubbing.
Rask: AI-powered audio and video dubbing via the app or API.
Synthesia: Business-friendly AI presenters with custom voiceovers in 140+ languages.
Notetakers & transcription
Tools that join your meetings, transcribe them, and generate summaries automatically.
Fathom: Free AI notetaker that records, transcribes, and summarizes Google Meet, Microsoft Teams, and Zoom meetings.
Fireflies: Meeting bot that summarizes your calls and creates action items across 100+ languages.
Notta: Cross-platform AI notetaker that delivers searchable transcripts with support for collaborative edits.
Otter: One of the most popular notetakers for all major meeting platforms, also available as a smartphone app.
tl;dv: Multilingual notetaker that generates summaries, drafts follow-up emails, and updates your CRM.
Speech generation & voice cloning
These platforms turn text into lifelike speech or clone existing voices to create personalized content and narration.
Altered: Real-time speech-to-speech voice changing and cloning for creators and streamers.
ElevenLabs: One of the most well-known players with studio-quality multilingual TTS and instant voice cloning for narrations and characters.
LOVO: Creator-oriented TTS marketplace with a huge catalog of AI voices and tools.
Murf AI: All-in-one voiceover studio with editable TTS, dubbing, cloning, and more.
NaturalReader: Text reader that turns webpages, PDFs, and docs into human-sounding audio.
PlayHT: High-fidelity TTS and voice cloning with fine control over emotion, pacing, etc.
ReadSpeaker: Accessibility-grade TTS for education and enterprise platforms.
Resemble AI: Custom brand voices and deepfake detection tools for security-conscious enterprises.
Respeecher: Production-grade voice cloning for film, TV, and game dialogue.
Speechify: Popular read-aloud app with lifelike voices and cross-device listening.
Voicemod: Real-time voice changer and soundboard for games, streams, and calls.
WellSaid Labs: Enterprise-ready AI voices created with real actors.
Voice infrastructure (APIs / SDKs)
These platforms provide tools and voice models for developers to build their own voice-powered applications.
AssemblyAI: Accurate speech-to-text API with speaker identification, auto-formatting, and more.
Cartesia: Low-latency speech generation and cloning for interactive voice apps.
Deepgram: Unified speech platform with streaming ASR, TTS, voice agents, and audio intelligence.
Picovoice: On-device voice AI for offline use, including text-to-speech, speech-to-text, and more.
Rev AI: Developer-friendly ASR with highly accurate transcription across domains.
Soniox: Real-time multilingual speech AI for transcription and translation in 60+ languages.
SoundHound AI: A platform for building and deploying enterprise AI voice agents.
Speechmatics: Low-latency speech-to-text models for multilingual, multi-speaker conversations.
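To give you a feel for what building on this infrastructure looks like, here’s a minimal sketch of a one-shot transcription call against Deepgram’s pre-recorded /v1/listen endpoint; the API key is a placeholder, and the model name and response shape follow their public docs but may change:

```python
# One-shot speech-to-text with Deepgram's pre-recorded /v1/listen endpoint (sketch).
# Assumes a Deepgram API key; "nova-2" is an illustrative model name.
import requests

API_KEY = "your-deepgram-api-key"

with open("meeting.wav", "rb") as f:
    audio = f.read()

response = requests.post(
    "https://api.deepgram.com/v1/listen?model=nova-2&smart_format=true",
    headers={"Authorization": f"Token {API_KEY}", "Content-Type": "audio/wav"},
    data=audio,
)
response.raise_for_status()

# The transcript sits under results -> channels -> alternatives in the JSON.
result = response.json()
print(result["results"]["channels"][0]["alternatives"][0]["transcript"])
```

Most speech-to-text APIs on this list follow the same broad pattern: send audio, get structured JSON back.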
Voice processing
These tools help with post-processing by cleaning up audio, enhancing speech, and more.
Adobe Podcast: A collection of tools to enhance speech, clean up audio, remove background noise, and more.
Descript: Edit podcasts and videos by editing the script; overdub voices, remove filler words, etc.
Krisp: Real-time meeting assistant that removes echo, noise, and cross-talk while also handling transcription and call notes.
Where are we heading?
As with most generative AI, voice AI tools are improving quickly. What feels impressive today might be old news in six months.
Soon, I expect to see tools that are even better at detecting emotion and handling real-time translation. Voice agents will be capable of handling increasingly complex tasks without having to escalate to humans.
OpenAI is doubling down on voice and releasing cheaper and faster conversational models, while companies like Hume are pushing emotion-aware voice models that can pick up on your tone as well as what you’re saying.
We’re also likely to see more offline voice AI options. Right now, most tools need an Internet connection, but on-device models are getting better. They work without WiFi, provide faster responses, and are the obvious choice for privacy-conscious companies.
If you’re new to all of this, jump into a voice conversation with your favorite chatbot. It’s the quickest way to get a taste of what voice AI can do today.
Then remember that it’s only going to get better from here on out.
🫵 Over to you…
Have you tried voice AI in your own personal or work life? What voice AI categories or use cases are the most relevant to you?
Leave a comment or drop me a line at whytryai@substack.com.
Thanks for reading!
If you enjoy my writing, here’s how you can help:
❤️Like this post if it resonates with you.
🔄Share it to help others discover this newsletter.
🗣️Comment below—I love hearing your opinions.
Why Try AI is a passion project, and I’m grateful to those who help keep it going. If you’d like to support my work and unlock cool perks, consider a paid subscription:


