Great summary. I remember how bad voice-to-text was around 2010 and how useless I felt the tech was back then. Fast forward to today and I will frequently use Jippity Voice as a sounding board for thinking out loud, or for taking notes, or for creating an ultra-fast outline for something I want to write. It's silly useful for all those use cases, and prior to 2024 or so, I had a much harder time getting what was in my brain out and into the wider world. Voice helps so much.
Lately, Jippity lets you see the text and images on the screen while using Voice from your phone. I have gotten so used to just using it while I walk or wash dishes or whatever that I haven't really taken advantage of this new form of computing yet.
I had you in mind as one of the most obvious power users of AI voice when writing this piece. I use it here and there sporadically, but it's my understanding that you have truly embraced the "walking around and bouncing ideas around with AI while talking" paradigm. Soon, your approach will likely become the default for many people.
I actually enjoy the new voice+text mode a lot. I'm quite visual (in terms of text), so when asking for outlines or other structured output via voice, I always felt it wasn't very efficient (having to listen to it read out every bullet point). I even talked about solutions to this here: https://www.whytryai.com/i/174244808/turn-voice-ramblings-into-structured-outlines
But the voice+text hybrid seems to be the sweet spot.
Have you tried just interrupting voice mode? I do that like 80% of the time. It took a while to get used to - it just felt rude at first. Now it just seems like computing.
Yeah, of course. The ability to interrupt Jippity was what sold me on the advanced voice mode in the first place. Before that, I wouldn't touch voice mode at all. Now I interrupt liberally. Still, when I ask for structured output, I absorb it much better when I see it written out rather than spoken, which is why the voice+text hybrid works so well.
Yeah, that makes sense. There is something special about seeing and hearing at the same time, like a classroom lecture. I need to experience this sort of almost "video chat" - it's a different way to compute/think.
I'm so glad for the hands-free thing, but I would LOVE to be able to prompt actions from voice. I think that's coming soon.
My wife is peak pragmatist - super efficient - you won’t catch her playing around with tech for kicks; she don’t care. So when she commented that Siri was all of a sudden sounding like a real person, and now she’s talking to her phone, I thought: aha, progress.
The glue for a lot of this - and the breakthrough that still amazes me - is Natural Language Processing, which has pretty much been subsumed by LLMs now. There’s so much nuance to language, and even more so to spoken language, that going beyond the basics like speech-to-text to actually *understanding* the intent is magic along the same lines as neural nets.
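To make that concrete, here's a toy sketch of the "understanding" step - handing a raw transcript to an LLM and getting structured intent back. It assumes the OpenAI Python SDK (v1.x); the intent labels and prompt are made up for illustration:

```python
# Toy example: using an LLM to pull structured intent out of a raw
# voice transcript - the step classic NLP pipelines spent decades on.
# Assumes the OpenAI Python SDK (v1.x); intent labels are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

transcript = "uh, can you remind me to call the dentist tomorrow morning?"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": (
            "Classify the user's intent as one of: set_reminder, "
            "play_music, ask_question, other. Reply with JSON: "
            '{"intent": ..., "details": ...}'
        )},
        {"role": "user", "content": transcript},
    ],
)
print(response.choices[0].message.content)
# e.g. {"intent": "set_reminder", "details": "call the dentist tomorrow morning"}
```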
A while back I wrote a bit about the evolution of voice recognition and just dipped my toe into the NLP waters - perhaps interesting if you want a little glimpse of how we got here: https://newsletter.wirepine.com/p/talk-to-the-wizard
Nice that you have your wife as a litmus test for where we stand on mainstream appeal of AI voice. I liked your piece and the grandma vs. Rex analogy.
I can see it was sparked by the "Advanced Voice Mode" demo, which was actually super impressive. I don't think the actual advanced voice mode quite lived up to the OpenAI demo reels after it launched, at least in my experience. Some of the subtleties and nuanced responses weren't quite there. I wouldn't be surprised if they had to nerf it for bandwidth and other reasons.
Solid breakdown of the voice AI landscape. The point about ambient computing needing voice as a natural interface cuts to the core of why this tech matters beyond just "AI can talk now." What strikes me most is how fast emotion detection is maturing (Hume EVI) - that's the real differentiator for conversational agents vs. just stitching speech-to-text and text-to-speech together. The dubbing use case is underrated, especially for solo creators who couldn't afford localization before.
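For anyone curious, the "stitched" cascade looks roughly like this - a rough sketch assuming the OpenAI Python SDK (v1.x), with illustrative model and voice names. Notice how anything the user's voice carried gets flattened to plain text between steps, which is exactly what end-to-end models avoid:

```python
# Rough sketch of the stitched approach: speech-to-text -> text LLM
# -> text-to-speech. Assumes the OpenAI Python SDK (v1.x); model and
# voice names are illustrative. Tone, hesitation, and emotion in the
# user's voice are lost once step 1 reduces the audio to text.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def voice_turn(audio_path: str, out_path: str = "reply.mp3") -> None:
    # 1) Speech-to-text: transcribe the user's audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        )

    # 2) Text in, text out: the LLM never "hears" the user's tone.
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript.text}],
    ).choices[0].message.content

    # 3) Text-to-speech: synthesize the reply audio.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    )
    speech.stream_to_file(out_path)
```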
Agreed: I think truly nailing the emotional cues and making live conversations sound natural (interruptions, pauses, etc.) is likely the final piece of the puzzle when it comes to getting more people to embrace voice as a viable interface. We're getting pretty close!