Hello and welcome back to my newsletter. I welcomed a beautiful little baby boy into the world a couple months ago, so writing has taken a back seat. Hopefully, this edition marks getting back into the swing of things. And thank you to those of you who reached out to ask when the next one would land. It was a good reminder that this is sometimes valuable to people.
Will voice see its Nano Banana moment?
Over the last three years, ChatGPT has solidified its position as the dominant consumer AI app by far. OpenAI originally differentiated on model quality when GPT-4 was the only game in town, but now a half-dozen labs have trained equivalent models without taking significant market share. Has this race been won?
Based on my own behavior and what I saw in the market, I would have argued yes, but Google found a chink in OpenAI’s armor with Nano Banana. Since the release of this image editing model, which preserves existing features dramatically better than OpenAI’s, the Gemini app has risen from obscurity to the top of the App Store.
I wonder whether the same opportunity exists for audio. Many good-enough audio experiences exist, but none has really embedded itself in my life the way ChatGPT has for text. I speak to ChatGPT’s Advanced Voice Mode (AVM) in French for 15 minutes per day in hopes that my son will pick some up, and I occasionally talk to Grok in my car, but the only voice experience I would be annoyed to lose is simple commands like playing music on my Ray-Ban Meta glasses.
I don’t know what that experience is, but I have to believe there is a consumer audio experience yet to be discovered that will give its discoverer an opportunity to shake up the leaderboard.
The canonical audio leaderboards are crystallizing
Last year, I wrote about how the embryonic state of audio evals and leaderboards made hill-climbing and comparing models challenging. A year later, we still don’t have the MMLU or SWE-bench of audio, but several benchmarks have gained traction, with Artificial Analysis’s Big Bench Audio taking the lead.
Big Bench Audio is an understanding-only benchmark that tests the ability of audio LLMs to answer complex spoken reasoning questions. While this is only a small slice of audio LLM performance, it is an important one because most audio LLMs regress on this style of question due to the challenge of aligning speech and text tokens.
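The contrast between the cascaded baseline and a native audio model can be sketched as follows. The function names are hypothetical stand-ins, not real APIs; the string tags just make the data flow visible:

```python
# Hypothetical stubs standing in for real models (not real APIs).

def transcribe(audio):   # ASR model: speech -> text
    return f"text({audio})"

def reason(text):        # text LLM: all reasoning happens here
    return f"answer({text})"

def synthesize(text):    # TTS model: text -> speech
    return f"speech({text})"

def cascaded(audio):
    """ASR -> text LLM -> TTS. Reasoning runs entirely in text, so
    benchmark scores track the underlying text model."""
    return synthesize(reason(transcribe(audio)))

def native(audio):
    """A single speech-to-speech model: one end-to-end forward pass
    with no intermediate text. Reasoning must happen over speech
    tokens, which is where the regression shows up when speech and
    text representations are poorly aligned."""
    return f"speech({audio})"

print(cascaded("question"))  # speech(answer(text(question)))
```

The native path skips both conversions, which is where its latency advantage comes from, and why labs keep grinding on speech-text alignment rather than falling back to cascades.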
Interestingly, OpenAI has mostly closed this gap with the latest GPT-4o-realtime model, dropping only 7 percentage points (83%) from the cascaded baseline (90%) and improving 17 points over their first attempt (66%). Great progress has been made on speech-text alignment at OpenAI in the last year!
A deluge of OSS audio models
While I was on leave, it felt like a new OSS audio model dropped almost every week. I haven’t had time to play with all of them (I hope to catch up next edition), but here is my likely incomplete list, for posterity’s sake.
- StepFun releases an 8B speech-to-speech model, which seems to outperform the previous best option, Qwen2.5-Omni.
- Microsoft releases VibeVoice, a SOTA TTS model, which performs well on the long-form content that challenges traditional TTS models.
- MiMo releases a 7B native audio model with impressive benchmark numbers.
- Alibaba drops Qwen3-Omni, which sets the standard for OSS audio models. Its benchmark numbers are quite impressive. Performance still trails OpenAI’s AVM when both are accessed via their consumer apps, but the gap is shrinking. Voice startups wanting to build on native audio models finally have an option close to the frontier.
- Qwen ASR and TTS models round out Alibaba’s audio releases. There are so many great OSS ASR and TTS models out there now that I’m not sure these will move the needle. I would be surprised if both weren’t distilled from Qwen3-Omni.
- KittenTTS releases an ultra-small TTS model (15M parameters), which surprisingly doesn’t sound half bad.
Google finally ships Duplex
Seven years ago, Google shocked the world by previewing its Duplex assistant, which could make ultra-realistic phone calls to small businesses to schedule haircuts for you. At the time, the general public was both amazed by the fidelity of the voice agent and uncomfortable with the ability to send an AI to do one’s dirty work. Needless to say, Duplex never shipped, and the project was pushed into the Geo team, which used it to confirm local business hours and menus (while clearly stating it was an AI). This somehow felt less icky.
This month, Google finally unleashed Duplex 2.0, presumably powered by Gemini rather than its old cascaded conversational AI stack. If you search for “haircuts near me” or similar, a “Have AI Call” button appears below the sponsored links. Click it, and you can specify what information you are looking for and have Google call a bunch of businesses on your behalf to get an answer, which it summarizes in a nice email.
My reaction to my first spin of “Have AI Call” was a mix of awe and horror. I hate calling around, and this solves my problem! But am I really OK having a robot spam a bunch of small businesses with stupid questions? I guess I need to start a company to help these businesses play defense and pick up the Google robocalls with an AI of their own.
11Labs does music; lawyers inbound?
Music generation is a hard problem. Not because it is technically hard but because the best training data is locked behind very robust IP safeguards. The studios control access to their catalogs far more effectively than other sources of copyrighted data for AI training.
Until now, music generation has been the domain of startups like Suno that do a good job but are being sued into oblivion, research efforts like FAIR’s MusicGen that are neat but could never replace human musicians, and Chinese models that may not be as interested in respecting Western IP laws.
I was a bit surprised to see 11Labs throw its hat into the ring given how fraught this space is, but it seems to perform well on the leaderboards and in my subjective listening tests, supposedly without training on any unlicensed work. I will be curious to see how IP law shakes out here.
AI therapist? Not in Illinois
I am a pretty progressive guy. If people find value in something and it’s not hurting anyone else, what’s the problem? AI therapy is this and then some. As clearly evidenced by the GPT-4o deprecation debacle, many people had built relationships with this model and came to it for deep personal questions. And what’s wrong with this? Human therapists are within reach for most of the readers of this newsletter, but the only way to provide access to therapy to all 8B people on our planet is with AI.
But apparently the State of Illinois disagrees. I have no idea how banning AI therapy bubbled to the top of JB Pritzker’s priority list, but if you are a voice AI startup focusing on therapy, you are not welcome in the Land of Lincoln.
And on that note, Sam Paech at Liquid AI wrote some nice evals on EQ and spiraling, which any prospective AI therapist should probably try to maxxx.
OAI takes a cue from Palantir and forward deploys speech engineers
We all know that the market for voice is largely split between companionship and business. Most startups like Fixie and Rime are focusing on the latter and most big consumer companies on the former. I was surprised to learn that OpenAI has extended from offering a voice platform to actually sending engineers to customers to build applications. One of their first big engagements was handling customer support for T-Mobile. This podcast is worth a listen if you want to learn a bit more about how OpenAI thinks about platforms vs. applications.
EDIT: a careful reader pointed out that OpenAI did not actually send its own engineers to T-Mobile. In fact, it contracted with DistylAI to do the work. This is not what was implied in the podcast, and it’s kind of a bummer that they didn’t give credit to their partners.
Doctors are cooked pt. V
To keep up with a sub-theme of this newsletter, two additional studies landed showing AI’s superhuman performance on new medical tasks.
A huge group of doctors benchmarked o1-preview (remember, medicine moves slowly) on diagnostic tasks and found, not surprisingly, that it outperformed GPT-4o and the human physicians that GPT-4o had previously outperformed. This is the first time I’ve seen a reasoning-model benchmark come from medicine rather than the AI industry, and it just confirms what OpenAI and Microsoft have previously reported.
And another group at Johns Hopkins showed that their deep learning model could predict surgical complications better than humans. None of this is surprising to me, and it will ultimately result in better care for all of us.
Maybe doctors aren’t cooked?!?!
Frontier AI models have been shown to outperform human physicians on essentially all diagnostic benchmarks, but some researchers at Microsoft demonstrated that their responses are brittle and sensitive to minor perturbations in formatting. I don’t doubt their conclusions, but I don’t think that they support the title of their paper, “The Illusion of Readiness”.
They ignore the fact that most (yes, most) physicians are using AI tools in their practice today and finding them valuable (if you don’t believe me, Google Abridge or OpenEvidence), that consumers are finding value in using AI tools to understand their health, and that humans also make mistakes. I would have loved to see a human baseline on all their question permutations.
My conclusion is that AI is already changing medicine, but yes, we do have a few kinks to work out and this paper points out a few valid ones.