Dan’s Weekly AI Speech and Language Scoop #45

Cascades are dead, long live cascades

Kyutai shipped Moshi last year with the hypothesis that voice AIs should be fun, natural, and chatty, and should maintain human-like conversational dynamics. I agreed and had a lot of fun playing with Moshi.

However, they found that building table-stakes agent functionality like function calling, knowledge, and factuality is challenging with their native full-duplex system, so they unveiled an impressive cascaded complement called Unmute. They haven’t released the models or a technical report yet, but it is quite steerable and handles zero-shot cloning as well as anyone else. A few of their examples, like talking to a historical Charles de Gaulle, bring a smile to my face.

This is the second time (OpenAI was first) that a company has released an “old-fashioned” cascaded system after a native one. I suspect that both approaches will exist in parallel for quite some time, with the former providing platform-level flexibility and the latter a more natural conversational experience.
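For anyone newer to the space, here is a minimal sketch of what “cascaded” means in this context: three independently swappable stages chained turn by turn, with a plain text LLM call in the middle. The stage callables below are placeholder stubs of my own invention, not Unmute’s actual components; the point is the architecture, not the models.

```python
# Minimal sketch of a cascaded voice agent. The three stages are independent and
# swappable; each stage here is a dummy stub just to show the data flow.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CascadedVoiceAgent:
    asr: Callable[[bytes], str]   # user audio -> transcript
    llm: Callable[[str], str]     # transcript -> response text (tools/RAG slot in here)
    tts: Callable[[str], bytes]   # response text -> synthesized audio

    def take_turn(self, user_audio: bytes) -> bytes:
        transcript = self.asr(user_audio)      # 1. wait for the user turn to end, transcribe it
        response_text = self.llm(transcript)   # 2. ordinary text LLM call: easy to add
                                               #    function calling, retrieval, guardrails
        return self.tts(response_text)         # 3. synthesize and play the reply

# Dummy stages just to show the flow end to end.
agent = CascadedVoiceAgent(
    asr=lambda audio: "what's the weather in Paris?",
    llm=lambda text: f"You asked: {text} Let me check that for you.",
    tts=lambda text: text.encode("utf-8"),
)
reply_audio = agent.take_turn(b"\x00\x01")
```

The platform-level flexibility comes from stage 2 being an ordinary text LLM call; the price is that nothing starts until the user’s turn has been fully endpointed, which is exactly the conversational stiffness that native full-duplex models like Moshi are designed to avoid.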

Anthropic ships voice with manual endpointing in 2025?

Anthropic teamed up with 11Labs to enable voice mode in Claude. Despite the strong underlying model and voices, the overall experience is 2023 vintage. Anthropic didn’t seem to spend much effort fine-tuning Claude for voice-friendly responses, and the high-latency push-to-talk (PTT) interface feels antiquated. That said, the second most upvoted comment on the Hacker News thread about the launch praised PTT, so what do I know (although at least one other AI power user agrees with me)?

Despite the generally clunky execution, Anthropic was first to market with a feature that I have thought about a lot. Unlike other voice experiences, which display either a transcript or a glowing orb on the phone screen, Claude generates text that is complementary to the audio. If you ask for math help, it narrates the solution verbally over a concise textual derivation. If you ask how to cook a meal, it displays the recipe on the screen and explains each step.

I don’t think that most people want to use a voice mode when their hands (PTT) and eyes (complementary text) are unoccupied, but I do think that the latter is an interesting idea and the kernel of something great.

NTP is all you need pt II: The explosion of OSS TTS models continues

Last month, I joked that NTP (next-token prediction) was all you need for TTS. Since then, the Cambrian explosion of OSS TTS models has continued. I didn’t have time to rigorously test all of them, but the demos are very impressive. Chatterbox, Bland, and Voice Star would all have required deep linguistic and speech-synthesis expertise to build last year; now they just require some grit, a tokenizer, and some data.
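To make the joke concrete: the recipe is roughly to run speech through a neural codec to get discrete audio tokens, append them to the text tokens, and train an ordinary decoder-only LM with next-token prediction. The sketch below is my own schematic of that recipe, with illustrative vocab sizes and a toy stand-in model, not any specific system’s code.

```python
# Schematic of the "NTP is all you need" TTS recipe: text tokens and codec-derived
# audio tokens live in one sequence, and a causal LM is trained to predict the next token.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, AUDIO_VOCAB = 32_000, 4_096
VOCAB = TEXT_VOCAB + AUDIO_VOCAB                 # audio codes are offset past the text ids

class TinyLM(nn.Module):
    """Toy stand-in for a GPT-style decoder; a GRU is causal by construction."""
    def __init__(self, d=256):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, VOCAB)

    def forward(self, ids):
        h, _ = self.rnn(self.emb(ids))
        return self.head(h)

# One training example: [text prompt tokens] + [audio codec tokens shifted by TEXT_VOCAB].
text_ids = torch.randint(0, TEXT_VOCAB, (1, 20))
audio_ids = torch.randint(0, AUDIO_VOCAB, (1, 150)) + TEXT_VOCAB   # codec frames for a short clip
seq = torch.cat([text_ids, audio_ids], dim=1)

lm = TinyLM()
logits = lm(seq[:, :-1])                                           # predict the next token...
loss = F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
loss.backward()                                                    # ...and that is the whole objective
```

In practice you still need a neural audio codec to produce and decode those audio tokens, plus a lot of data cleaning, but the modeling core is essentially just this.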

I think that we can consider vanilla TTS largely a solved problem at this point. I will be very curious to see how Cartesia, 11Labs, and other well-funded TTS companies respond to this.

Doctors are cooked pt III brought to you by Beth Israel Deaconess Medical Center

Last year, a bunch of doctors casually interested in AI for diagnostic purposes ran a small study showing that the even-then-outdated GPT-4-Turbo model outperformed human physicians on representative diagnostic tasks. A few months ago, OpenAI repro’ed their study with the latest and greatest models.

You could wave your hands and argue that the first study was small and flawed (the participants didn’t really know how to use AI at that point) and that the second was OpenAI pumping their own bags, but this work has now been reproduced a third time in a larger-scale clinical setting. The authors “found consistent superhuman performance in every experiment. Most importantly, the model outperformed expert physicians in real cases utilizing real and unstructured clinical data in an emergency department.”

If you are not using AI to help with diagnosis and interpretation of your own medical results, it is now overwhelmingly clear that you are not getting the best care.

Nvidia does full-duplex on a budget

Nvidia introduced a novel method for building a native full-duplex speech model with a minimal compute and data budget. They feed the model the user and assistant streams in parallel using different embeddings (continuous features from an ASR encoder on the user side, discrete tokens on the assistant side) and predict speech and text tokens jointly at 12.5 Hz with a 1.1B TinyLlama backbone in the middle.
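To make the setup concrete, here is a minimal sketch of the dual-stream fusion as I understand it; the module choices, dimensions, and the tiny stand-in backbone are my own assumptions rather than Nvidia’s released code.

```python
# Sketch of the dual-stream idea: continuous user-side features and discrete
# assistant-side tokens are projected into one space, summed frame by frame at
# the same 12.5 Hz rate, and one causal backbone predicts text and speech jointly.
import torch
import torch.nn as nn

class DualStreamStep(nn.Module):
    def __init__(self, asr_dim=512, d_model=1024, speech_vocab=2048, text_vocab=32000):
        super().__init__()
        self.user_proj = nn.Linear(asr_dim, d_model)               # continuous ASR-encoder features
        self.assistant_emb = nn.Embedding(speech_vocab, d_model)   # discrete assistant speech tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2) # stand-in for the 1.1B TinyLlama
        self.text_head = nn.Linear(d_model, text_vocab)            # joint text prediction
        self.speech_head = nn.Linear(d_model, speech_vocab)        # joint speech-token prediction

    def forward(self, user_feats, assistant_tokens):
        # Both streams are aligned at the same frame rate and fused by addition.
        x = self.user_proj(user_feats) + self.assistant_emb(assistant_tokens)
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)  # autoregressive mask
        h = self.backbone(x, mask=causal)
        return self.text_head(h), self.speech_head(h)

model = DualStreamStep()
user_feats = torch.randn(1, 25, 512)                # ~2 s of user audio at 12.5 Hz
assistant_tokens = torch.randint(0, 2048, (1, 25))  # assistant stream at the same frame rate
text_logits, speech_logits = model(user_feats, assistant_tokens)
```

The real system presumably interleaves the text and speech targets more carefully and starts from the actual TinyLlama checkpoint; the sketch is only meant to show how two streams at the same frame rate can share one backbone.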

Once the system is set up, they fine-tune the whole thing e2e on 3k hours of real and 27k hours of synthetic conversational data, roughly 30k hours total, which is a very small dataset relative to other published systems (Moshi used 7M hours for pre-training and a dataset on this order for post-training).

They show that they outperform Moshi across the board in both answer quality (likely due to a higher-quality text backbone) and conversational dynamics, and the demos sound pretty good. I find it quite surprising and impressive that they were able to get such good performance with so little audio data. If this approach generalizes and scales, I think it could impact how we all think about multimodality.

Does Apple think AI is an illusion?

Last fall, Apple researchers published their first paper suggesting that models couldn’t reason, based on some, in my opinion, largely expected patterns in LLM performance on various versions of GSM8k. The response to this paper was largely dismissive, although some AI doomers used it to prove that the (then) world’s largest company acknowledged that AI was a dead end.

Since then, model capabilities have only exploded. You can now one-shot a totally playable game with assets or condense months of research into a totally readable summary in a single click. I would be shocked if today’s crop of models didn’t saturate Apple’s previous experiment suggesting models didn’t reason.

Recently, they one-upped themselves, suggesting again that models can’t reason in their paper The Illusion of Thinking. This paper has been covered in depth elsewhere, but the gist is that they show that once a series of reasoning puzzles reaches a certain level of complexity, reasoning models can no longer solve them, and that non-reasoning models outperform reasoning models on various tasks at certain compute budgets. While these observations are certainly reasonable (who hasn’t accidentally asked o3 a simple question and laughed at it thinking for minutes?), they don’t generalize and don’t support the conclusion that these models don’t reason by any reasonable (lol) definition of the word.

I find it quite convenient and funny that Apple’s research lab happens to be studying, seemingly exclusively, the limitations of these amazing models at the same time that the product team is pretending they aren’t changing everything. It’s almost like the marketing team is running the research lab these days.

Gemini native audio finally ships

Google announced Gemini native audio generation 6 months ago with a super impressive demo, but it seemed that they shipped a much less capable cascaded system as Gemini 2.0 Flash in AI Studio. However, Gemini 2.5 Flash Preview Native Audio Dialogue (wow, that’s a mouthful) finally lives!

I was most impressed by the quality of speech generation. Naturalness and expressivity are very good, and the model’s ability to respond acoustically to instructions was beyond AVM’s. However, the model butchered the pronunciation of even common words. This is a common criticism of e2e speech generation systems that lack linguistically crafted features, which I suspect is the case here, but AVM and other competitors don’t suffer from it nearly as much.

I was least impressed by the semantic answer quality and conversationality. While Gemini’s factuality and knowledge were top-notch, I felt like I was talking to Wikipedia rather than a trusted companion most of the time. Even though measured latency was on par with top competitors, the conversations felt very mechanical. There is something about the dry answers, poor follow-up questions, and formal tone that would not inspire me to spend much time with this model.

That said, this is clearly better than whatever is powering the voice experience in the Gemini app today, so I am very puzzled why it isn’t shipped more broadly. Maybe it is designed for more boring business-agent voice interactions for Cloud customers?

And in addition to ordinary Native Audio, Google shipped Native Audio Thinking. In this mode, the model decodes a bunch of thinking text tokens before beginning to respond. Waiting 5 seconds every turn for the model to do its thinking is not a good dialogue experience, and in my experimentation, I couldn’t find a single prompt whose response was significantly improved by the thinking tokens. I’ve heard lots of discussions about reasoning speech models, so I’m glad someone did it and showed me what I don’t want!

ChatGPT AVM (re)learned to sing 

OpenAI just launched an updated version of AVM, which seems to have a more natural voice and can sing again. Unlike with the original AVM launch, which littered the internet with amazing but unreproducible generation tricks, including an airline pilot announcing turbulence with radio static, a soccer player scoring in a stadium, and various types of musical performances, I can repro all the fun musical examples. The only downside is that AVM refuses to reproduce copyrighted pieces, so you really need an encyclopedic appreciation of 19th-century vocals.

The improved naturalness sounds very similar to our Llama 4 voice. I think that all companies are converging on the idea that crossing the uncanny valley in generation quality is an important component of building sticky voice products.