Dan’s Weekly AI Speech and Language Scoop #43

Full-duplex Llama 4 lives!

We shipped a thing!

Our first Llama 4 audio launch is a demo of our natural-sounding full-duplex model inside the new Meta AI app. We focused on conversationality and naturalness, and feedback has been pretty good in that respect. Mark already committed to some anon’s Threads account that we are working on getting this into default mode, so I guess we better get to work! Models like this are really, really hard to evaluate, so give it a try and let me know what you think.

NTP is all you need

Historically, language-related ML research required bringing together a diverse group of linguists and computer scientists to collaborate on understanding language and designing algorithms to mimic it. Slowly, starting with ASR (“Every time I fire a linguist, the performance of the speech recognizer goes up” –Frederick Jelinek and “We have to learn the bitter lesson that building in how we think we think does not work in the long run” –Rich Sutton), language research has transitioned from deeply understanding human communication to making GPUs go brrrrr.

This bitter lesson has come for generative audio models. The latest Orpheus, Sesame, Dia, Fish Audio, Llasa, Zonos, and OpenAI efforts show that you can “just” NTP your way to a really strong speech generation model. Most transparently, the creators of the Dia model are a couple of college kids in South Korea who admitted to knowing nothing about TTS 3 months prior to training the most expressive TTS model on the market.

ElevenLabs has now extended this from speech to sound effects. The approach is almost certainly the same (a toy sketch in code follows the list):

  1. Train a tokenizer that encodes both speech and non-speech audio
  2. NTP
  3. Profit
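
Here is a minimal sketch of step 2, assuming step 1 has already turned the audio into discrete token ids from some neural codec. Everything here (vocab size, model size, the random stand-in batch) is an illustrative placeholder, not anyone’s actual recipe:

```python
# pip install torch
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 1024      # codebook size of the (assumed) audio tokenizer
CONTEXT = 256     # audio tokens of context per training example

class TinyAudioLM(nn.Module):
    """Decoder-only transformer doing next-token prediction over audio tokens."""
    def __init__(self, vocab=VOCAB, dim=256, heads=4, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(CONTEXT, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: (batch, time)
        t = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.blocks(x, mask=mask)                # causal self-attention
        return self.head(x)                          # (batch, time, vocab) logits

model = TinyAudioLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for real data: in practice these ids come from the codec in step 1.
batch = torch.randint(0, VOCAB, (8, CONTEXT))
logits = model(batch[:, :-1])                        # predict token t+1 from tokens <= t
loss = F.cross_entropy(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
loss.backward()
opt.step()
print(f"NTP cross-entropy: {loss.item():.3f}")
```

That really is the whole trick: a decoder-only transformer plus cross-entropy on the next token. All the domain knowledge that linguists and DSP folks used to supply now lives in the tokenizer and the data.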


This is really cool from a product perspective and quite clever. Any bets on how quickly we get a free, OSS version that matches their performance? Can our friends at Dia do it before classes start again in the fall?

Mindblowing TTS expressivity with Nari Labs’ Dia

I touched on Dia above. You owe it to yourself to spend some time with the model. It is quite unstable but very impressive when you nail the prompt and the decoding settings. The model is able to generate very expressive speech in very natural voices at high quality. Cloning fails out-of-sample, which I attribute to the lack of diversity in their training data (not totally surprising for two undergrads with limited data/compute budgets).

And I do want to give a shout-out to Dia for putting in the time to make their code work out of the box. I was having so much fun playing with this on HuggingFace that I ran out of credits, but I was up and running locally on my MacBook Pro in <5 min, generating clips at ~0.25 RTF (real-time factor), which was fine for my casual experiments.
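
For anyone who wants to sanity-check a number like that, RTF is just wall-clock generation time divided by the duration of the audio produced, so lower is better and <1.0 means faster than real time. A minimal sketch; the generate function and sample rate are placeholders for whatever model you are actually benchmarking:

```python
import time
import numpy as np

SAMPLE_RATE = 44_100  # assumed output sample rate; use whatever your model emits

def real_time_factor(generate_fn, text):
    """RTF = seconds spent generating / seconds of audio produced."""
    start = time.perf_counter()
    samples = generate_fn(text)              # stand-in for the actual TTS call
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / SAMPLE_RATE
    return elapsed / audio_seconds

# Fake generator so the sketch runs end to end: pretends to spend 0.5 s
# producing 2 s of silence, i.e. RTF = 0.25.
def fake_tts(text):
    time.sleep(0.5)
    return np.zeros(2 * SAMPLE_RATE)

print(f"RTF: {real_time_factor(fake_tts, 'hello there'):.2f}")
```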

OSS interspecies communication with DolphinGemma

Regular readers know that I am quite fond of animal speech projects [1, 2, 3]. Our friends at Google just released DolphinGemma, a mini model that can run on a phone and is pioneering bidirectional communication. Yes, my friends, not only can you listen to whales and elephants now, but you can begin to talk to dolphins.

Technically, I find it interesting that the SoundStream tokenizer generalized well to dolphin clicks. It’s not obvious to me what tradeoffs they made to maximize performance on human speech, but their MUSHRA scores on music suggest they clearly made some. Maybe this says something interesting about parallels between dolphin and human language.

Amazon Nova Sonic and a new new new ASR king

Amazon released the technical paper supporting their new voice model. Like most reports these days, they provided essentially no information about training, but they did manage to announce SOTA ASR performance!

For the last three editions [OpenAI, ElevenLabs, sandlogic], someone has claimed the ASR throne, so it’s almost funny at this point to find a fourth. As I mentioned earlier, generic ASR is mostly a solved problem, and the devil is in the details. Each of these models measures performance on a different set of benchmarks, and I totally believe that all parties can be right at the same time. I guess the ASR kingdom is a polyarchy.
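
To make the “everyone can be right” point concrete: WER rankings depend entirely on which test sets you report. A toy example with the jiwer package; the transcripts and the two “models” below are made up purely for illustration:

```python
# pip install jiwer
from jiwer import wer

# Two hypothetical models evaluated on two different test sets.
clean_refs = ["turn the living room lights off", "what is the weather tomorrow"]
noisy_refs = ["book a table for two at seven", "play the latest episode of the podcast"]

model_a = {
    "clean": ["turn the living room lights off", "what is the weather tomorrow"],
    "noisy": ["book a table for tuna seven", "play the late episode of the podcast"],
}
model_b = {
    "clean": ["turn the living room light off", "what is the weather tomorrow"],
    "noisy": ["book a table for two at seven", "play the latest episode of the podcast"],
}

for name, hyps in [("Model A", model_a), ("Model B", model_b)]:
    print(name,
          "clean WER:", round(wer(clean_refs, hyps["clean"]), 3),
          "noisy WER:", round(wer(noisy_refs, hyps["noisy"]), 3))
```

Model A is perfect on the clean set and Model B is perfect on the noisy set, so each can truthfully claim the throne depending on which benchmark makes it into the press release.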

Amazon won’t let me log into the AWS Console to try Nova Sonic due to an unpaid bill that they will not let me pay, so if anyone wants to share their impressions, I will publish in the next newsletter.

ChatGPT:Gemini::Budweiser:Coors?

Benedict Evans recorded an irreverent podcast poking fun at what it means to have a product strategy for AI models that users expect to do everything. His perspective resonates with me.

To the average person, an AI product is a magical text box that answers any question posed. They aren’t creating held-out personal evals full of riddles or scrutinizing model tone like a sommelier would a 2022 Russian River pinot. To extend the metaphor, the average consumer is looking for a bag of Franzia rather than a Screaming Eagle magnum.

The current crop of models meets all of their basic needs. Our parents generally don’t know or care whether they are using ChatGPT, Claude, Meta AI, Grok, DeepSeek, Qwen, or something else entirely. The basic product experience is solved for them, and differentiation is all about colors, branding, and distribution.

What does this sound like? According to Ben and his co-hosts, the answer is the consumer packaged goods industry. For the average consumer, ChatGPT and Gemini have about as much differentiation as Budweiser and Coors. If you agree with this perspective, which I largely do in a more limited sense, all of the things that we spend lots of money on today actually don’t matter. We just need to train a good-enough model to answer grandma’s questions, design and brand well, and put it in front of her to win.

Where I disagree to some extent is on whether this will be a durable state. I think that consumers today expect nothing more than “super Google” from AI, but I do think that as more users get comfortable with these products, they will expect models to take consequential actions on their behalf. This should result in some long-term technical differentiation, although I recognize that it is totally possible that all labs push the frontier at approximately the same rate and it may truly come down to what colors users prefer.

How should a PM org scale for gen AI teams?

Related to the point above, PMing gen AI products requires a bit of a different mindset. Gen AI tends to flatten organizations and roles and break a lot of assumptions about narrow scoping and MVPs. A key characteristic of most gen AI products is that they need to be quite general, which makes traditional prioritization challenging.

Kevin Weil, the CPO of OpenAI, recently appeared on a podcast and disclosed that OpenAI only had ~25 PMs, despite offering a huge array of features to a big chunk of the world through their multiplatform consumer apps and supporting a large enterprise business. I obviously know how the Meta GenAI PM org is structured but would love to learn more about different companies and how they use or don’t use PMs in a gen AI world.