Dan’s Weekly AI Speech and Language Scoop #40

Measuring the quality of speech generation is hard; Meta’s FAIR takes a big step in making it easier

People have been building speech generation systems for a long time. Measuring their quality is devilishly challenging and subjective. Typically, audio samples are played for a panel of reviewers who rate them from 1 to 5 (the fancy name for this is MOS, or mean opinion score). I don’t think that anyone really cared for a long time: TTS systems were generally responsible for communicating dry information, like Alexa telling you that your timer has elapsed.
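For concreteness, a MOS is just the average of those 1–5 ratings, usually reported with a confidence interval. A minimal sketch with made-up ratings (the normal-approximation interval is my own simplification; real evaluations add rater screening and more careful statistics):

```python
import statistics

# Hypothetical 1-5 ratings collected from listeners for one TTS system.
ratings = [4, 5, 3, 4, 4, 5, 2, 4, 3, 5]

mos = statistics.mean(ratings)                          # mean opinion score
sem = statistics.stdev(ratings) / len(ratings) ** 0.5   # standard error of the mean
ci95 = 1.96 * sem                                       # normal-approximation 95% interval

print(f"MOS = {mos:.2f} ± {ci95:.2f} (n={len(ratings)})")
```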

Two things changed:

  1. LLM-based TTS systems just want to learn from data. Generic AI teams can now create strong TTS models without specific linguistic expertise, leading to a profusion of models, many of which we’ve covered in this newsletter.
  2. LLM backbones (and e2e audio LLMs) enable users to have real conversations with AIs. This raises the bar for the level of expressivity and naturalness that users expect.

However, just because we now have the will and the way to build more natural speech generation systems does not mean that we know how to define “good” and scalably get there.

To address this problem, the FAIR team at Meta built Audiobox Aesthetics, a model trained to predict human perception of production quality, production complexity, content enjoyment, and content usefulness. This is the best attempt I’ve seen so far at automatically capturing the nuance of human preference for synthetically generated speech. You can give it a try yourself at their HuggingFace demo.

Another SOTA ASR model; what does this even mean?

ElevenLabs released a closed-source ASR model called Scribe aimed squarely at competing with OpenAI’s Whisper, which has achieved close to 100% market share in the developer community for its good-enough performance on essentially all commonly spoken languages.

OpenAI was smart when they released Whisper. Rather than targeting ASR benchmarks like Librispeech that are so enticing to academics, they realized that ASR was mostly a solved problem and that they could build something general-purpose that was easy to set up and worked for most people in most situations.
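That ease of setup is a big part of the story: with the open-source openai-whisper package (and ffmpeg) installed, transcription is a few lines. The file path below is just a placeholder:

```python
import whisper

# Load a pretrained Whisper checkpoint; "large-v3" trades speed for accuracy.
model = whisper.load_model("large-v3")

# Transcribe an arbitrary audio file (the path is illustrative).
result = model.transcribe("meeting.mp3")
print(result["text"])
```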

ElevenLabs adopted the same approach but added a claim that theirs is the “world’s most accurate ASR model”. They didn’t release a technical report and the figures in their blog post are too blurry to read, but they generally base this claim on beating Gemini 2.0 Flash, Deepgram Nova-2, and Whisper Large (which itself never claimed SOTA status) on FLEURS and Common Voice.

But are these the right competitors and metrics? We discussed a new SOTA on Librispeech last month that itself claimed the throne. Is Scribe better? Is there one throne? Samba-ASR almost certainly outperforms Scribe on clean English speech; Scribe almost certainly wins on its own target domain; Meta’s models win on production Facebook traffic; Google’s on YouTube.
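The headline metric behind all of these claims is word error rate (WER), and which transcripts you compute it on decides who “wins”. A minimal sketch using the jiwer package on made-up transcripts:

```python
from jiwer import wer

# Illustrative reference and two hypothetical system outputs.
reference = "set a timer for ten minutes"
hypothesis_a = "set a timer for ten minutes"        # e.g., a model on clean read speech
hypothesis_b = "set a time her for ten minutes"     # e.g., the same model on harder audio

print("WER A:", wer(reference, hypothesis_a))   # 0.0
print("WER B:", wer(reference, hypothesis_b))   # 2 errors / 6 reference words ≈ 0.33

# The same system can look SOTA or mediocre depending entirely on which domain
# the reference/hypothesis pairs come from.
```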

General ASR is very much a solved problem, but specific problem spaces, e.g. those involving complex entity recognition, accented speech, or long-tail languages, will have their own specific solutions. I’m happy to see another ASR model enter the ring, but I question what SOTA even means at this point.

MSFT does audio with Phi-4-multimodal

No technical report yet, but Microsoft released a brief blog post on Phi-4-multimodal. It is not totally clear how their model works or whether it is a true omni (type D) model. The only hint they give is that it uses a “mixture of LoRAs” architecture, which perhaps implies that a different LoRA is loaded depending on the modality.
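If that reading is right, the routing could look something like the sketch below: a frozen base projection plus one low-rank adapter per modality, selected at call time. This is my speculation about the mechanism, not Microsoft’s published design, and the names and shapes are purely illustrative:

```python
import torch
import torch.nn as nn

class ModalityLoRALinear(nn.Module):
    """A frozen base linear layer plus one low-rank (LoRA) adapter per modality (speculative sketch)."""

    def __init__(self, d_in, d_out, rank=8, modalities=("text", "vision", "audio")):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():
            p.requires_grad_(False)                 # the shared backbone weights stay frozen
        self.lora_A = nn.ModuleDict({m: nn.Linear(d_in, rank, bias=False) for m in modalities})
        self.lora_B = nn.ModuleDict({m: nn.Linear(rank, d_out, bias=False) for m in modalities})
        for m in modalities:
            nn.init.zeros_(self.lora_B[m].weight)   # adapters start as a no-op, as in standard LoRA

    def forward(self, x, modality):
        # Only the adapter matching the declared input modality contributes to the output.
        return self.base(x) + self.lora_B[modality](self.lora_A[modality](x))

layer = ModalityLoRALinear(d_in=16, d_out=16)
audio_features = torch.randn(2, 16)
print(layer(audio_features, modality="audio").shape)    # torch.Size([2, 16])
```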

They report strong general audio understanding numbers, dethroning Qwen2-audio as the leading OSS model (remember: this is audio understanding, not dialogue; I don’t think that this model generates speech or holds a conversation). Interestingly, Phi-4 handily outperforms Gemini-2.0-Flash and GPT-4o-RT-preview, the “SOTA” ASR models against which ElevenLabs decided to benchmark themselves. They should have thrown Phi-4 into the ring!

Step Fun enters the audio LLM chat

An unknown (to me) group called StepFun OSS’ed what I think is the first QA-focused and production-ready-ish, native-ish audio LLM. Unlike Moshi, which really targeted creating a fun, chatty companion, StepAudio could conceivably be dropped into “serious” audio LLM use-cases like assistants and business agents. Their demo is a bit underwhelming, but I suspect a big piece of that is that they focused on Chinese rather than English.

Beyond OSSing this cool model, the interesting bits of their approach are:

  • Input speech is provided to the model in three modalities: a transcript, linguistic tokens, and semantic tokens. This is similar to Moshi’s approach; they could not drop perplexity with a single tokenizer.
  • Their detokenizer is actually a standalone streaming TTS model that can process both audio tokens and text. They did this to separate the understanding and generation components (due to data scarcity and for controllability) and to let them benchmark the detokenizer against competitive TTS systems.
  • They use a SpiritLM-style interleaving scheme to teach semantic/linguistic/text token alignment midway through pretraining and then finish by adding ASR and TTS tasks alongside more audio and interleaved audio-text data.
  • They continue pretraining (CPT) a pre-trained text LLM on audio. It appears that they maintain text knowledge by slowly introducing speech across three stages.
  • When the VAD model detects user speech, the LLM starts generating tokens in anticipation of the user finishing (see the sketch after this list). If the response turns out not to be appropriate, it is thrown away; if it is, latency drops ~500 ms compared to non-speculative methods.
  • In the context window, past speech tokens are replaced with the text generated by ASR to be more efficient with history (1:14 compression!).
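The speculative generation trick is easy to picture in code. Below is a toy sketch of one plausible control flow under my reading of their description; the VAD, ASR, and LLM calls here are stand-in stubs, not StepAudio’s actual components:

```python
import time

# Stand-in stubs: in the real system these are a streaming VAD, a streaming ASR, and the audio LLM.
def vad_thinks_turn_may_be_over(chunk):
    return chunk.endswith("...")                 # toy heuristic for "the user might be finishing"

def transcript_so_far(chunk):
    return chunk.rstrip(".")

def generate_response(prompt):
    time.sleep(0.5)                              # pretend decoding costs ~500 ms
    return f"Answer to: {prompt!r}"

def respond_with_speculation(audio_chunks):
    """Draft a response as soon as the turn looks finished; commit or discard it when it actually ends."""
    draft, draft_prompt = None, None
    for chunk in audio_chunks:
        if draft is None and vad_thinks_turn_may_be_over(chunk):
            draft_prompt = transcript_so_far(chunk)
            draft = generate_response(draft_prompt)      # speculative work while the user wraps up
    final_prompt = transcript_so_far(chunk)
    if draft is not None and draft_prompt == final_prompt:
        return draft                                     # speculation held up: ~500 ms saved
    return generate_response(final_prompt)               # speculation was wrong: throw it away and redo

# The user trails off, the draft matches the final transcript, and the cached answer is reused.
print(respond_with_speculation(["set a timer for ten minutes...", "set a timer for ten minutes"]))
```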

They perform very strongly relative to every other (weak) OSS audio LLM on both standard OSS benchmarks and their own Chinese-language one. In keeping with the theme of SOTA ASR, StepAudio also outperforms Whisper on every benchmark, as does Qwen2-audio, which is benchmarked alongside.

Grok 3’s deep research smokes Gemini, Perplexity, and ChatGPT

Whether you like Elon or not, Grok is a pretty good model/app/system. I finally got access to ChatGPT’s Deep Research, so I pitted it against Grok 3, Gemini, and Perplexity to answer my favorite question: “what is the best OSS zero-shot voice cloning project?”

If you read this newsletter every now and then, you will know that this is something I track carefully and that new systems keep dethroning old ones. Only Grok came to a solution even close to correct, Llasa-3, while the other three recommended projects that are a year or two old, like Tortoise TTS, and aren’t even in the ballpark any longer.

I think this is due to some combination of so much AI discussion happening on Twitter and Grok frankly just being quite a good model. I saw similar, though less striking, differences across a number of other representative deep research queries.

Router → #1 in the Chatbot Arena; Elon could have saved a few billy

Some clever UCB students trained a router to predict which model generated the most preferred response to a given prompt. They found, not surprisingly, that even if overall leaderboard performance is similar, some models are better at certain tasks.

They submitted their router to the Chatbot Arena and achieved the #1 position with an Elo of 1400, which matches where Grok 3 debuted. Elon and the X.ai team could have saved a heck of a lot of money by leveraging this approach!
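A heavily simplified sketch of the idea: given prompts labeled with which model’s response won a preference comparison, train a classifier and route each new prompt to the predicted winner. The tiny dataset and model names below are made up, and the real router is far more sophisticated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data: prompt -> which model won the human preference vote.
prompts = [
    "write a haiku about autumn",
    "prove that sqrt(2) is irrational",
    "draft a friendly follow-up email",
    "solve this integral: x^2 * e^x dx",
]
winners = ["model_creative", "model_math", "model_creative", "model_math"]

# A bag-of-words classifier is enough to illustrate the routing idea.
router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(prompts, winners)

# At serving time, send each new prompt to the model predicted to win the vote.
print(router.predict(["solve this system of equations"]))   # -> ['model_math']
```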

Everything you’ve ever wanted to know about pretraining (but were afraid to ask)

HuggingFace published the Ultra-Scale Playbook on their adventures in pre-training. The authors clearly explain how to train large models on distributed systems and all the different types of parallelism involved. This is the best explanation of the whole process I’ve read, and I’d recommend giving it a read if you are interested in this kind of thing.
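To give a flavor of what the playbook covers, here is a tiny single-process illustration of tensor parallelism: a linear layer’s weight is split across two “devices” and the partial outputs are gathered back together. A real implementation shards across GPUs and adds communication collectives; this only shows the arithmetic:

```python
import torch

torch.manual_seed(0)

d_in, d_out, batch = 8, 6, 4
x = torch.randn(batch, d_in)
w = torch.randn(d_out, d_in)              # full weight of a linear layer (bias omitted for brevity)

# Each "device" holds half of the output features (a column-parallel split).
w_shard_0, w_shard_1 = w.chunk(2, dim=0)

# Each shard computes its slice of the output independently (in parallel on real hardware).
y_0 = x @ w_shard_0.T
y_1 = x @ w_shard_1.T

# Concatenating the slices reproduces the unsharded result (an all-gather in practice).
y_parallel = torch.cat([y_0, y_1], dim=1)
y_reference = x @ w.T
print(torch.allclose(y_parallel, y_reference, atol=1e-6))   # True
```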

Perplexity defaults to push-to-talk for voice interaction; how strange

There are generally two types of products: synchronous, like OpenAI’s Advanced Voice Mode, and asynchronous, like WhatsApp’s voice message dictation. The former represents a real conversation with an AI: the system determines when the user has finished speaking, what she said, and how to respond. The latter usually just sticks a speech recognizer and a button to indicate when she is talking on top of an existing product and is a shortcut for typing.

Perplexity released a voice interface that, in my mind, combines the worst of both of these worlds. The user can neither lean back and have a hands-free conversation, because she needs to endpoint the speech herself, nor quickly scan a text answer for the relevant information. While you can switch to a more natural, automatically endpointed option, albeit a slow and hardly conversational one, I am very curious why they made this design choice and how long it will take to reverse (or whether they’ve discovered something new).
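Endpointing, i.e., automatically deciding that the user has finished speaking, is the crux of the synchronous experience. A toy energy-based version with made-up thresholds looks something like this (production systems use trained VAD models and far more robust logic):

```python
import numpy as np

def find_endpoint(audio, sample_rate=16000, frame_ms=30, silence_ms=700, energy_threshold=1e-3):
    """Return the sample index where the user is deemed to have finished speaking, else None."""
    frame = int(sample_rate * frame_ms / 1000)
    needed_silent_frames = silence_ms // frame_ms
    silent_run = 0
    heard_speech = False
    for start in range(0, len(audio) - frame + 1, frame):
        energy = float(np.mean(audio[start:start + frame] ** 2))
        if energy >= energy_threshold:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed_silent_frames:
                return start + frame        # enough trailing silence: the turn is over
    return None                             # still waiting for the user to finish

# Synthetic example: one second of "speech" (noise) followed by one second of silence.
audio = np.concatenate([np.random.randn(16000) * 0.1, np.zeros(16000)])
print(find_endpoint(audio))                 # an index shortly after the speech ends
```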