Dan’s Weekly AI Speech and Language Scoop #34

Hail King Gemini Pt I: #1 Audio LLM Judge

Long ago (last year lol), almost every LLM paper relied on GPT-4 to judge their models’ responses. At first, this practice was widely criticized, but over time, it became universal (and evolved into training use-case-specific reward models). A number of papers confirmed that GPT-4 (and other strong models) align well with human preferences.

While using LLMs as judges is pretty common to score transcribed audio for spoken QA benchmarks, I haven’t seen anyone using an LLM to directly score audio responses (or standalone TTS systems for that matter). Evaluating the quality of generated speech is really hard. Evaluating whether that speech is appropriate for the prompt is even harder.

If you read TTS papers, you will usually see two types of evals: human-annotated Mean Opinion Score (MOS) and WER on generated and then transcribed speech. The former probably gets at the quality reasonably well, but setting up human evals is hard and the results tend to be noisy. The latter only determines whether the speech is intelligible.
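The WER half is mechanical enough to sketch in a few lines: synthesize audio for a known text, run it back through an ASR model, and score the round trip. A minimal sketch, assuming the jiwer package for scoring and treating the TTS and ASR calls as placeholders you’d swap in yourself:

```python
# Sketch of the generate -> transcribe -> WER loop used to sanity-check a TTS system.
# synthesize() and transcribe() are hypothetical stand-ins for your TTS system and
# ASR model of choice; jiwer is a real package that computes word error rate.
import jiwer

def tts_round_trip_wer(prompts, synthesize, transcribe) -> float:
    """Average WER over a list of text prompts run through TTS and back through ASR."""
    references, hypotheses = [], []
    for text in prompts:
        audio_path = synthesize(text)              # TTS: text -> audio file
        hypotheses.append(transcribe(audio_path))  # ASR: audio file -> text
        references.append(text)
    # jiwer aligns references against hypotheses and returns a single WER score
    return jiwer.wer(references, hypotheses)
```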

Wouldn’t it be nice if you could just pass audio directly to a machine that would tell you how good it was across whatever axes you care about? The answer is yes.

I generated a bunch of clips of my voice using a number of OSS TTS systems and passed them to GPT-4o, Qwen2-Audio, and Gemini Experimental 1114. From my previous experiments, I already knew generally which systems were good and which were bad, so I had ground truth labels.

Qwen2-Audio was quite impressive at identifying characteristics of an individual clip but, due to its architecture, was not able to accurately rank samples side by side. GPT-4o was, surprisingly, totally useless: it would analyze the spectrogram but wouldn’t transcribe the audio or comment on characteristics like naturalness.

Fortunately, the fine folks at Google delivered. Gemini Experimental 1114 is an excellent judge of audio. It transcribes the audio, identifies issues with pace, prosody, tone, and so on, and has nearly 100% agreement with my own human judgments. I would recommend that others experimenting with audio give this a try. It may well be possible to iterate on fine-tuning mixtures against Gemini’s judgments, just as people did with GPT-4 in the old days. Gemini-as-a-judge coming soon for audio papers?
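If you want to try this yourself, the setup is minimal. Here’s a rough sketch using the google-generativeai Python SDK’s file upload; the model name and rubric prompt are illustrative, so adapt them to whatever axes you care about:

```python
# Minimal sketch: pass a generated TTS clip directly to Gemini and ask for a critique.
# Assumes the google-generativeai SDK; the rubric prompt and model name are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-exp-1114")  # the experimental model name at the time

RUBRIC = (
    "You are judging a text-to-speech sample. Transcribe the audio, then rate it 1-5 on "
    "naturalness, pace, prosody, and overall quality, and describe any artifacts you hear. "
    "Respond as JSON."
)

def judge_clip(path: str) -> str:
    """Upload one audio clip and return Gemini's structured critique."""
    clip = genai.upload_file(path)                    # the File API accepts wav/mp3 uploads
    response = model.generate_content([RUBRIC, clip])
    return response.text

print(judge_clip("tts_sample_01.wav"))
```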

Hail King Gemini Pt II: #1 in the Chatbot Arena in math without reasoning tokens

Along with its impressive LLM-as-a-judge capabilities, Gemini took the #1 spot in the Chatbot Arena (#4 when controlling for style; they are overfitting a bit) and, more importantly, #1 for math. When o1-preview was released and blew away other models on reasoning tasks (science, math, coding, etc.), I was certainly impressed with the results, but I could wave my hands and say, “well, this is a new thing and no one has really tried inference-time reasoning, so maybe this is to be expected.”

The latest Gemini model seems to perform on par with o1-preview (based on the Chatbot Arena) or a bit worse (based on me trying a few GPQA and AIME questions), all with no reasoning tokens. As far as I can tell, this is just a normal chatbot with substantially improved quantitative skills. The model does tend to respond in a CoT style, so their approach isn’t totally different, but Gemini gets to the answer much more quickly, with far fewer tokens. It’s totally possible that these are two variants of the same thing and Gemini is just showing the RL’ed reasoning tokens to the user while o1-preview hides them, but I think something different and cool is going on. I look forward to the technical report.

EDIT: as this was going to press, OpenAI just reclaimed the top spot with ChatGPT-4o-latest, so apparently they have also figured out how to solve these hard problems without reasoning tokens.

AI product management with Kevin and Mike

Two former Meta execs are now at the product helm for two of the leading AI companies. They sat down last week to share their thoughts on AI product management with the world. Their interview really resonated with me, so I wanted to pick out a few key takeaways.

All AI PMs are research PMs

Every PM working on an AI-first product (chatbot, companion, assistant, etc.) needs to be fluent in model building. It used to be possible, and common, to separate the modeling from the feature for products involving recommender systems and content understanding. For most AI products, the model is the feature. Without a deep understanding of how the model was trained and evaluated, it is very hard to build and iterate on a compelling product experience.

Corollary: Model quality is product quality 

The experiences enabled by AI products are gated on the model quality. ChatGPT would not have been compelling with GPT-2. Artifacts does not make sense without very strong coding abilities. SearchGPT does not work without strong grounding. While I do think that we can do a lot more to build new products and features with existing models, ultimately model quality is at the foundation of everything that exists today.

AI PMs fluently write evals

AI products can generally do an uncountable number of things. Because they are so flexible and multifaceted, writing a PRD is almost an unbounded task. I am very confident that no PM at OpenAI sat down and listed everything that ChatGPT should do before launch.

So how do AI PMs communicate product capabilities? They do it through evals. In addition to selecting pre-existing evals and performance targets, they help create new ones. Want a model that can write better poetry? What does that mean? The best way to communicate that is to assemble N prompts and golden responses and design a way to determine if the model is doing a good job at the task. The final model will be evaluated on that and other important characteristics. This is much cleaner than writing a general description of a capability, which different people can interpret in different ways.
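As a toy illustration of what “writing an eval” actually means, here is a sketch of a prompts-plus-golden-responses harness. The keyword scorer is deliberately dumb; designing a better scoring rule (or an LLM judge) is exactly the part a PM would spend time on:

```python
# Toy eval harness: N prompts, golden responses, and a scoring rule.
# generate() stands in for whatever model endpoint is being evaluated; the
# keyword-overlap scorer is a placeholder for a real rubric or LLM judge.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    golden: str            # a reference response the PM considers "good"
    keywords: list[str]    # things any good response should mention

def score(response: str, case: EvalCase) -> float:
    """Fraction of required keywords present in the model response."""
    hits = sum(kw.lower() in response.lower() for kw in case.keywords)
    return hits / len(case.keywords)

def run_eval(cases: list[EvalCase], generate) -> float:
    """Average score across the eval set for a given model function."""
    return sum(score(generate(c.prompt), c) for c in cases) / len(cases)

poetry_eval = [
    EvalCase(
        prompt="Write a haiku about gradient descent.",
        golden="Down the quiet slope / each step smaller than the last / loss settles at dusk",
        keywords=["slope", "step", "loss"],
    ),
    # ...dozens more cases covering the capability from different angles
]
```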

Corollary: AI PMs look at data

In addition to going deep on evals, AI PMs go deep on data. Post-training is the most PM-heavy part of the model training process, and it’s all about the data mixture. AI PMs randomly sample and carefully examine the mixture and develop an intuitive sense as to whether it will deliver the desired outcome.

When evals are run, even if results are as expected, the AI PM looks hard at individual examples where the model was scored as either succeeding or failing to make sure that the numbers are telling the whole story.
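In practice that review can be as simple as pulling a balanced random slice of graded examples into a notebook and reading them. A sketch, assuming eval results live in a pandas DataFrame with prompt, response, and score columns:

```python
# Pull a balanced random sample of passing and failing eval examples for manual review.
# Assumes results sit in a pandas DataFrame with "prompt", "response", and "score" columns.
import pandas as pd

def sample_for_review(results: pd.DataFrame, n: int = 20, seed: int = 0) -> pd.DataFrame:
    """Return up to n passing and n failing examples, shuffled together for reading."""
    passed = results[results["score"] >= 0.5]
    failed = results[results["score"] < 0.5]
    sample = pd.concat([
        passed.sample(min(n, len(passed)), random_state=seed),
        failed.sample(min(n, len(failed)), random_state=seed),
    ])
    return sample.sample(frac=1, random_state=seed)  # shuffle so passes and failures interleave
```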

AI PMs are AI native

AI PMs use AI at work all day. Need to automate a boring task? Ask an AI. Need to develop a prototype? Build it with AI. Need to learn a new topic? Have an AI teach it to you. This may seem obvious, but it’s not. An AI PM should look to an AI before a human for every task. If that’s not possible, a new product idea is born and added to the backlog.

China is cooking

Last issue, I covered Tencent’s very strong entrant into the open LLM arena. Since then, Alibaba has shipped several strong improvements to their Qwen series (Qwen2.5-Coder outperforms GPT-4o and 3.5 Sonnet; long context), and DeepSeek announced access to their reasoning model that claims to dethrone o1-preview.

What does this mean?

  • Meta has serious competition in the open LLM space. While these models don’t seem to have much penetration into the developer ecosystem or the enterprise, they are putting up great numbers. At some point, if pure model performance is good enough, developers will abandon Llama and its strong ecosystem for another model. The heat is on.
  • These Chinese companies are releasing very strong models while under a chip embargo. The best Nvidia chips cannot be exported to China! Chinese companies can still access US datacenters from within the US, but I don’t think that’s what’s going on here (some of this is published). These companies are being very clever with older hardware and frugal with FLOPs. I suspect these models would be a step better with access to 100k H100 clusters, which is coming. US-based leaderboards have largely been ignoring these releases, but all of these companies have a shot at the top as soon as they unlock the compute they need.

One thing I don’t understand is why these companies are releasing the models. DeepSeek does seem to be giving the API business the old college try (they are really cheap), but it doesn’t seem super serious. As far as I can tell, Tencent’s and Alibaba’s chatbots are not available to people who don’t speak Chinese, and developer support for their models is pretty weak. Is it for recruiting? Prestige? Fun? I totally get Meta’s OSS AI strategy, but I’m a bit perplexed here.

Where does Mistral go?

Mistral just released a new frontier-ish model and upgraded their chat interface. This is an impressive accomplishment, but I’m starting to wonder what the endgame of the mid-tail of model providers is. When a few key members of the original Llama team peeled away to found Mistral, they very quickly released a 7B model that was qualitatively stronger than anything else out there. I was optimistic that they would keep running.

But two things happened. First, a huge flood of actors decided that open models were a good idea. Developers now have a choice of great models from Meta, DeepSeek, Alibaba, Cohere, Tencent, Snowflake, x.AI, and probably others in addition to Mistral. A great open model is no longer enough to drive adoption and attention.

Second, OpenAI, Anthropic, Meta, and others just keep shipping. Their chat interfaces look nothing like they did six months ago. They format results nicely with images interleaved. They execute code. They search the web and provide references. Mistral does 70% of this with their new interface, but competitors have gotten sticky. It’s not super clear to me why consumers would switch.

My hypothesis is that they may pivot to enterprise, but this seems like a tough path as well. Cohere started in this direction, but a fine-tuned enterprise-specific foundation model from six months ago underperforms a generic one today. BloombergGPT and MedPaLM were both crushed by GPT-4 months after publication, despite their deep repositories of proprietary data and focused fine-tuning.

My current thinking is that the AI model landscape will be a very expensive winner-take-all proposition. I’m rooting for everyone to succeed, but I don’t quite see a path for Mistral and the other mid-market players. I tip my hat to Character for realizing this early and focusing on building the best AI companion product on top of the model arms race.

The revenge of the AI grannies

If you haven’t already seen this, it is too funny not to share. Virgin Media O2 trained a granny-themed audio LLM named Daisy that will hold endless conversations about nothing. They have deployed it on scam calls to waste the scammers’ time and have managed to draw conversations out to almost an hour. I would love it if I could deploy my own personal Daisy to answer every call from an unknown number and have another model clip the highlights and post them to social media.