Lessons from writing this newsletter
I started writing this newsletter 18 months ago with the explicit goal of forcing myself to more deeply understand what I read (“Men learn while they teach.” –Seneca the Younger). I had no expectations, but I am quite happy to share that writing this has accomplished not only my original goal but so much more.
Through my writing, I’ve met many people both IRL and online, helped several people join our team, won a PMmmy, and improved how I structure my thoughts more generally. Writing has been a lot of work, but I recommend it to anyone wanting to learn more about a topic. I occasionally get asked for tips on getting started, so I wanted to share them here:
- Muse over cadence. I started by trying to write once a week. But there are some weeks where decades happen and you need to adapt. I’ve transitioned to writing whenever I feel like I have something to say. As of this week, I renamed this newsletter “Dan’s Voice AI Voice” to reflect the new cadence and that apparently speech is called voice these days.
- Consistency is key. I committed to writing this newsletter for a year. There were weeks where few people read it and weeks where I got tons of feedback. If I hadn’t made that commitment, I would have quit for sure.
- Find your voice. Working in big tech programs you to communicate in a specific way. This programming is designed to maximize clarity across a very large organization but is sterile and saps enjoyment from writing. It took me a few laps to realize that this was my newsletter and I could write whatever I wanted.
NTP is all you need part II: A collection of SOTA ASR models
Next-token prediction has come for audio. Every week, a team bolts together an OSS tokenizer and LLM backbone and trains it to do TTS. The first few took the internet by storm, but now no one bats an eye when they hear a hypernatural and expressive synthetic voice.
This approach has spread to ASR, with a handful of releases starting with GPT-4o-transcribe and, more recently, Mistral’s Voxtral and Kyutai’s STT. Because these are LLM-based systems, they can easily extend to multiple tasks, including audio understanding, QA, diarization, VAD, and more. The line between traditional speech tasks and audio language modeling has been all but erased.
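The pattern behind all of these models is simple enough to sketch: quantize audio frames into discrete codebook indices, then hand the LLM one long sequence of audio tokens followed by the text tokens it must predict. A toy illustration follows; the codebook, frame size, and `<transcribe>` marker are all made up for this sketch, not taken from any real model, which would use a learned tokenizer (e.g. residual vector quantization) and a real LLM backbone.

```python
import numpy as np

def tokenize_audio(waveform, codebook):
    """Quantize fixed-size audio frames to nearest-codebook indices."""
    frame_size = codebook.shape[1]
    n_frames = len(waveform) // frame_size
    frames = waveform[: n_frames * frame_size].reshape(n_frames, frame_size)
    # Nearest neighbor in the codebook for each frame -> one discrete token id.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 "audio tokens", 4 samples per frame
waveform = rng.normal(size=100)       # one toy utterance
audio_tokens = tokenize_audio(waveform, codebook)

# ASR as next-token prediction: the LLM sees [audio tokens, <transcribe>, ...]
# and is trained to emit the transcript tokens autoregressively after the marker.
sequence = list(audio_tokens) + ["<transcribe>"]  # text tokens would follow
print(len(audio_tokens))  # → 25
```

Swap the task marker and the same sequence format covers diarization, QA, and the rest, which is why these models extend across tasks so easily.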
And hilariously, all of these models claim SOTA, a label that I think has lost all meaning in ASR. Any near-frontier ASR model will transcribe essentially any academic dataset better than a human, and each of these models conveniently benchmarks on a different dataset. Context ultimately drives ASR performance: some teams need to target challenging medical domains, others casual social media videos.
But which T with Pooneh Mousavi
If you are just getting onto the NTP train for audio, figuring out which tokenizer makes the most sense for your use case is half the battle (after all, these models are just tokenizer + data). Pooneh Mousavi, the host of the really cool Conversational AI reading group, put together a neat tool and paper comparing the tradeoffs among different tokenizers. The paper is super dense, but my general takeaway is that tokenizer selection really matters and needs to be matched to the task. My only feedback is that I wish I could listen to examples reconstructed by different tokenizers to really internalize the strengths and weaknesses of these algorithms.
ChatGPT for doctors is… OpenEvidence?
Last month, I asked whether someone needs to build a ChatGPT for doctors or whether ChatGPT itself is ChatGPT for doctors. Since then, my thinking has leaned toward the latter, but a company called OpenEvidence has come out of nowhere, apparently signed up 40% of US healthcare providers, and raised at a $3.5B valuation.
I spent some time comparing ChatGPT responses (o3/4o/DeepResearch) to OpenEvidence on some hard medical questions and my conclusion is that OpenEvidence must have the most insanely good GTM team because I don’t get the product. In all cases, the vanilla ChatGPT responses were better. The only differentiator I could see is that OpenEvidence displayed a large “HIPAA Compliant” banner on my screen and more prominently linked to academic sources.
While our eyes all popped at the Cursor and Windsurf valuations, those OpenAI wrappers did create something meaningfully different that isn’t available in the core product. As far as I can tell, OpenEvidence is a nearly identical ChatGPT clone marketed at a specific group. If anyone can explain to me what’s going on here, please reach out.
Doctors are cooked pt IV
Microsoft created a new clinical benchmark, SDBench, composed of 304 sequential medical encounters (unlike OpenAI’s HealthBench, where a diagnosis is given after a single description of the patient’s condition). To me, this feels like a much more realistic scenario. It is rare that all the details of a medical condition are available at the outset and doctors frequently need to continue to probe and test to draw firm conclusions.
They also developed a unique scoring mechanism: they obviously tracked accuracy, but also the cost of delivering the care (note that this is not the cost of the tokens; it is the cost of the tests and appointments the model ordered).
Of course, as with most evals papers, they also built an orchestrated system that hit that precious Pareto frontier, but the most interesting result to my mind was how the models did relative to humans. The best human scored 40% while spending $3,000 on care; the mean human scored only 20% while spending the same. Meanwhile, GPT-4o (50%/$2,500), Gemini 2.5 Pro (70%/$4,500), o3 (80%/$7,500), and Microsoft’s own MAI-DxO (80%/$2,500) absolutely crush the humans.
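For intuition on what "hitting the Pareto frontier" means here, the dominance check over those published accuracy/cost numbers can be sketched in a few lines. This is generic Pareto filtering written by me, not SDBench's actual scoring code:

```python
# (accuracy, cost in $) per diagnostician, from the numbers quoted above.
results = {
    "mean human":     (0.20, 3000),
    "best human":     (0.40, 3000),
    "GPT-4o":         (0.50, 2500),
    "Gemini 2.5 Pro": (0.70, 4500),
    "o3":             (0.80, 7500),
    "MAI-DxO":        (0.80, 2500),
}

def pareto_frontier(points):
    """Keep entries not dominated by another entry that has
    accuracy >= theirs AND cost <= theirs, with at least one strict."""
    frontier = []
    for name, (acc, cost) in points.items():
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for other, (a, c) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(results))  # → ['MAI-DxO']
```

On these six points only MAI-DxO survives: every other entry is matched or beaten on accuracy by something that also costs the same or less.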
This is the fourth major paper that shows AI significantly outperforms doctors at diagnostic tasks. If you aren’t working with an AI paired with a doctor for your healthcare, you are not getting the best care.
Verifiability maxxxing
I wrote a post on writing evals whose thesis was basically that AI PMs should start communicating through evals rather than more traditional product documents. Buried in the post was a discussion of the importance of designing consistent judges with a high degree of interannotator agreement, meaning the pool of annotators, whether humans, AIs, or algorithms, would consistently agree on a score across the domain where the eval is valid.
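One standard way to put a number on that agreement is Cohen's kappa, which corrects raw agreement between two annotators for the agreement you'd expect by chance. A minimal sketch, with made-up pass/fail labels from two hypothetical judges:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / n**2
    return (observed - expected) / (1 - expected)

# Illustrative labels: 1 = pass, 0 = fail, one entry per eval example.
judge_1 = [1, 1, 0, 1, 0, 1, 1, 0]
judge_2 = [1, 1, 0, 1, 1, 1, 1, 0]
print(round(cohens_kappa(judge_1, judge_2), 2))  # → 0.71
```

The same check works whether the "judges" are human annotators, two LLM judge prompts, or a judge versus a deterministic grader; low kappa means the eval's scores are noise regardless of how good any single judge looks.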
Jason Wei, a researcher at OpenAI who is apparently joining Meta, just wrote a nice blog post extending this idea to model training. He posits that any problem that can be easily verified will eventually be solved by AI. This is a profound statement. It means that if we are good enough at breaking down any problem into verifiable components, we can RL a model into superintelligent performance.
While this has already happened with math and code to some extent, how amazing will it be when we continue to build verifiable problems in domains like medicine where we already have HealthBench and SDBench?