How to build an AI eval: HealthBench edition
A few weeks ago, I wrote a post sharing my thoughts on how PMs can contribute to building evals, using one I wrote as a guide.
While I’m sure that it is a coincidence, OpenAI followed my framework perfectly with HealthBench, their latest OSS’ed eval on medical diagnosis. If you are passionate about evaluating gen AI systems, I highly recommend you read their paper carefully.
Most importantly, look at how they created their judge rubric. Determining whether an answer to a complex health question is correct is very challenging; I suspect that inter-annotator agreement even between trained specialist physicians would be low. But they broke the problem into a number of subproblems, each of which can be accurately scored by both an LLM judge and a panel of human reviewers.
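To make that concrete, here is a minimal sketch of what criterion-level grading of this kind can look like. The criteria, point values, and scoring details below are my own illustrative assumptions, not OpenAI's exact implementation.

```python
# Sketch of rubric-based grading: many narrow yes/no criteria per question,
# each checkable by an LLM judge or a human reviewer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    text: str    # one narrow, checkable statement about the response
    points: int  # positive for desired behavior, negative for harmful behavior

def score(response: str, rubric: list[Criterion],
          judge: Callable[[str, str], bool]) -> float:
    """Grade one response against its rubric.

    judge(response, criterion_text) returns a yes/no verdict, normally from
    an LLM grader (or a human reviewer). Points for met criteria are summed,
    normalized by the total positive points, and clipped to [0, 1].
    """
    earned = sum(c.points for c in rubric if judge(response, c.text))
    possible = sum(c.points for c in rubric if c.points > 0)
    return max(0.0, min(1.0, earned / possible))

# Illustrative rubric for a single health question (criteria and point
# values are made up, not taken from the paper).
rubric = [
    Criterion("Advises seeking emergency care for the red-flag symptom", 7),
    Criterion("Asks a clarifying question about symptom duration", 3),
    Criterion("Asserts a definitive diagnosis without enough information", -6),
]

# Toy judge so the sketch runs end to end; swap in a real model call.
def toy_judge(response: str, criterion_text: str) -> bool:
    return "emergency" in criterion_text.lower() and "emergency" in response.lower()

print(score("Please go to the emergency department now.", rubric, toy_judge))  # 0.7
```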
In addition, they explicitly mention that they calibrated the benchmark to a difficulty range that separates current models and leaves some headroom before saturation, and they tied the evaluation closely to their product outcomes: not making a catastrophic mistake and helping users understand a medical problem.
Doctors are cooked: HealthBench edition
HealthBench is not only a great eval. It is also a great story.
Last year, a group of academic physicians showed that GPT-4 outperformed physicians at diagnosis. Importantly, this research was driven by the physicians themselves, not by an AI lab.
With HealthBench, OpenAI partially replicated and expanded upon that work. As in the previous effort, they showed that AI alone significantly outperformed physicians working without AI, but this time physicians with AI performed similarly to the AI alone rather than worse. I would chalk this up to more physicians now being familiar with AI and likely using it in their practice, whereas a year ago, when the previous study was performed, they may have been more skeptical.
I think the message to physicians and patients is clear: if you aren't using AI to help guide your care, you aren't receiving the best care. Anecdotally, having been involved in a few complex medical decisions recently, I can confirm this. The best AI models significantly outperform the median physician on all tasks and even match or confirm the conclusions of academic specialists. In my real-world experience, I have not encountered a single case where the AI was wrong and the human was correct.
Is there a startup pursuing this, or is the startup simply "doctors use ChatGPT"? I would be curious to hear the perspective of people closer to this.
Wut benchmarks? Vibes-based shipping brought to you by Gemini and (retroactively) ChatGPT
People in AI love looking at tables of numbers. Given how good models have gotten recently, it is very hard to tell which model is better without a carefully curated, hard, personal prompt set. A standard eval like GPQA is a great shorthand for figuring out which model to use.
That said, both Google and OpenAI recently shipped models on the basis of vibes (and presumably some internal evals) that regressed on essentially every benchmark. The latest Gemini 2.5 Pro underperforms the current production model across the board, yet people seem to love it. OpenAI just rolled back a model that they say had better numbers because people didn't like it.
Is this the beginning of the end of benchmarks? Are models now strong enough that it's just about how they make users feel? Or are the public benchmarks no longer representative of real-world use cases, and Google made the call based on an internal set?
Weak model + strong product = victory?
This is more of a hypothetical question. In Gemini, we have a clear case of an underperforming product powered by a strong model, something Josh Woodward is evidently trying quite hard to fix.
Are there any examples of the opposite? Is it possible to succeed with an AI product without the best model?
Goodhart has come for the Chatbot Arena, formalized
We've been saying that Goodhart has come for the Chatbot Arena for over a year here, but that intuition was always based on vibes and hand-waving.
Some researchers at Cohere and elsewhere have now formalized this argument, showing that top labs essentially p-hack the Arena to come out on top. Their analysis is interesting, but I would push back a bit on the notion that "these dynamics result in overfitting to Arena-specific dynamics rather than general model quality". Defining model quality is somewhat of a theme of this issue.
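To see why the "p-hack" framing fits, here is a toy illustration of the selection mechanism with made-up numbers (this is not the paper's analysis): if a lab privately tests many near-identical variants on noisy head-to-head votes and only publishes the best-looking one, the published number is inflated by selection alone.

```python
# Toy simulation of the selection effect: a lab tests N private variants of
# equal true quality on noisy head-to-head votes and publishes only the best.
import random

random.seed(0)
TRUE_WIN_RATE = 0.50   # every variant is genuinely a coin flip against the field
BATTLES = 300          # arena battles observed per private variant

def observed_win_rate() -> float:
    wins = sum(random.random() < TRUE_WIN_RATE for _ in range(BATTLES))
    return wins / BATTLES

for n_variants in (1, 5, 20, 100):
    best = max(observed_win_rate() for _ in range(n_variants))
    print(f"{n_variants:>3} private variants -> published win rate {best:.3f}")
# With one variant you publish roughly 0.50; with 100 variants you "discover"
# one that looks clearly better than a coin flip, with no real quality difference.
```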
Llama 4 got in a lot of trouble for submitting a chat-optimized model to the Arena, but I would argue that, for Meta, this is model quality. We are a social company, and chattiness and enthusiasm probably make sense. If you are Anthropic, optimizing for code, why would you even care what two random users think of your model's response? You care whether users get their projects done (this is already reflected to some extent in their relatively poor Arena scores).
The real problem is that the Chatbot Arena has somehow evolved into THE measure of LLM quality, when it really should be considered one of many.