Dan’s Weekly AI Speech and Language Scoop #39

Marketing is all you need (for OSS voice cloning)

Zyphra dropped a viral tweet announcing Zonos, their OSS TTS model capable of very impressive voice cloning. The linked demo is truly amazing. I don’t know if I could differentiate the Zonos Obama from the actual thing.

Unfortunately, reality was not so kind to Zonos. I wasn’t able to reproduce this impressive performance with either their HF space or their hosted demo.

Despite lighting social media ablaze with their launch, the OSS VC throne is safe. And as always, if someone from Zyphra sees this, please let me know if I was doing something wrong and I will correct this post.

Llasa aka CloneLlama is the new sheriff in OSS VC town

Fortunately, the fine folks at the Hong Kong University of Science and Technology are better at voice cloning than marketing (which in this case was handled by a HuggingFace employee) and released a new, seemingly zero-shot voice cloning (ZC VC) model called Llasa. They extend Llama 3 to support text-to-speech and claim it can capture the emotion of the underlying text. I was not able to actually demonstrate this behavior, but it seems cool in theory.
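
For anyone who wants to poke at it, here is a minimal sketch of the recipe Llasa describes: a Llama backbone whose vocabulary is extended with discrete speech-codec tokens, so TTS becomes ordinary next-token prediction. The checkpoint name, prompt tokens, and sampling settings below are assumptions for illustration, not the project’s documented interface.

```
# Sketch of the Llasa-style recipe: a Llama model whose vocabulary includes
# discrete speech-codec tokens, so speech generation is just next-token
# prediction. Checkpoint name and prompt format are assumed, not documented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HKUSTAudio/Llasa-3B"  # assumed checkpoint name; check the HF page

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).eval()

# Hypothetical prompt: text in, speech tokens out. Prepending a reference clip's
# codec tokens here is what would steer the voice (the zero-shot cloning part).
prompt = "<|text_start|>The quick brown fox jumps over the lazy dog.<|text_end|><|speech_start|>"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    speech_token_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, top_p=0.95)

# The generated IDs index a neural codec's codebook; turning them back into a
# waveform requires the matching codec model (X-codec2 in Llasa's case), omitted here.
print(speech_token_ids.shape)
```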

Fish Audio’s OSS sleight of hand

Playing with OSS models locally is a PITA. Even with perfect documentation, setting up a project can take an hour and most projects I’ve encountered require a significant amount of dependency debugging to function.

Many OSS companies offer a hosted version as a courtesy to try out. I previously reported that Fish Audio was the OSS king of ZC VC, but I realized this week that this is not true. While their website references OSS and links to their GitHub and HF profiles, the version hosted on their website is significantly better than the one hosted in their HF space (in their defense, their HF space does explain this, but getting there requires a lot more diligence than just clicking their demo link).

While Fish’s cloning is still strong, I now move them into the closed category, where they need to compete with 11Labs, Play.ai, and others who are ahead. Their OSS ranking falls below VoiceCraft and both of the options mentioned above.

And more broadly, I wonder how common this is. So few people check how the models run locally or bother to set up a space that I suspect a lot of OSS startups grab attention with something they imply is OSS but is not, as evidenced by the fact that I could not repro the Zonos demo above.

Kyutai does it again

Kyutai, the team behind the famously fun full-duplex Moshi model, dropped a real-time speech translation intern project called Hibiki. It doesn’t seem to be any different from or better than Heygen or any number of other companies that have been experimenting with this for a while, but they are very open about their approach, and it builds on their clever work with Moshi.

Gemini Deep Research is mid: do we blame the model or the web?

I subscribed to Gemini Advanced to tap into the Google Deep Research hype. I tapped in and left disappointed. Despite theoretically having access to the entire internet, this product totally and utterly failed to answer basic questions I had like:

  • What is the best zero-shot voice cloning project?
  • Can you put together a calendar of electronic music events in the Vail valley?
  • How should I adapt my lifting plan to my body and diet?


It wasn’t that it didn’t answer the questions. It did. The answers were just mid, like it blended every spam website into some kind of tasteless porridge.


I tried to understand what was going on and I really couldn’t. The sources it cited were mostly garbage because the internet is mostly garbage. Presumably, Google could filter their complete corpus of the internet and find the best sources? Or their model could weight input text by quality? Or maybe the internet is just so full of garbage that this task is impossible without human intervention? I have no idea which is the bigger problem, but the product essentially did not work for me despite the hype.
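
To make the “weight input text by quality” idea concrete, here is a toy sketch of re-ranking retrieved sources with a crude quality prior before they reach the report writer. The domains, scores, and scoring rule are all made up for illustration and imply nothing about how Deep Research actually works.

```
# Toy re-ranking: combine retrieval relevance with a hand-rolled domain prior so
# content farms sink to the bottom. Entirely hypothetical, for illustration only.
from dataclasses import dataclass

@dataclass
class Source:
    url: str
    relevance: float  # retrieval score in [0, 1]

# Made-up prior: trust primary docs and papers more than SEO listicles.
DOMAIN_PRIOR = {"arxiv.org": 1.0, "github.com": 0.9, "spam-listicle.example": 0.1}

def quality_weight(src: Source) -> float:
    domain = src.url.split("/")[2]
    return DOMAIN_PRIOR.get(domain, 0.4) * src.relevance  # unknown domains get a middling prior

sources = [
    Source("https://arxiv.org/abs/2402.00000", 0.7),
    Source("https://spam-listicle.example/top-10-voice-cloners", 0.9),
]
for s in sorted(sources, key=quality_weight, reverse=True):
    print(f"{quality_weight(s):.2f}  {s.url}")
```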

For the questions that did not require up-to-date information, I found far better answers with OpenAI’s reasoning models, but OpenAI search performed just as poorly for the ones that did. I don’t have access to their copycat Deep Research product yet, but I look forward to side-by-side comparisons when I do. This is a hard problem.

From LIMA to LIMO

Researchers at Meta wrote a paper called LIMA (Less Is More for Alignment) during the peak degen LLM instruction tuning days, showing that 1,000 high-quality examples are enough to achieve SOTA in human preference (aside: those were the good ol’ days).

Another team recently published LIMO (Less Is More for Reasoning), which demonstrates that a similar heuristic applies to reasoning models. They show that they can almost 10x performance on AIME and 2x performance on MATH from a base Qwen-32B model with just 817 curated training samples.
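
The recipe itself is nothing exotic: plain supervised fine-tuning of a strong base model on a few hundred carefully curated reasoning traces. A minimal sketch with the HuggingFace Trainer follows; the model, file name, and hyperparameters are placeholders rather than the paper’s exact setup.

```
# Sketch of the LIMO-style recipe: ordinary SFT on a tiny, carefully curated set
# of (problem, long-form solution) examples. All names and numbers are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-7B"  # stand-in; the paper fine-tunes a 32B Qwen base
tokenizer = AutoTokenizer.from_pretrained(BASE)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# ~800 curated examples stored as {"text": ...} rows in a local JSONL file.
data = load_dataset("json", data_files="limo_817_curated.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=4096)

data = data.map(tokenize, batched=True, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="limo-sft", num_train_epochs=3,
                           per_device_train_batch_size=1, gradient_accumulation_steps=8,
                           learning_rate=1e-5, bf16=True),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```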

Alibaba does full duplex

Full duplex speech models are in vogue these days. The fundamental idea is that humans can listen while speaking, so models should be able to as well. So far Kyutai, Standard Intelligence, and ByteDance have all released examples.

Alibaba just threw their hat into the ring with MinMo. They achieve very strong academic metrics on AIR-Bench and elsewhere for an 8B model and introduce a full-duplex decoder that switches the model from speaking to listening mode when the user interrupts. This technically isn’t true full duplex like Moshi and hertz-dev, but rather glues together VAD and an audio LLM; still, it does 80% of the job.
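
If you squint, that 80% solution is a turn-taking state machine: keep a VAD running on the mic while the agent talks, and flip back to listening on a barge-in. A toy sketch follows; every class and method name in it is hypothetical, not MinMo’s (or anyone’s) actual API.

```
# Toy "80% full duplex": a controller that glues a VAD to a speech LLM. While
# the agent speaks, the mic is still monitored; a detected barge-in halts
# playback and flips the agent back to listening. All names are hypothetical.
from enum import Enum, auto

class Mode(Enum):
    LISTENING = auto()
    SPEAKING = auto()

class DuplexController:
    def __init__(self, vad, speech_llm, audio_io):
        self.vad, self.llm, self.io = vad, speech_llm, audio_io
        self.mode = Mode.LISTENING

    def step(self, mic_frame):
        """Process one ~20 ms mic frame; called from the audio callback loop."""
        user_talking = self.vad.is_speech(mic_frame)

        if self.mode == Mode.SPEAKING:
            if user_talking:              # barge-in: stop talking, start listening
                self.io.stop_playback()
                self.mode = Mode.LISTENING
            return

        # LISTENING: stream audio into the LLM; once it decides the user's turn
        # is over, synthesize a reply and switch to speaking.
        self.llm.ingest(mic_frame)
        if not user_talking and self.llm.turn_complete():
            self.io.play(self.llm.respond())  # streamed TTS in a real system
            self.mode = Mode.SPEAKING
```

True full duplex in the Moshi/hertz-dev sense models both audio streams jointly inside the network rather than switching an external controller, which is the gap noted above.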

Andreessen on voice agents

a16z just published a nice market map of AI voice agents. I didn’t read this carefully, but it really struck me that 22% of F24 YC companies are building with voice! We see a flood of vertical voice agent startups and a healthy chunk of model companies building audio LLMs.