Diffusion-Parakeet

A natural-sounding, conversational text-to-speech diffusion model

Overview

In a previous blog post, we described Parakeet, an autoregressive approach to generating natural, conversational speech. Here, we introduce our more recent work that explores using diffusion instead of autoregression.

Capability: Celebrity voice cloning
Text prompt: [S1] Hey, this is Joe Biden. This is a special message for Gee Zhu, wait, uh, Guh Zhu, sorry, I think I'm pronouncing this wrong. I heard you're working on, uh, music generation models.
Audio:
Prompt:
Generated:

Capability: Conversational cloning (NotebookLM style)
Text prompt: [S1] So we're talking about this paper, right, "Attention is All You Need." Sounds simple enough, like some self-help mantra or something. [S2] Yeah, totally. Like, pay attention, kids. [S1] But here's the thing - it's sneaky, man. This paper, it just comes out of nowhere in 2017 and completely upends the whole field of machine translation. (Claude-generated text prompt)
Audio:
Prompt:
Generated:

Capability: Conversational cloning (real podcast)
Text prompt: [S1] I've always wanted to learn how to play the guitar. [S2] What kind of guitar do you have in mind? [S1] Um, I'm not sure, I guess I'd, uh, like to learn to play both acoustic and electric. [S2] Yeah, that's a great idea (laughs). Both types of guitars have their own, uh, their own unique sounds and, uh, and playing styles. (Text prompt from Google's SoundStorm)
Audio:
Prompt:
Generated:

Capability: Singing cloning attempt
Text prompt: [S1] (singing) Is this the real life, or is this just fantasy. Caught in a landslide, no escape from reality. Open your eyes, look up to the skies and see.
Audio:
Prompt (autoencoder reconstruction, not good singing quality):
Generated: