A natural-sounding, conversational text-to-speech diffusion model
In a previous blog post, we described Parakeet, an autoregressive approach to generating natural, conversational speech. Here, we introduce our more recent work that explores using diffusion instead of autoregression.
Capability | Text prompt | Audio |
---|---|---|
Celebrity Voice Cloning | [S1] Hey, this is Joe Biden. This is a special message for Gee Zhu, wait, uh, Guh Zhu, sorry, I think I'm pronouncing this wrong. I heard you're working on, uh, music generation models. | Prompt: Generated: |
Conversational cloning (NotebookLM style) | [S1] So we're talking about this paper, right, "Attention is All You Need." Sounds simple enough, like some self-help mantra or something. [S2] Yeah, totally. Like, pay attention, kids. [S1] But here's the thing - it's sneaky, man. This paper, it just comes out of nowhere in 2017 and completely upends the whole field of machine translation. (Claude-generated text prompt) | Prompt: Generated: |
Conversational cloning (real podcast) | [S1] I've always wanted to learn how to play the guitar. [S2] What kind of guitar do you have in mind? [S1] Um, I'm not sure, I guess I'd, uh, like to learn to play both acoustic and electric. [S2] Yeah, that's a great idea (laughs). Both types of guitars have their own, uh, their own unique sounds and, uh, and playing styles. (Text prompt from Google's SoundStorm) | Prompt: Generated: |
Singing cloning attempt | [S1] (singing) Is this the real life, or is this just fantasy. Caught in a landslide, no escape from reality. Open your eyes, look up to the skies and see. | Prompt (autoencoder reconstruction, not good singing quality): Generated: |