
Investigate prompting as a tool to zero-shot condition both the S2A and T2S models

Open · jpc opened this issue · 2 comments

This could also allow us to:

  1. zero-shot clone the voice (and prosody) of an existing recording
  2. generate a few random samples and then freeze the style we like most for subsequent generations.

jpc · Jan 19 '24

Hi, regarding point 2 (freezing one style), have you considered StyleTTS 2's approach (see section B.3)?

> Our findings indicate that style diffusion creates significant variation in samples, a characteristic that poses challenges for long-form synthesis. In this scenario, a long paragraph is usually divided into smaller sentences for generation, sentence by sentence, in the same way as real-time applications. Using an independent style for each sentence may generate speech that appears inconsistent due to differences in speaking styles. Conversely, maintaining the same style from the first sentence throughout the entire paragraph results in monotonic, unnatural, and robotic-sounding speech.
>
> We empirically observe that the latent space underlying the style vectors generally forms a convex space. Consequently, a convex combination of two style vectors yields another style vector, with the speaking style somewhere between the original two. This allows us to condition the style of the current sentence on the previous sentence through a simple convex combination. The pseudocode of this algorithm, which uses interpolated style vectors, is provided in Algorithm 1.
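The interpolation trick could be sketched roughly like this (a minimal sketch, not StyleTTS 2's actual Algorithm 1: `sample_style` and `synthesize` are hypothetical stand-ins for the model's style sampler and acoustic decoder, and the weight `alpha` is an assumed smoothing parameter):

```python
import numpy as np

def interpolate_style(prev_style, new_style, alpha=0.7):
    """Convex combination of two style vectors, weighted toward the previous
    style so consecutive sentences drift smoothly rather than jumping."""
    assert 0.0 <= alpha <= 1.0
    return alpha * prev_style + (1.0 - alpha) * new_style

def synthesize_paragraph(sentences, sample_style, synthesize, alpha=0.7):
    """Generate a long paragraph sentence by sentence, conditioning each
    sentence's style on the previous one via convex combination."""
    style = None
    outputs = []
    for sentence in sentences:
        candidate = sample_style(sentence)  # e.g. a fresh style-diffusion sample
        style = candidate if style is None else interpolate_style(style, candidate, alpha)
        outputs.append(synthesize(sentence, style))
    return outputs
```

Because the combination is convex, the resulting vector stays inside the style space, so each sentence sounds like a plausible style that is close to (but not identical to) the previous one.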

fakerybakery · Jan 19 '24

Hey, thanks for the tip. I skimmed the StyleTTS 2 paper before but maybe I'll read it again more carefully. :)

jpc · Jan 28 '24