
Long-form synthesis

Open fakerybakery opened this issue 10 months ago • 5 comments

Hi, congrats on the release!! Is long-form synthesis planned? Thank you!

fakerybakery avatar Apr 10 '24 23:04 fakerybakery

Currently we train on audio clips of at most 30 seconds. With @ylacombe we're looking at increasing the context length to support longer audio. ALiBi embeddings (or a variant thereof) look promising for this: https://arxiv.org/abs/2108.12409

As future work, it would be amazing if you could feed an entire chapter of an audiobook to the model and have it learn the prosody and intonation directly from training examples (with no guidance from the text prompt).

sanchit-gandhi avatar Apr 11 '24 11:04 sanchit-gandhi

That would be nice. I was wondering if it would be possible to use chunking, with previous chunks passed in as context, so the speech sounds natural across different speakers. (This would be nice for audiobooks with multiple characters.)
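To sketch what I mean, here's a minimal chunking helper in plain Python, independent of any TTS library. It splits text into sentences, packs them into chunks under a word budget, and carries the tail of the previous chunk along as context. The `word_budget` and `context_sentences` values are arbitrary placeholders for illustration, not anything Parler-TTS defines:

```python
import re

def chunk_text(text, word_budget=60, context_sentences=1):
    """Split text into sentence-aligned chunks under a word budget.

    Each chunk is returned together with the last `context_sentences`
    sentences of the previous chunk, which a context-aware model could
    use to keep prosody and speaker identity consistent across joins.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], []
    for sentence in sentences:
        words_so_far = sum(len(s.split()) for s in current)
        if current and words_so_far + len(sentence.split()) > word_budget:
            chunks.append(current)
            current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    # Pair each chunk with trailing context from the previous chunk.
    result = []
    for i, chunk in enumerate(chunks):
        context = chunks[i - 1][-context_sentences:] if i > 0 else []
        result.append((" ".join(context), " ".join(chunk)))
    return result
```

Each returned pair is `(context, chunk)`; only `chunk` would actually be synthesized, with `context` conditioning the model somehow.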

fakerybakery avatar Apr 11 '24 17:04 fakerybakery

> Currently we train on a maximum of 30-second audios. With @ylacombe we're looking at increasing the context length to potentially longer audio lengths. Alibi embeddings (or a variant thereof) look promising for this https://arxiv.org/abs/2108.12409
>
> As a future works, it would be amazing if you could feed an entire chapter of an audiobook to the model, and have it learn the prosody and intonation directly from training examples (with no guidance from the text prompt)

Are there any updates about long-form speech synthesis? I'm looking forward to the results. Also, the future work you mentioned sounds especially applicable to the audiobook scene, but I'm curious what the voice would be like. A pre-defined voice?

lmxue avatar May 02 '24 08:05 lmxue

Attached is an example of a "longish" form TTS using the large model.

The source text is a couple of starting sentences from: https://reactormag.com/reprints-zeros-peter-watts/

After substantially reducing the input text length, the TTS output is fine. I'm not sure exactly how 30 seconds translates to a word count, but in my testing Parler-TTS "breaks" somewhere below 100 words.

I've now started testing batch processing of longer texts split into sentences.
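For a rough sense of how the 30-second limit maps to words, here's a back-of-the-envelope helper. The speaking rate is an assumption on my part (English narration is commonly cited around 130-160 words per minute), not anything Parler-TTS exposes:

```python
def max_words_for_duration(seconds, words_per_minute=150):
    """Estimate how many words fit in a clip of the given length,
    assuming a typical English narration rate (an assumption here)."""
    return int(seconds / 60 * words_per_minute)
```

At ~150 wpm, 30 seconds works out to roughly 75 words, which is at least consistent with the model breaking somewhere below 100 words.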

some 200 words.zip

mjaniec2013 avatar Aug 26 '24 20:08 mjaniec2013

When I split a text into 14 sentences, ran batch TTS, and combined the output, each sentence was narrated with a different voice, despite using the same voice description for every sentence when tokenizing the input:

inputs = tts_tokenizer([voice_description] * sentence_count, return_tensors="pt", padding=True).to(torch_device)

Additionally, the volume of the voices varied. While some generated sentences were of good quality, others (a minority) were nearly unintelligible.
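One workaround I'm considering for the volume variation, sketched here with only the standard library: normalize each generated chunk to a common RMS level before concatenating, and crossfade at the joins to hide level discontinuities. The target RMS and fade length are arbitrary choices, and this of course does nothing about the voice identity changing:

```python
import math

def rms_normalize(samples, target_rms=0.1):
    """Scale a chunk of float audio samples to a common RMS level."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return list(samples)
    gain = target_rms / rms
    return [s * gain for s in samples]

def crossfade_concat(chunks, fade=32):
    """Concatenate chunks with a short linear crossfade at each join."""
    out = list(chunks[0])
    for chunk in chunks[1:]:
        n = min(fade, len(out), len(chunk))
        for i in range(n):
            w = (i + 1) / (n + 1)  # ramp weight from old chunk to new chunk
            out[-n + i] = out[-n + i] * (1 - w) + chunk[i] * w
        out.extend(chunk[n:])
    return out
```

In practice the waveforms come back as numpy arrays from `model.generate`, so the same idea would be written with vectorized operations, but the logic is identical.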

Basic Model Settings:

  • Model: parler-tts/parler-tts-mini-v1
  • Torch dtype: torch.bfloat16
  • Torch device: cuda:0
  • Attention: eager

mjaniec2013 avatar Aug 27 '24 00:08 mjaniec2013