
How does one cut up a longer text so it fits into the available frames?

Open Quasimondo opened this issue 5 years ago • 6 comments

When running inference.py, texts that do not fit into n_frames get cropped from the end, so parts of the beginning are lost. It also looks like the duration of the spoken text depends on the speaker id (when using LibriTTS). Increasing n_frames to get longer outputs is limited by GPU memory, so it seems one has to split the text into sentences. I am wondering whether there is any method to estimate how many frames a given string will require?

So far my best guess is to home in by trial and error, using the returned attentions:

# if the final frame's attention still points past the first token,
# the model had not finished reading the text when frames ran out
attention = torch.cat(attentions[0]).cpu().numpy()
if attention[-1].argmax() > 0:
    print("text does not fit into available frames")

Quasimondo avatar May 15 '20 13:05 Quasimondo
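The sentence-splitting workaround mentioned above could be sketched as follows. This is my own illustration, not part of inference.py; `split_sentences` is a hypothetical helper, and the naive regex would be better replaced by a real sentence tokenizer such as nltk's `sent_tokenize`.

```python
import re

def split_sentences(text):
    """Naively split text on sentence-ending punctuation followed by
    whitespace, so each chunk can be synthesized separately."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

chunks = split_sentences("First sentence. Second one! A third?")
# each chunk is then short enough to fit its own n_frames budget
```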

It depends on the speech rate. A simple approximation is 8 frames per token.

rafaelvalle avatar May 15 '20 15:05 rafaelvalle
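The rule of thumb above can be turned into a rough estimator. This is a sketch, not code from the repo; the function name and the default headroom factor are my own assumptions.

```python
import math

FRAMES_PER_TOKEN = 8  # rule of thumb from this thread; measure for your speaker

def estimate_n_frames(n_tokens, frames_per_token=FRAMES_PER_TOKEN, headroom=1.2):
    """Rough upper bound on the mel frames needed for n_tokens text tokens.
    headroom pads for speech-rate variation across speakers and sigma."""
    return math.ceil(n_tokens * frames_per_token * headroom)

estimate_n_frames(50)  # 50 tokens * 8 frames * 1.2 headroom -> 480
```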

Ah, good to know. And the total number of tokens is the length of the tensor after cleaning and transformation, i.e. trainset.get_text(text)?

Quasimondo avatar May 15 '20 15:05 Quasimondo

After running test sentences for all the speakers, it seems the approximation factor is closer to 6, though I am not sure how sigma and temperature further factor into this.

Quasimondo avatar May 15 '20 16:05 Quasimondo

Yes, length of the tensor after cleaning and transformation. Add some headroom to account for variation coming from your sigma value.

rafaelvalle avatar May 15 '20 16:05 rafaelvalle
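Putting the factor-of-6 observation and the headroom advice together, one could greedily pack consecutive sentences into chunks whose estimated frame count fits the available budget. Everything below (the function, the constants) is my own sketch under those assumptions, not code from the repo; note that a single sentence exceeding the budget still gets its own chunk, since it cannot be split further here.

```python
def pack_chunks(sentence_token_counts, n_frames, frames_per_token=6, headroom=1.2):
    """Greedily group consecutive sentence indices so each group's
    estimated frame count stays within the n_frames budget."""
    budget = n_frames / (frames_per_token * headroom)  # max tokens per chunk
    chunks, current, count = [], [], 0
    for i, n in enumerate(sentence_token_counts):
        if current and count + n > budget:
            chunks.append(current)
            current, count = [], 0
        current.append(i)
        count += n
    if current:
        chunks.append(current)
    return chunks

# e.g. per-sentence token counts with a 400-frame budget:
pack_chunks([40, 35, 50, 20], 400)
```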

Ah okay, I guess I have to run my estimator again with different sigmas to see how they affect the factor. Anyway, here is my preliminary lookup table, maybe it's useful: https://gist.github.com/Quasimondo/eadebf73796a10d624b9f98092a9b81f

Quasimondo avatar May 15 '20 16:05 Quasimondo

Thank you for compiling this!

rafaelvalle avatar May 15 '20 17:05 rafaelvalle