Real-Time-Voice-Cloning: synthesizer params vary for same input audio and text?

If I run demo_cli.py on the same audio and text multiple times, I see variation in the synthesizer output. For example:

Synthesizing the waveform: {| ████████████████ 57000/57600 | Batch Size: 6 | Gen Rate: 5.1kHz | }float64

Synthesizing the waveform: {| ████████████████ 47500/48000 | Batch Size: 5 | Gen Rate: 4.3kHz | }float64

Synthesizing the waveform: {| ████████████████ 47500/48000 | Batch Size: 5 | Gen Rate: 4.2kHz | }float64

Synthesizing the waveform: {| ████████████████ 57000/57600 | Batch Size: 6 | Gen Rate: 5.2kHz | }float64

The audio and text are the same, but the output length varies. Do you know what the reason could be?

May 05 '22 12:05 ayush431

The output varies because dropout is used in inference, in the encoder and decoder prenets. Dropout causes some tensor elements to be zeroed out at random. Its purpose is to help the model generalize in training, but as the Tacotron authors explain, it is preserved for inference to introduce some variation in the output. For completely deterministic output, use the --seed option (it causes the random number generator to be initialized to the same state when generating each time).
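As a toy illustration of the point above (not the repository's code), the sketch below applies dropout to a fixed list of values using only the Python standard library. The names `dropout` and `run` are hypothetical. It only shows that reusing a seed makes the random mask, and therefore the result, repeatable.

```python
import random

def dropout(values, p, rng):
    """Zero each element with probability p; scale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() >= p else 0.0 for v in values]

def run(seed=None):
    # seed=None lets Python pick a seed from system entropy, so the mask
    # can differ between runs; a fixed seed gives the same mask every time.
    rng = random.Random(seed)
    activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    return dropout(activations, p=0.5, rng=rng)

seeded_a = run(seed=0)
seeded_b = run(seed=0)
print(seeded_a == seeded_b)  # True: same seed, same dropout mask
```

Calling `run()` without a seed may return a different list on each call, which is the same kind of run-to-run variation described above.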

May 06 '22 05:05 raccoonML

Can you tell me how to use this --seed option?

May 06 '22 06:05 ayush431

It's a command line argument for demo_cli.py and demo_toolbox.py. You also need to specify the value of the seed. For example:

python demo_cli.py --seed 0
python demo_toolbox.py --seed 0

May 06 '22 06:05 raccoonML

Ok thank you so much

May 06 '22 06:05 ayush431

After providing the seed option, the Gen Rate still varies for the same audio and text:

Synthesizing the waveform: {| ████████████████ 171000/172800 | Batch Size: 18 | Gen Rate: 3.4kHz | }float64

Synthesizing the waveform: {| ████████████████ 171000/172800 | Batch Size: 18 | Gen Rate: 4.6kHz | }float64

Synthesizing the waveform: {| ████████████████ 171000/172800 | Batch Size: 18 | Gen Rate: 4.2kHz | }float64

May 06 '22 08:05 ayush431

Notice how the synthesized output is now identical in each case, with a length of 172800 timesteps. Some variation in inference speed is normal, and does not affect output.
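One way to go beyond comparing lengths is to check whether two saved output files are byte-for-byte identical. The following is a minimal sketch under assumptions: the helper names are hypothetical, the file paths are placeholders you would supply, and it assumes both files were written the same way so that identical audio yields identical bytes.

```python
import hashlib
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def same_output(path_a, path_b):
    """True if both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)

# Hypothetical usage, with two files saved from separate seeded runs:
# same_output("run1.wav", "run2.wav")
```

The Gen Rate shown in the progress bar is only a speed measurement, so it is not part of this comparison.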

May 06 '22 09:05 raccoonML

When providing the seed option, how do we know which value gives the most correct output?

May 06 '22 09:05 ayush431

You don't, currently. It's like Minecraft seeds: you just have to try your luck.

May 25 '22 20:05 TrycsPublic
