FastSpeech2
Manual Control of Phoneme Durations
I'd like to supply the synthesiser with custom phoneme durations (i.e. the start and end time of each phoneme), in other words bypassing the duration predictor and replacing its output with my own values. Is it possible to do this in this implementation?
Yes, you can!
https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/model/fastspeech2.py#L43-L58
Set d_targets to your custom durations (the default value is None, in which case the model will predict the durations).
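For instance, a call might look like this (a minimal sketch, not the repo's exact code; it assumes the forward signature at the permalink above, with speakers, texts, src_lens, and max_src_len prepared the same way synthesize.py prepares them):

```python
import torch

# Custom per-phoneme durations, batched: shape (batch, num_phonemes).
# The values count mel-spectrogram frames, as explained further down.
custom_durations = torch.tensor([[3, 4, 5]], dtype=torch.long)

with torch.no_grad():
    output = model(
        speakers,          # prepared as in synthesize.py
        texts,
        src_lens,
        max_src_len,
        d_targets=custom_durations,  # bypasses the duration predictor
    )
```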
Great, thanks! Can I also ask: what data type does this variable accept? Just a list of phoneme durations? I have tried a variety of types (for example, d_targets = [0.5, 0.1, 0.25] as durations for the phonemes "K AE1 T", a dict of phoneme:duration pairs, etc.) but none have worked. What's the exact usage here? Thanks again.
@hypnaceae
The training data is a good example to follow. During training, the ground-truth values of duration, pitch, and energy are passed as d_targets, p_targets, and e_targets. So please inspect the preprocessed files (ending with .npy) for more details.
d_targets must be an int array in which each element indicates the length (number of frames in the mel-spectrogram) of the corresponding phoneme, e.g. d_targets = [3, 4, 5] for the phoneme sequence "K AE1 T".
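If you do have preprocessed data available, inspecting one of the duration files shows the expected format (the path below is illustrative; the actual filenames depend on your preprocessing layout):

```python
import numpy as np

# Illustrative path; adjust to wherever your preprocessed data lives.
d = np.load("preprocessed_data/LJSpeech/duration/LJSpeech-duration-LJ001-0001.npy")
print(d.dtype, d.shape)  # an integer array with one frame count per phoneme
print(d[:5])
```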
Thanks. I set d_targets (fastspeech2.py, line 54) to your example and I'm getting the following traceback.
```
>>synthesize.py --text "cat" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
Removing weight norm...
Raw Text Sequence: cat
Phoneme Sequence: {K AE1 T}
Traceback (most recent call last):
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 214, in <module>
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 99, in synthesize
    d_control=duration_control
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\fastspeech2.py", line 91, in forward
    d_control,
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 129, in forward
    x, mel_len = self.length_regulator(x, duration_target, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 194, in forward
    output, mel_len = self.LR(x, duration, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 171, in LR
    expanded = self.expand(batch, expand_target)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 186, in expand
    expand_size = predicted[i].item()
TypeError: 'int' object is not subscriptable
```
It looks like predicted is taking the first value of the d_targets array (in this case 3).
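That diagnosis fits the traceback: with a plain Python list, the length regulator's per-batch loop yields bare ints, so predicted is the integer 3 and predicted[i] fails. A hedged sketch of the fix is to pass a batched integer tensor instead of a list:

```python
import torch

# Instead of d_targets = [3, 4, 5] (a plain list), pass a batched LongTensor
# so each batch element is itself an indexable sequence of durations:
d_targets = torch.tensor([[3, 4, 5]], dtype=torch.long)  # shape (1, 3)
```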
To clarify: I want to specify the number of mel-spectrogram frames on a per-phoneme basis at synthesis time. I am also not training my own models (just using the pretrained LJSpeech model), so I don't have any .npy files to inspect.
Thanks again, you've been a big help so far.
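For anyone starting from phoneme start/end times in seconds rather than frame counts, the conversion is fixed by the STFT hop size. A small sketch, assuming the LJSpeech defaults of a 22050 Hz sampling rate and a hop length of 256 samples (check preprocess.yaml for the actual values):

```python
import torch

SAMPLING_RATE = 22050  # assumed; see "sampling_rate" in preprocess.yaml
HOP_LENGTH = 256       # assumed; see "hop_length" in preprocess.yaml

def seconds_to_frames(durations_sec):
    """Convert per-phoneme durations in seconds to mel-spectrogram frame counts."""
    frames = [round(d * SAMPLING_RATE / HOP_LENGTH) for d in durations_sec]
    return torch.tensor([frames], dtype=torch.long)  # batched, as the model expects

# e.g. the 0.5 s / 0.1 s / 0.25 s durations tried above for "K AE1 T":
print(seconds_to_frames([0.5, 0.1, 0.25]))  # tensor([[43,  9, 22]])
```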
Has anyone made any further progress on this?