
Manual Control of Phoneme Durations

Open hypnaceae opened this issue 3 years ago • 5 comments

I'd like to supply the synthesiser with custom phoneme durations (i.e. the start and end time of each phoneme), in other words bypassing phoneme duration prediction and replacing it with my own parameters. Is this possible in this implementation?

hypnaceae avatar Aug 03 '21 17:08 hypnaceae

Yes, you can!

https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/model/fastspeech2.py#L43-L58

Set d_targets to your custom durations (the default value is None, in which case the model predicts the durations itself).

leminhnguyen avatar Aug 04 '21 03:08 leminhnguyen

Great, thanks! Can I also ask, what data type does this variable accept? Just a list of phoneme durations? I have tried a variety of types (for example, d_targets = [0.5, 0.1, 0.25] as durations for the phonemes "K AE1 T", a dict mapping phoneme to duration, etc.) but none have worked. What's the exact usage here? Thanks again.

hypnaceae avatar Aug 04 '21 14:08 hypnaceae

@hypnaceae

The training data is a good example to follow. During training, the ground-truth duration, pitch, and energy values are passed to d_targets, p_targets, and e_targets. So please inspect the preprocessed files (ending in .npy) for more details.

d_targets must be an int array in which each element gives the length of the corresponding phoneme, measured in mel-spectrogram frames. E.g. d_targets = [3, 4, 5] for the phoneme sequence "K AE1 T".
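To make the expected shape concrete, here is a minimal sketch (not from the repo; the batch dimension and LongTensor dtype are assumptions based on the model's batched forward pass):

```python
import torch

# Durations in mel-spectrogram frames for the phoneme sequence "K AE1 T".
# The model processes batches, so wrap the sequence in an outer list,
# giving a 2-D tensor of shape (batch_size=1, num_phonemes=3).
d_targets = torch.tensor([[3, 4, 5]], dtype=torch.long)

# The total number of frames in the output mel-spectrogram is the sum
# of the per-phoneme durations.
mel_len = int(d_targets.sum())
print(d_targets.shape, mel_len)  # torch.Size([1, 3]) 12
```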

leminhnguyen avatar Aug 05 '21 02:08 leminhnguyen

Thanks. I set d_targets (fastspeech2.py, line 54) to your example and I'm getting the following traceback.

>>synthesize.py --text "cat" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

Removing weight norm...
Raw Text Sequence: cat
Phoneme Sequence: {K AE1 T}
Traceback (most recent call last):
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 214, in <module>
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 99, in synthesize
    d_control=duration_control
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\fastspeech2.py", line 91, in forward
    d_control,
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 129, in forward
    x, mel_len = self.length_regulator(x, duration_target, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 194, in forward
    output, mel_len = self.LR(x, duration, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 171, in LR
    expanded = self.expand(batch, expand_target)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 186, in expand
    expand_size = predicted[i].item()
TypeError: 'int' object is not subscriptable

It looks like predicted receives the first element of the d_targets array (3 in this case) rather than the array itself.

To clarify: I want to specify the number of mel-spectrogram frames on a per-phoneme basis at synthesis time. I am also not training my own models (just using the pretrained LJSpeech model), so I can't see any .npy files.
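Judging from the traceback alone (this is a sketch of the failure mode, not the repo's exact code), the length regulator iterates over the batch dimension of d_targets. A flat Python list like [3, 4, 5] therefore yields bare ints when iterated, and the subsequent predicted[i] indexes into an int, raising exactly this TypeError. Passing a 2-D LongTensor with a batch dimension avoids it:

```python
import torch

# Failure mode: iterating a flat list yields plain ints, and
# subscripting an int raises "'int' object is not subscriptable".
durations_list = [3, 4, 5]
error_msg = None
try:
    for expand_target in durations_list:  # expand_target is 3, 4, 5 ...
        expand_target[0]                  # int is not subscriptable
except TypeError as exc:
    error_msg = str(exc)
print(error_msg)

# Assumed fix: a 2-D LongTensor, so iteration walks the batch
# dimension and each element is a 1-D tensor of per-phoneme durations.
d_targets = torch.tensor([[3, 4, 5]], dtype=torch.long)
for expand_target in d_targets:           # 1-D tensor, length 3
    sizes = [expand_target[i].item() for i in range(len(expand_target))]
print(sizes)  # [3, 4, 5]
```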

Thanks again, you've been a big help thus far.

hypnaceae avatar Aug 05 '21 15:08 hypnaceae

Has anyone worked on this since?

debasishaimonk avatar Jul 19 '23 07:07 debasishaimonk