FastSpeech2
Manual Control of Phoneme Durations
I'd like to supply the synthesiser with custom phoneme durations (i.e. the start and end time of each phoneme), in other words bypassing the duration predictor and replacing its output with my own values. Is it possible to do this in this implementation?
Yes, you can!
https://github.com/ming024/FastSpeech2/blob/d4e79eb52e8b01d24703b2dfc0385544092958f3/model/fastspeech2.py#L43-L58
Set d_targets to your custom durations (the default value is None, in which case the model will predict the durations).
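For instance, a call might look like this (a minimal sketch, not the repo's exact code; it assumes the forward signature at the permalink above, with speakers, texts, src_lens, and max_src_len prepared the same way synthesize.py prepares them):

```python
import torch

# Custom per-phoneme durations, batched: shape (batch, num_phonemes).
# The values count mel-spectrogram frames, as explained further down.
custom_durations = torch.tensor([[3, 4, 5]], dtype=torch.long)

with torch.no_grad():
    output = model(
        speakers,          # prepared as in synthesize.py
        texts,
        src_lens,
        max_src_len,
        d_targets=custom_durations,  # bypasses the duration predictor
    )
```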
Great, thanks! Can I also ask: what data type does this variable accept? Just a list of phoneme durations? I have tried a variety of types (for example, d_targets = [0.5, 0.1, 0.25] as durations for the phonemes "K AE1 T", a dict of phoneme:duration pairs, etc.) but none have worked. What's the exact usage here? Thanks again.
@hypnaceae
The training data is a good example to follow. During training, the ground-truth values of duration, pitch, and energy are passed as d_targets, p_targets, and e_targets. So please inspect the preprocessed files (ending with .npy) for more details.
d_targets must be an int array in which each element indicates the length (number of frames in the mel-spectrogram) of the corresponding phoneme, e.g. d_targets = [3, 4, 5] for the phoneme sequence "K AE1 T".
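If you do have preprocessed data available, inspecting one of the duration files shows the expected format (the path below is illustrative; the actual filenames depend on your preprocessing layout):

```python
import numpy as np

# Illustrative path; adjust to wherever your preprocessed data lives.
d = np.load("preprocessed_data/LJSpeech/duration/LJSpeech-duration-LJ001-0001.npy")
print(d.dtype, d.shape)  # an integer array with one frame count per phoneme
print(d[:5])
```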
Thanks. I set d_targets (fastspeech2.py, line 54) to your example and I'm getting the following traceback.
```
>>synthesize.py --text "cat" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
Removing weight norm...
Raw Text Sequence: cat
Phoneme Sequence: {K AE1 T}
Traceback (most recent call last):
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 214, in <module>
    synthesize(model, args.restore_step, configs, vocoder, batchs, control_values)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\synthesize.py", line 99, in synthesize
    d_control=duration_control
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\fastspeech2.py", line 91, in forward
    d_control,
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 129, in forward
    x, mel_len = self.length_regulator(x, duration_target, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\venv\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 194, in forward
    output, mel_len = self.LR(x, duration, max_len)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 171, in LR
    expanded = self.expand(batch, expand_target)
  File "C:\Users\asdf\PycharmProjects\FastSpeech2\model\modules.py", line 186, in expand
    expand_size = predicted[i].item()
TypeError: 'int' object is not subscriptable
```
It looks like predicted is taking the first value of the d_targets array (in this case 3).
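That diagnosis fits the traceback: with a plain Python list, the length regulator's per-batch loop yields bare ints, so predicted is the integer 3 and predicted[i] fails. A hedged sketch of the fix is to pass a batched integer tensor instead of a list:

```python
import torch

# Instead of d_targets = [3, 4, 5] (a plain list), pass a batched LongTensor
# so each batch element is itself an indexable sequence of durations:
d_targets = torch.tensor([[3, 4, 5]], dtype=torch.long)  # shape (1, 3)
```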
To clarify: I want to specify the number of mel-spectrogram frames on a per-phoneme basis at synthesis time. I am also not training my own models (just using the pretrained LJSpeech model), so I don't have any .npy files to inspect.
Thanks again, you've been a big help so far.
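For anyone starting from phoneme start/end times in seconds rather than frame counts, the conversion is fixed by the STFT hop size. A small sketch, assuming the LJSpeech defaults of a 22050 Hz sampling rate and a hop length of 256 samples (check preprocess.yaml for the actual values):

```python
import torch

SAMPLING_RATE = 22050  # assumed; see "sampling_rate" in preprocess.yaml
HOP_LENGTH = 256       # assumed; see "hop_length" in preprocess.yaml

def seconds_to_frames(durations_sec):
    """Convert per-phoneme durations in seconds to mel-spectrogram frame counts."""
    frames = [round(d * SAMPLING_RATE / HOP_LENGTH) for d in durations_sec]
    return torch.tensor([frames], dtype=torch.long)  # batched, as the model expects

# e.g. the 0.5 s / 0.1 s / 0.25 s durations tried above for "K AE1 T":
print(seconds_to_frames([0.5, 0.1, 0.25]))  # tensor([[43,  9, 22]])
```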
Has anyone made any further progress on this?