[Bug] Synthesizer.py does not compute speaker_embedding for inference on fine-tuned YourTTS
Describe the bug
I am using a YourTTS model which has been fine-tuned on a VCTK-format dataset with a single speaker. This model requires that the speaker embeddings be given as d_vectors for inference on a learned speaker. However, the following code in TTS/utils/synthesizer.py
# handle Neon models with single speaker.
if len(self.tts_model.speaker_manager.name_to_id) == 1:
causes execution to skip the check for 'use_d_vector_file' in the following elif block. As a result, speaker_embedding is left as None, and the program later crashes when a conv1d layer receives None as its input.
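For context, the surrounding branch structure looks roughly like the following paraphrase (simplified and hypothetical, not a verbatim copy of synthesizer.py); with exactly one entry in name_to_id, execution takes the first branch and never reaches the d-vector lookup:

# Minimal, hypothetical paraphrase of the branch structure in synthesizer.tts()
def pick_speaker(name_to_id, use_d_vector_file, speaker_name, get_mean_embedding):
    speaker_id, speaker_embedding = None, None
    if len(name_to_id) == 1:
        # Neon single-speaker branch: sets speaker_id only, so
        # speaker_embedding stays None even when d-vectors are required
        speaker_id = list(name_to_id.values())[0]
    elif speaker_name:
        if use_d_vector_file:
            # d-vector branch that a single-speaker model never reaches
            speaker_embedding = get_mean_embedding(speaker_name)
    return speaker_id, speaker_embedding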
This is fixed for me if I copy the speaker_embedding calculation into the first if statement, i.e.:
# handle Neon models with single speaker.
if len(self.tts_model.speaker_manager.name_to_id) == 1:
    speaker_id = list(self.tts_model.speaker_manager.name_to_id.values())[0]
    if self.tts_config.use_d_vector_file:
        # get the average speaker embedding from the saved d_vectors.
        speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
            speaker_name, num_samples=None, randomize=False
        )
        speaker_embedding = np.array(speaker_embedding)[None, :]  # [1 x embedding_dim]
To Reproduce
Use the train_yourtts.py recipe to fine-tune on a VCTK-format dataset that has only a single speaker. I used the experiment 1 checkpoint of YourTTS as a restore point, since it was configured for a single language.
Use the tts command to perform inference using the custom model:
tts --text "Hello" --out_path test.wav --model_path <model-path>/best_model.pth --config_path <model-path>/config.json --speaker_idx <speaker_name>
See the following error:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/coqui-local/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/ubuntu/TTS/TTS/bin/synthesize.py", line 357, in main
wav = synthesizer.tts(
File "/home/ubuntu/TTS/TTS/utils/synthesizer.py", line 287, in tts
outputs = synthesis(
File "/home/ubuntu/TTS/TTS/tts/utils/synthesis.py", line 216, in synthesis
outputs = run_model_torch(
File "/home/ubuntu/TTS/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
outputs = _func(
File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ubuntu/TTS/TTS/tts/models/vits.py", line 1162, in inference
o = self.waveform_decoder((z * y_mask)[:, :, : self.max_inference_len], g=g)
File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/TTS/TTS/vocoder/models/hifigan_generator.py", line 250, in forward
o = o + self.cond_layer(g)
File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple, tuple, tuple, int)
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple, tuple, tuple, int)
Expected behavior
The expected behaviour is the regular synthesis of audio using the model.
I can achieve this behaviour either by commenting out the if statement intended for single-speaker Neon models, or by adding the speaker_embedding calculation to that if statement as shown above.
Logs
No response
Environment
{
"CUDA": {
"GPU": [
"NVIDIA GeForce GTX 1070"
],
"available": true,
"version": "11.7"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.13.1+cu117",
"TTS": "0.10.2",
"numpy": "1.23.5"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.10.9",
"version": "#1 SMP Fri Jan 20 09:54:01 NZDT 2023"
}
}
Additional context
No response
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
I got the same bug. Did it get resolved?
You don't need a speaker encoder for a single-speaker model. You can disable it and just restore the model to fine-tune.
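If it helps, disabling the speaker paths before fine-tuning might look like the sketch below; the flag names come from VitsArgs in recent Coqui TTS releases and are assumptions here, so verify them against your installed version:

# Hedged sketch: verify these VitsArgs flag names against your TTS version.
from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig()
config.model_args.use_speaker_embedding = False        # drop the learned speaker table
config.model_args.use_d_vector_file = False            # drop external d-vectors
config.model_args.use_speaker_encoder_as_loss = False  # drop the speaker-consistency loss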
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.
Was this issue solved? I have the same problem.
- Took the pretrained YourTTS model: multilingual-multi-dataset-your_tts
- Fine-tuned on a new speaker for 10k steps
- Stopped training just to check some progress
- Ran the tts command like in the initial post here, but with just "text", "out_path", "model_path", and "config_path" as arguments, pointing model_path and config_path to the newly fine-tuned model at 10k steps.
- Got the same error as the initial post mentions
- Also tried with speakers_file_path & speaker_idx, with the same error:
This is the error:
TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of: (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups) didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple of (int,), tuple of (int,), tuple of (int,), int)
I can confirm @RedSoutherly's solution works great.
@erogol Sorry about my git mixup, I had to resubmit the fix. Thank you for closing out this issue :)
ONNX inference not working for multi-speaker
Hi everyone, I tried converting a multi-speaker model to ONNX, but it seems the inference feature does not support it:
TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)
This is my code:
config = VitsConfig()
config.load_json("/content/drive/MyDrive/Technical/TTS/TTS API Files/elsie_config.json")
vits = Vits.init_from_config(config)
vits.load_checkpoint(config, "/content/drive/MyDrive/Technical/TTS/TTS API Files/checkpoint_1054000.pth")
vits.export_onnx()
vits.load_onnx("coqui_vits.onnx")
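For completeness, here is roughly how I would drive the exported model after load_onnx(); the tokenizer attribute and the inference_onnx() call are assumptions from recent TTS versions, so check TTS/tts/models/vits.py in your install:

# Hedged usage sketch, continuing from the snippet above.
import numpy as np

text_ids = np.asarray(vits.tokenizer.text_to_ids("Hello world"), dtype=np.int64)[None, :]
x_lengths = np.array([text_ids.shape[1]], dtype=np.int64)
# Note: there is no obvious way here to pass a speaker embedding, which is
# exactly what fails for multi-speaker models (g ends up None in conv1d).
wav = vits.inference_onnx(text_ids, x_lengths)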
Is there a solution to this bug, please? Thank you.
Hi @RedSoutherly. I tried your solution by making changes to the vits model:
# handle Neon models with single speaker.
if len(self.tts_model.speaker_manager.name_to_id) == 1:
    speaker_id = list(self.tts_model.speaker_manager.name_to_id.values())[0]
    if self.tts_config.use_d_vector_file:
        # get the average speaker embedding from the saved d_vectors.
        speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
            speaker_name, num_samples=None, randomize=False
        )
        speaker_embedding = np.array(speaker_embedding)[None, :]  # [1 x embedding_dim]
However, I still came across this bug:
TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)
* (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)
I would like to know if there is a way around this, please. Thanks.
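One hypothesis worth testing first is whether the d-vector branch in the patch is reachable at all; a quick probe, assuming a Synthesizer instance named synthesizer:

# Hypothetical probe: if use_d_vector_file prints False, the patched
# d-vector branch is never entered and speaker_embedding stays None.
print(synthesizer.tts_config.use_d_vector_file)
print(len(synthesizer.tts_model.speaker_manager.name_to_id))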
I got a similar kind of error, but after closer observation I realized that I had not given speaker_idx as an input argument. Similarly, check whether all your input arguments were given correctly.
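For example, a complete invocation with all speaker arguments might look like this (the speakers file name is a placeholder; adjust it to whatever your training run produced):

tts --text "Hello" --out_path test.wav --model_path <model-path>/best_model.pth --config_path <model-path>/config.json --speakers_file_path <model-path>/speakers.pth --speaker_idx <speaker_name>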