
[Bug] Synthesizer.py does not compute speaker_embedding for inference on fine-tuned YourTTS

Open RedSoutherly opened this issue 2 years ago • 5 comments

Describe the bug

I am using a YourTTS model which has been fine-tuned on a VCTK-format dataset with a single speaker. This model requires that the speaker embeddings be given as d_vectors for inference on a learned speaker. However, the following code in TTS/utils/synthesizer.py

    # handle Neon models with single speaker.
    if len(self.tts_model.speaker_manager.name_to_id) == 1:

causes execution to skip the check for use_d_vector_file in the following elif block. As a result, speaker_embedding is left as None, and the program later crashes when a conv1d layer receives None as its input.
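
For reference, a simplified sketch of the surrounding speaker-handling logic in Synthesizer.tts() (based on TTS 0.10.x; details abbreviated). When the single-speaker branch matches, the elif that computes the d-vector is never reached:

    speaker_embedding = None
    speaker_id = None
    # handle Neon models with single speaker.
    if len(self.tts_model.speaker_manager.name_to_id) == 1:
        speaker_id = list(self.tts_model.speaker_manager.name_to_id.values())[0]
    elif speaker_name and isinstance(speaker_name, str):
        if self.tts_config.use_d_vector_file:
            # d-vector branch: skipped whenever the branch above matched
            speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
                speaker_name, num_samples=None, randomize=False
            )
            speaker_embedding = np.array(speaker_embedding)[None, :]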

This is fixed for me if I copy the speaker_embedding calculation into the first if statement, i.e.

    # handle Neon models with single speaker.
    if len(self.tts_model.speaker_manager.name_to_id) == 1:
        speaker_id = list(self.tts_model.speaker_manager.name_to_id.values())[0]
        if self.tts_config.use_d_vector_file:
            # get the average speaker embedding from the saved d_vectors.
            speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
                speaker_name, num_samples=None, randomize=False
            )
            speaker_embedding = np.array(speaker_embedding)[None, :]  # [1 x embedding_dim]
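
For context on why that None matters: further down in the same tts() method, the embedding is forwarded to synthesis() as d_vector (a simplified sketch of the call in TTS 0.10.x; some arguments omitted):

    # if speaker_embedding is still None here, the VITS conditioning
    # tensor g ends up None and conv1d() raises the TypeError below.
    outputs = synthesis(
        model=self.tts_model,
        text=sen,
        CONFIG=self.tts_config,
        use_cuda=self.use_cuda,
        speaker_id=speaker_id,
        d_vector=speaker_embedding,
        language_id=language_id,
    )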

To Reproduce

Use the train_yourtts.py recipe to fine-tune on a VCTK-format dataset that has only a single speaker. I used the experiment 1 checkpoint of YourTTS as a restore point, since it was configured for a single language.

Use the tts command to perform inference using the custom model:

tts --text "Hello" --out_path test.wav --model_path <model-path>/best_model.pth --config_path <model-path>/config.json --speaker_idx <speaker_name>

See the following error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/coqui-local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/TTS/TTS/bin/synthesize.py", line 357, in main
    wav = synthesizer.tts(
  File "/home/ubuntu/TTS/TTS/utils/synthesizer.py", line 287, in tts
    outputs = synthesis(
  File "/home/ubuntu/TTS/TTS/tts/utils/synthesis.py", line 216, in synthesis
    outputs = run_model_torch(
  File "/home/ubuntu/TTS/TTS/tts/utils/synthesis.py", line 50, in run_model_torch
    outputs = _func(
  File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/TTS/TTS/tts/models/vits.py", line 1162, in inference
    o = self.waveform_decoder((z * y_mask)[:, :, : self.max_inference_len], g=g)
  File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/TTS/TTS/vocoder/models/hifigan_generator.py", line 250, in forward
    o = o + self.cond_layer(g)
  File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/miniconda3/envs/coqui-local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple, tuple, tuple, int)
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple, tuple, tuple, int)

Expected behavior

The expected behaviour is the regular synthesis of audio using the model.

I can achieve this behaviour either by commenting out the if statement intended for single-speaker Neon models (see the sketch below), or by adding the speaker_embedding calculation to that if statement as shown above.
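
For reference, the commenting-out workaround amounts to disabling the Neon branch and promoting the following elif to a plain if, so the d-vector code runs (a sketch against TTS 0.10.x, not a tested patch):

    # handle Neon models with single speaker.
    # if len(self.tts_model.speaker_manager.name_to_id) == 1:
    #     speaker_id = list(self.tts_model.speaker_manager.name_to_id.values())[0]
    if speaker_name and isinstance(speaker_name, str):  # was: elif
        if self.tts_config.use_d_vector_file:
            # get the average speaker embedding from the saved d_vectors.
            speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
                speaker_name, num_samples=None, randomize=False
            )
            speaker_embedding = np.array(speaker_embedding)[None, :]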

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA GeForce GTX 1070"
        ],
        "available": true,
        "version": "11.7"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.13.1+cu117",
        "TTS": "0.10.2",
        "numpy": "1.23.5"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.9",
        "version": "#1 SMP Fri Jan 20 09:54:01 NZDT 2023"
    }
}

Additional context

No response

RedSoutherly avatar Feb 01 '23 23:02 RedSoutherly

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale[bot] avatar Mar 04 '23 01:03 stale[bot]

I got the same bug. Did it get resolved?

offside609 avatar Mar 05 '23 19:03 offside609

You don't need a speaker encoder for a single-speaker model. You can disable it and just restore the model to fine-tune.
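
A minimal sketch of what disabling it could look like, assuming the VitsArgs flag names used by the train_yourtts.py recipe (verify against your installed TTS version):

    from TTS.tts.models.vits import VitsArgs

    model_args = VitsArgs(
        use_speaker_embedding=False,        # no learned speaker-ID embedding table
        use_d_vector_file=False,            # no external d-vector file
        use_speaker_encoder_as_loss=False,  # no speaker-consistency loss term
    )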

erogol avatar Mar 06 '23 08:03 erogol

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our discussion channels.

stale[bot] avatar Apr 05 '23 19:04 stale[bot]

Was this issue solved? I have the same problem.

  1. Took the pretrained YourTTS model: multilingual-multi-dataset-your_tts
  2. Fine-tuned on a new speaker for 10k steps
  3. Stopped training just to check some progress
  4. Ran the tts command from the initial post, but with just "text", "out_path", "model_path", and "config_path" as arguments, pointing model_path & config_path to the newly fine-tuned 10k-step model (see the sketch after this list)
  5. Got the same error as the initial post mentions
  • Also tried with speakers_file_path & speaker_idx, with the same error.
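
For concreteness, the step-4 invocation would look roughly like this (paths are placeholders):

tts --text "Hello" --out_path test.wav --model_path <model-path>/best_model.pth --config_path <model-path>/config.json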

This is the error:

TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:
 * (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups)
      didn't match because some of the arguments have invalid types: (NoneType, Parameter, Parameter, tuple of (int,), tuple of (int,), tuple of (int,), int)

danablend avatar Apr 12 '23 16:04 danablend

I can confirm @RedSoutherly's solution works great.

wonkothesanest avatar May 04 '23 04:05 wonkothesanest

@erogol Sorry about my git mixup, I had to resubmit the fix. Thank you for closing out this issue :)

wonkothesanest avatar May 04 '23 19:05 wonkothesanest

ONNX inference not working for multi-speaker

Hi everyone, I tried converting a multi-speaker model to ONNX, but it seems inference does not support it:

TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:

  • (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups) didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)
  • (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups) didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)

This is my code:

from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import Vits

config = VitsConfig()
config.load_json("/content/drive/MyDrive/Technical/TTS/TTS API Files/elsie_config.json")
vits = Vits.init_from_config(config)
vits.load_checkpoint(config, "/content/drive/MyDrive/Technical/TTS/TTS API Files/checkpoint_1054000.pth")

vits.export_onnx()
vits.load_onnx("coqui_vits.onnx")

Please, is there a solution to fixing this bug? Thank you.

Mahlon10 avatar Sep 05 '23 17:09 Mahlon10

Hi @RedSoutherly. I tried your solution by making changes to the vits model:

    # handle Neon models with single speaker.
    if len(self.tts_model.speaker_manager.name_to_id) == 1:
        speaker_id = list(self.tts_model.speaker_manager.name_to_id.values())[0]
        if self.tts_config.use_d_vector_file:
            # get the average speaker embedding from the saved d_vectors.
            speaker_embedding = self.tts_model.speaker_manager.get_mean_embedding(
                speaker_name, num_samples=None, randomize=False
            )
            speaker_embedding = np.array(speaker_embedding)[None, :]  # [1 x embedding_dim]

However, I still came across this bug:

TypeError: conv1d() received an invalid combination of arguments - got (NoneType, Parameter, Parameter, tuple, tuple, tuple, int), but expected one of:

  • (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, tuple of ints padding, tuple of ints dilation, int groups) didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)
  • (Tensor input, Tensor weight, Tensor bias, tuple of ints stride, str padding, tuple of ints dilation, int groups) didn't match because some of the arguments have invalid types: (!NoneType!, !Parameter!, !Parameter!, !tuple of (int,)!, !tuple of (int,)!, !tuple of (int,)!, int)

I would like to know if there is a way out, please. Thanks.

Mahlon10 avatar Sep 06 '23 10:09 Mahlon10

I got a similar kind of error, but after close observation I realized that I had not given speaker_idx as an input argument. Similarly, check whether all input arguments were given correctly (see the example below).
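
For reference, passing the speaker explicitly (the --list_speaker_idxs flag, assuming your TTS version supports it, prints the valid names first):

# list the speaker names known to the model
tts --model_path <model-path>/best_model.pth --config_path <model-path>/config.json --list_speaker_idxs

# synthesize with an explicit speaker
tts --text "Hello" --out_path test.wav --model_path <model-path>/best_model.pth --config_path <model-path>/config.json --speaker_idx <speaker_name>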

pepetikesavasiddhardha avatar Nov 03 '23 15:11 pepetikesavasiddhardha