Amphion [Feature]: Can MaskGCT process Chinese zero-shot TTS?

When I run inference for Chinese TTS, I get the following error: RuntimeError: The size of tensor a (1649) must match the size of tensor b (1758) at non-singleton dimension 3

I have set the language to "zh". Could you let me know:

Does the current MaskGCT support Chinese?
If so, what did I do wrong, and how can I fix it?

Thank you very much!

Oct 29 '24 11:10 hildazzz

Hi, the current MaskGCT supports Chinese (in fact, we support six languages: en, zh, fr, de, kr, ja). Can you give me more details about the error, for example a screenshot?

Oct 29 '24 15:10 HeCheng0625

Here is the full traceback:

Traceback (most recent call last):
  File "/try/Amphion/test.py", line 120, in <module>
    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
  File "/try/Amphion/models/tts/maskgct/maskgct_utils.py", line 261, in maskgct_inference
    combine_semantic_code, _ = self.text2semantic(
  File "/root/miniforge3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/try/Amphion/models/tts/maskgct/maskgct_utils.py", line 175, in text2semantic
    predict_semantic = self.t2s_model.reverse_diffusion(
  File "/root/miniforge3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/try/Amphion/models/tts/maskgct/maskgct_t2s.py", line 292, in reverse_diffusion
    mask_embeds = self.diff_estimator(
  File "/root/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/try/Amphion/models/tts/maskgct/llama_nar.py", line 621, in forward
    layer_outputs = decoder_layer(
  File "/root/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/try/Amphion/models/tts/maskgct/llama_nar.py", line 173, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/root/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/miniforge3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/miniforge3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 378, in forward
    attn_weights = attn_weights + causal_mask
RuntimeError: The size of tensor a (1008) must match the size of tensor b (1019) at non-singleton dimension 3

This mostly happens when "zh" is used for language or target_language. It sometimes goes away when target_text is shorter. Is there a limit or preferred length for the target text in this work? Thanks for your time!

Oct 30 '24 09:10 hildazzz
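
For anyone reading the traceback above: the last frame fails because attn_weights and causal_mask cannot be broadcast together on their last dimension (1008 vs 1019). The snippet below is a minimal sketch of that broadcasting rule in plain Python, not MaskGCT or Transformers code. The helper name first_broadcast_mismatch and the full 4-D shapes are hypothetical. Only the last-dimension sizes 1008 and 1019 come from the traceback, and the (batch, heads, query_len, key_len) layout is an assumption.

```python
from typing import Optional, Sequence


def first_broadcast_mismatch(
    shape_a: Sequence[int], shape_b: Sequence[int]
) -> Optional[int]:
    """Return the index of the first incompatible dimension, scanning from
    the trailing dimension, or None if the two shapes can be broadcast.

    Hypothetical helper for illustration only.
    """
    ndim = max(len(shape_a), len(shape_b))
    # Broadcasting left-pads the shorter shape with size-1 dimensions.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    for dim in range(ndim - 1, -1, -1):
        # Two sizes are compatible if they are equal or one of them is 1.
        if a[dim] != b[dim] and a[dim] != 1 and b[dim] != 1:
            return dim
    return None


# Assumed layout: (batch, heads, query_len, key_len). The batch size, head
# count and query length here are made up. Only 1008 and 1019 in the last
# dimension are taken from the traceback.
attn_weights_shape = (1, 16, 1008, 1008)
causal_mask_shape = (1, 1, 1008, 1019)

print(first_broadcast_mismatch(attn_weights_shape, causal_mask_shape))  # 3
```

The printed index matches the "non-singleton dimension 3" in the error message. Under these assumptions, the mask was built for a sequence 11 positions longer than the one the attention scores cover, so comparing the length of the input passed to the model with the length used to build the mask may help narrow down where the two diverge.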