
[Bug] Fine tuned XTTS v2 produces strange sounds for short text

Open ukemamaster opened this issue 1 year ago • 20 comments

Describe the bug

I have fine-tuned the XTTS v2 model on my own data, which contains both long and short audios (see the following histogram of durations in seconds on the x-axis; the labels 'old' and 'new' mark the two datasets with long and short audios, respectively).

[Histogram of audio durations: data_es_mix_hist]

But the model produces strange sounds for texts of 1-2 words, as in the following two examples for text='hola':

https://github.com/coqui-ai/TTS/assets/59258087/9e734e4b-3954-4adf-9919-7af42c8a28ad

https://github.com/coqui-ai/TTS/assets/59258087/f2f4b964-e1cd-4986-9f4c-d082a0a53d10

It seems like the model tries to produce at least 3 seconds of audio even if the text is very short, and so it appends meaningless sounds to the spoken word from the text.

@erogol Is there any way to avoid this behavior, or any parameter (maybe in the model args) to control it? There are gpt_start_audio_token and gpt_stop_audio_token parameters in the TTS.tts.models.xtts.XttsArgs() class, but I am not sure what impact these parameters have.
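
For reference, a minimal inference sketch showing where the decoding knobs are exposed on the model; the checkpoint paths and the reference wav are placeholders, and length_penalty follows HF generate semantics, so it may have no effect with the default sampling-based decoding:

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load a fine-tuned checkpoint (placeholder paths).
config = XttsConfig()
config.load_json("/path/to/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/checkpoint/", eval=True)
model.cuda()

# Conditioning latents from a reference clip of the target speaker.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference.wav"]
)

out = model.inference(
    "hola",
    "es",
    gpt_cond_latent,
    speaker_embedding,
    length_penalty=1.0,      # passed through to HF generate
    repetition_penalty=2.0,
)
# out["wav"] holds the generated waveform at 24 kHz.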

To Reproduce

N/A

Expected behavior

Should produce short audio for short text.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.23.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

ukemamaster avatar Jan 15 '24 11:01 ukemamaster

I tried several times to re-cut the data into clips ranging from 0.5 s to 20 s, guaranteeing alignment with the corresponding text. But nothing improved. There might be a difference between the model args in the training recipe and those in the already trained model provided.

@erogol Can you please make sure the model args provided in the training recipe are the same as in your own trained model?

ukemamaster avatar Jan 16 '24 09:01 ukemamaster

Same issue.

bensonbs avatar Jan 17 '24 03:01 bensonbs

@bensonbs Have you fine-tuned the xtts-v2 model on your own dataset? Can you share a histogram of your dataset's audio lengths? Have you tried modifying the training code or model args to avoid this?

ukemamaster avatar Jan 17 '24 08:01 ukemamaster

Same issue.

insomnia777 avatar Feb 06 '24 21:02 insomnia777

Same issue. The pre-trained XTTS v2 produces extra speech after the intended text 10-20% of the time.

kaveenkumar avatar Feb 29 '24 15:02 kaveenkumar

Same issue. The pretrained XTTS v2 generates extra speech randomly.

peterliu2023 avatar Apr 10 '24 05:04 peterliu2023

I have implemented a Direct Preference Optimization (DPO)-style loss in TTS/tts/layers/xtts/gpt.py to improve the model's generalization and robustness. This implementation aims to address the strange sounds produced for short text inputs. With the DPO loss, the model is expected to generate more consistent and natural-sounding audio, even for short text sequences.

Code Snippet: TTS/tts/layers/xtts/gpt.py

import torch.nn.functional as F  # module-level import; gpt.py may already have it

# First forward pass: the standard text/mel logits used for the usual CE losses.
text_logits, mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

# Second forward pass over identical inputs. Dropout is active during
# training, so these "reject" logits differ stochastically from the first pass.
reject_text_logits, reject_mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

# Soft targets from the first pass. Note: F.cross_entropy puts the class
# dimension at dim 1, so if get_logits returns (B, vocab, T) the softmax dim
# here must match that; verify the shapes in your version. One may also want
# to .detach() the targets to treat the first pass as a fixed reference.
text_probs = F.softmax(text_logits, dim=-1)
mel_probs = F.softmax(mel_logits, dim=-1)

# Soft-target cross entropy (requires PyTorch >= 1.10) between the two passes.
loss_text_dpo = F.cross_entropy(reject_text_logits, text_probs)
loss_mel_dpo = F.cross_entropy(reject_mel_logits, mel_probs)

Code Snippet: TTS/tts/layers/xtts/trainer/gpt_trainer.py

        loss_dict["loss_text_ce"] = loss_text * self.args.gpt_loss_text_ce_weight
        loss_dict["loss_mel_ce"] = loss_mel * self.args.gpt_loss_mel_ce_weight
        loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_ce_weight
        loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_ce_weight
        loss_dict["loss"] = loss_dict["loss_text_ce"] + loss_dict["loss_mel_ce"] + loss_dict["loss_text_dpo"] + loss_dict["loss_mel_dpo"]
        
  • VRAM usage and training time comparison:
    • Without DPO loss: VRAM usage X GB; training time per epoch Y minutes
    • With DPO loss: VRAM usage 2X GB; training time per epoch 2Y minutes

bensonbs avatar Apr 12 '24 02:04 bensonbs

Can you give me an explanation? And how to try it?

insomnia777 avatar Apr 13 '24 23:04 insomnia777

When the GPT-2 model generates shorter sentences, it sometimes fails to produce the [STOP] token at the right point, so peculiar sounds end up in the generated content. These sounds are not explicitly guided, so they can differ from one generation to the next. To address this, during training I compare the outputs of two generations produced under the same conditions; whether both generations contain strange sounds or only one of them does, the model receives a penalty. This encourages it to avoid generating incoherent random content.

For the method, refer to the modifications in TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py above. I am still testing which loss function is more stable: compared to cross entropy, MSE eliminates the abnormal sounds more reliably, but I am not sure whether it is theoretically correct.

This method can only be used during fine-tuning, and when using it, make sure that your fine-tuning dataset includes enough short audio files.
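
For reference, a minimal sketch of the MSE variant mentioned above, reusing the text/mel logits from the two forward passes in the gpt.py snippet; detaching the first pass as a fixed reference is an assumption, not part of the posted code:

# Hypothetical MSE form of the same two-pass consistency penalty.
loss_text_dpo = F.mse_loss(
    F.softmax(reject_text_logits, dim=-1),
    F.softmax(text_logits, dim=-1).detach(),  # first pass as reference (assumption)
)
loss_mel_dpo = F.mse_loss(
    F.softmax(reject_mel_logits, dim=-1),
    F.softmax(mel_logits, dim=-1).detach(),
)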

bensonbs avatar Apr 15 '24 07:04 bensonbs


Wouldn't it be easier to impose a penalty on the length of the generated sequence, based on median characters-per-second data?
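
As an illustration only, such a check could look like the sketch below, e.g. for flagging or retrying generations whose length deviates too far from what the character count predicts; the code-frame rate, the median chars-per-second value, and the function name are all hypothetical and would have to be measured on the training corpus:

import torch

def length_deviation(text_lens, mel_code_lens,
                     codes_per_second=21.5,         # hypothetical GPT code-frame rate
                     median_chars_per_second=15.0,  # hypothetical corpus statistic
                     tolerance=0.3):
    # Expected number of audio codes from character count and speaking rate.
    expected = text_lens.float() / median_chars_per_second * codes_per_second
    # Relative deviation beyond a tolerance band; 0 means the length looks plausible.
    rel_dev = (mel_code_lens.float() - expected).abs() / expected
    return torch.clamp(rel_dev - tolerance, min=0.0)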

insomnia777 avatar Apr 15 '24 14:04 insomnia777


Can you share some samples with the DPO loss?

tuanh123789 avatar Jun 18 '24 09:06 tuanh123789

@bensonbs Thank you for your clear explanation. Could you please share some samples after applying DPO, and comment on the audio quality?

saiful9379 avatar Jul 05 '24 11:07 saiful9379

Same issue.

anhnh2002 avatar Jul 27 '24 16:07 anhnh2002

Hi everybody, I found the optimal way to fix this issue. Just fine-tune the DVAE with your data :D
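
For anyone trying this, a heavily simplified sketch of what DVAE fine-tuning could look like; the DiscreteVAE hyperparameters below mirror commonly shared XTTS v2 settings but should be verified against your checkpoint, and the data loader and the forward return signature are assumptions:

import torch
from TTS.tts.layers.xtts.dvae import DiscreteVAE

# Hyperparameters believed to match the released XTTS v2 DVAE; verify first.
dvae = DiscreteVAE(
    channels=80, normalization=None, positional_dims=1,
    num_tokens=1024, codebook_dim=512, hidden_dim=512,
    num_resnet_blocks=3, kernel_size=3, num_layers=2,
    use_transposed_convs=False,
)
dvae.load_state_dict(torch.load("dvae.pth"), strict=False)  # placeholder path
dvae.cuda().train()
opt = torch.optim.Adam(dvae.parameters(), lr=1e-5)

for mel in mel_loader:  # hypothetical loader yielding (B, 80, T) mel batches
    opt.zero_grad()
    # forward() is expected to return reconstruction and commitment terms;
    # check the actual return values in your TTS version.
    recon_loss, commitment_loss, _ = dvae(mel.cuda())
    loss = recon_loss.mean() + commitment_loss.mean()
    loss.backward()
    opt.step()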

tuanh123789 avatar Jul 30 '24 08:07 tuanh123789


Hello @tuanh123789, can you be more specific?

nvtinh368 avatar Aug 17 '24 10:08 nvtinh368