[Bug] Fine-tuned XTTS v2 produces strange sounds for short text
Describe the bug
I have fine-tuned the XTTS v2 model on my own data, which contains both long and short audios (the attached histogram shows clip duration in seconds on the x-axis; the labels 'old' and 'new' mark the long- and short-audio datasets, respectively).
But the model produces strange sounds for 1-2 word texts, as in the following two examples for text='hola':
https://github.com/coqui-ai/TTS/assets/59258087/9e734e4b-3954-4adf-9919-7af42c8a28ad
https://github.com/coqui-ai/TTS/assets/59258087/f2f4b964-e1cd-4986-9f4c-d082a0a53d10
It seems the model tries to produce at least ~3 seconds of audio even when the text is very short, and so it appends meaningless sounds after the spoken word.
@erogol Is there any way to avoid this behavior, or any parameter (maybe in the model args) to control it? There are gpt_start_audio_token and gpt_stop_audio_token parameters in the TTS.tts.models.xtts.XttsArgs class, but I am not sure what their impact is.
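For reference, these defaults can be inspected directly. In the 0.22.0 release they are the ids of the special tokens that mark the start and end of the audio-token sequence the GPT predicts, so changing them on an already trained checkpoint would likely break decoding rather than shorten the outputs (the values below are from that release and may differ in other versions):

    from TTS.tts.models.xtts import XttsArgs

    args = XttsArgs()
    # Special-token ids delimiting the GPT's audio-token sequence.
    print(args.gpt_start_audio_token)  # 1024 in the 0.22.0 defaults
    print(args.gpt_stop_audio_token)   # 1025 in the 0.22.0 defaults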
To Reproduce
N/A
Expected behavior
Should produce short audio for short text.
Logs
No response
Environment
{
    "CUDA": {
        "GPU": [
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.23.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}
Additional context
No response
I tried several times to re-cut the data into ranges from 0.5s to 20s, guaranteeing alignment with the corresponding text, but nothing improved. There might be a difference between the model args in the training recipe and those in the already trained model provided.
@erogol Can you please confirm that the model args provided in the training recipe match the ones used for your own trained model?
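For context, a minimal sketch of how the duration histogram and the 0.5s-20s filtering can be reproduced, assuming a flat folder of wav files (the path is a placeholder):

    import glob

    import matplotlib.pyplot as plt
    import soundfile as sf

    wavs = glob.glob("dataset/wavs/*.wav")           # placeholder path
    durations = [sf.info(w).duration for w in wavs]  # clip lengths in seconds

    plt.hist(durations, bins=50)
    plt.xlabel("duration (s)")
    plt.ylabel("count")
    plt.show()

    # Keep only clips inside the 0.5s-20s range mentioned above.
    keep = [w for w, d in zip(wavs, durations) if 0.5 <= d <= 20.0]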
Same issue.
@bensonbs Have you fine-tuned the XTTS v2 model on your own dataset? Can you share a histogram of the audio lengths in your dataset? Have you tried modifying the training code or model args to avoid this?
Same issue.
Same issue. The pre-trained XTTS v2 produces extra speech after the intended text 10-20% of the time.
Same issue. The pre-trained XTTS v2 generates extra speech randomly.
I have implemented a Diversified Perturbation Optimized (DPO) loss in TTS/tts/layers/xtts/gpt.py to improve the model's generalization and robustness. This implementation aims to address the strange sounds on short text inputs: with the DPO loss, the model is expected to generate more consistent and natural-sounding audio, even for shorter text sequences.
Code Snippet:
TTS/tts/layers/xtts/gpt.py

    # Run the forward pass twice on the same batch; with dropout active,
    # the two passes yield slightly different logits.
    text_logits, mel_logits = self.get_logits(
        text_emb,
        self.text_head,
        mel_emb,
        self.mel_head,
        prompt=cond_latents,
        get_attns=return_attentions,
        return_latent=return_latent,
        attn_mask_cond=attn_mask_cond,
        attn_mask_text=attn_mask_text,
        attn_mask_mel=attn_mask_mel,
    )
    reject_text_logits, reject_mel_logits = self.get_logits(
        text_emb,
        self.text_head,
        mel_emb,
        self.mel_head,
        prompt=cond_latents,
        get_attns=return_attentions,
        return_latent=return_latent,
        attn_mask_cond=attn_mask_cond,
        attn_mask_text=attn_mask_text,
        attn_mask_mel=attn_mask_mel,
    )
    # get_logits returns logits shaped (batch, vocab, seq), so the soft
    # targets must be normalized over the vocabulary dimension (dim=1),
    # not dim=-1, for F.cross_entropy's probabilistic-target form;
    # detach() treats the first pass as a fixed target so gradients only
    # flow through the second pass.
    text_probs = F.softmax(text_logits, dim=1).detach()
    mel_probs = F.softmax(mel_logits, dim=1).detach()
    loss_text_dpo = F.cross_entropy(reject_text_logits, text_probs)
    loss_mel_dpo = F.cross_entropy(reject_mel_logits, mel_probs)
TTS/tts/layers/xtts/trainer/gpt_trainer.py

    loss_dict["loss_text_ce"] = loss_text * self.args.gpt_loss_text_ce_weight
    loss_dict["loss_mel_ce"] = loss_mel * self.args.gpt_loss_mel_ce_weight
    # The new consistency terms reuse the existing CE weights.
    loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_ce_weight
    loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_ce_weight
    loss_dict["loss"] = (
        loss_dict["loss_text_ce"]
        + loss_dict["loss_mel_ce"]
        + loss_dict["loss_text_dpo"]
        + loss_dict["loss_mel_dpo"]
    )
- VRAM usage and training time comparison:
  - Without DPO loss: X GB VRAM, Y minutes per epoch
  - With DPO loss: 2X GB VRAM, 2Y minutes per epoch
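As a design note, the trainer snippet above reuses the CE weights for the new terms. If you want to scale them independently, dedicated weights could be added; the gpt_loss_text_dpo_weight and gpt_loss_mel_dpo_weight fields below are hypothetical, not existing GPTArgs fields:

    # Hypothetical dedicated weights (would need to be added to GPTArgs):
    loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_dpo_weight
    loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_dpo_weight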
Can you give me an explanation? And how can I try it?
When the GPT-2 model generates shorter sentences, it sometimes fails to produce the [STOP] token accurately, which results in peculiar sounds being included in the generated content. These sounds are inconsistent because they are not explicitly guided, so each generation might differ. To address this, during training I compare the outputs of two generations produced under the same conditions to detect such peculiar sounds. Whether both generations contain strange sounds or only one of them does, the model receives a penalty. This encourages it to avoid generating incoherent random content.
The method corresponds to the modifications to TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py shown above. I am currently testing which loss function is more stable: compared to cross-entropy, MSE eliminates abnormal sounds more reliably, but I am not sure whether it is theoretically correct. This method can only be used during fine-tuning, and when you use it, make sure your fine-tuning dataset includes enough short audio files.
Wouldn't it be easier to impose a penalty on the length of the generated sequence, based on median character-per-second data?
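For reference, a sketch of how such a penalty could be used to re-rank sampled candidates; a generated sequence length is an integer, so this works as a re-ranking score rather than a differentiable training loss, and the rates below are assumptions to be estimated from the dataset:

    def overshoot_seconds(gen_audio_tokens: int, text_chars: int,
                          chars_per_sec: float = 15.0,   # dataset median (assumption)
                          tokens_per_sec: float = 21.5,  # GPT audio-token rate (assumption)
                          slack: float = 1.5) -> float:
        """Seconds by which a candidate exceeds the duration implied by the
        text length at the median speaking rate; 0.0 if within the budget."""
        expected = text_chars / chars_per_sec
        actual = gen_audio_tokens / tokens_per_sec
        return max(0.0, actual - slack * expected)

    # Example: sample several candidates and keep the least-overshooting one.
    # candidates = [(audio_tokens, wav), ...]  # from a hypothetical sampling loop
    # best = min(candidates, key=lambda c: overshoot_seconds(len(c[0]), len(text)))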
Can you share some samples generated with the DPO loss?
@bensonbs Thank you for the clear explanation. Could you please share some samples after applying DPO, and comment on the audio quality?
Same issue.
Hi everybody, I found the optimal way to fix this issue: just fine-tune the DVAE with your data :D
Hello, can you be more specific?
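A minimal sketch of what fine-tuning the DVAE on your own data could look like; the import path, constructor arguments, and file names below are assumptions to verify against your TTS version and the released dvae.pth config:

    import torch
    from torch.utils.data import DataLoader

    # Assumed import path for the XTTS DVAE; may differ between versions.
    from TTS.tts.layers.xtts.dvae import DiscreteVAE

    dvae = DiscreteVAE(
        channels=80, positional_dims=1, num_tokens=1024, codebook_dim=512,
        hidden_dim=512, num_resnet_blocks=3, kernel_size=3, num_layers=2,
        use_transposed_convs=False,
    )  # hyperparameters assumed to match the released checkpoint
    dvae.load_state_dict(torch.load("dvae.pth", map_location="cpu"))
    dvae.cuda().train()

    # Placeholder data: a tensor of equal-length mel crops, shape (N, 80, T).
    mels = torch.load("my_mels.pt")
    loader = DataLoader(mels, batch_size=8, shuffle=True)

    opt = torch.optim.AdamW(dvae.parameters(), lr=1e-5)
    for epoch in range(5):
        for mel in loader:
            # forward() is assumed to return (reconstruction loss,
            # commitment loss, reconstruction), as in the repo's DVAE.
            recon_loss, commit_loss, _ = dvae(mel.cuda())
            loss = recon_loss.mean() + commit_loss.mean()
            opt.zero_grad()
            loss.backward()
            opt.step()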