[Bug] Orpheus_tts Spanish finetune cannot generate valid voice

Open yxk9810 opened this issue 1 month ago • 4 comments

changes:

  1. model changed to canopylabs/3b-es_it-ft-research_release

  2. max length: 3200

  3. redistribute_codes changed to:

        def redistribute_codes(code_list):
            if len(code_list) == 0:
                print("Warning: Empty code list, returning silence")
                return torch.zeros(1, 1, 24000)  # 1 second of silence

            layer_1 = []
            layer_2 = []
            layer_3 = []

            for i in range(len(code_list) // 7):
                try:
                    c0 = code_list[7*i]
                    c1 = code_list[7*i+1] - 4096
                    c2 = code_list[7*i+2] - (2*4096)
                    c3 = code_list[7*i+3] - (3*4096)
                    c4 = code_list[7*i+4] - (4*4096)
                    c5 = code_list[7*i+5] - (5*4096)
                    c6 = code_list[7*i+6] - (6*4096)

                    # Check the range and clip
                    c0 = max(0, min(c0, 4095))
                    c1 = max(0, min(c1, 4095))
                    c2 = max(0, min(c2, 4095))
                    c3 = max(0, min(c3, 4095))
                    c4 = max(0, min(c4, 4095))
                    c5 = max(0, min(c5, 4095))
                    c6 = max(0, min(c6, 4095))

                    layer_1.append(c0)
                    layer_2.append(c1)
                    layer_3.append(c2)
                    layer_3.append(c3)
                    layer_2.append(c4)
                    layer_3.append(c5)
                    layer_3.append(c6)

                except Exception as e:
                    print(f"Error at frame {i}: {e}")
                    continue

            if len(layer_1) == 0:
                print("Warning: No valid codes decoded, returning silence")
                return torch.zeros(1, 1, 24000)

            codes = [
                torch.tensor(layer_1, dtype=torch.long).unsqueeze(0),
                torch.tensor(layer_2, dtype=torch.long).unsqueeze(0),
                torch.tensor(layer_3, dtype=torch.long).unsqueeze(0),
            ]

            audio_hat = snac_model.decode(codes)
            return audio_hat

It only generates silent audio.

yxk9810 avatar Oct 30 '25 12:10 yxk9810

@Etherll Unsure if you know anything about this

danielhanchen avatar Nov 01 '25 12:11 danielhanchen

Can you share your setup, like a notebook link? I tried the official notebook and it worked fine. I set model_name = 'canopylabs/3b-es_it-ft-research_release' and trained for one epoch using the ylacombe/google-argentinian-spanish dataset. It already sounds pretty good to me.

Here's the audio after fine-tuning for reference: https://vocaroo.com/15Kfj81345kI

I think the problem is caused by your changes to the redistribute_codes function.
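
To check whether the clamping is masking a decoding problem, a minimal sketch like the one below could count out-of-range values instead of clipping them. It follows the seven-token frame layout from the function in the first post. The names deinterleave_with_report, CODEBOOK_SIZE, and FRAME_SIZE are hypothetical, and the input is assumed to be a plain list of integers with any model-specific token offset already removed.

```python
from collections import Counter

CODEBOOK_SIZE = 4096  # per-slot offset used in the posted function
FRAME_SIZE = 7        # tokens per frame in the posted function


def deinterleave_with_report(code_list):
    """Split flat 7-token frames into three layers without clamping.

    Returns the three layers plus a dict mapping frame slot (0..6) to the
    number of values that fell outside 0..CODEBOOK_SIZE-1 after the offset
    for that slot was subtracted.
    """
    layer_1, layer_2, layer_3 = [], [], []
    # Slot -> destination layer, same order as the posted function.
    targets = [layer_1, layer_2, layer_3, layer_3, layer_2, layer_3, layer_3]
    out_of_range = Counter()

    for i in range(len(code_list) // FRAME_SIZE):
        for slot in range(FRAME_SIZE):
            value = code_list[FRAME_SIZE * i + slot] - slot * CODEBOOK_SIZE
            if not 0 <= value < CODEBOOK_SIZE:
                out_of_range[slot] += 1
            targets[slot].append(value)

    return (layer_1, layer_2, layer_3), dict(out_of_range)
```

If most values land outside 0..4095, the token offsets or the filtering done before this step are probably off, and clipping would only turn those values into arbitrary codebook entries rather than fix them.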

Etherll avatar Nov 01 '25 23:11 Etherll

Sorry for the late reply. This is the notebook I used: https://colab.research.google.com/drive/1sINXJCjZFPQDtUD9nYBdS3vRkKndLgsJ?usp=sharing The data I used can be downloaded from this link: https://drive.google.com/file/d/1rpMGSQLcdas9oM6xQ31rsvtp_KU7BxtF/view?usp=sharing I'm new to speech fine-tuning and would greatly appreciate it if you could help me.

@Etherll

yxk9810 avatar Nov 13 '25 08:11 yxk9810

I'm facing a strange problem. I trained the model on a custom dataset from a female speaker, but the output voice is male. My dataset was created by splitting a female speaker's audio into 10s clips and transcribing them with ASR. I trained for 20 steps, and the training process completes without errors. I tested with the official female dataset, and it works correctly, producing a female voice. This suggests the issue is likely with my custom dataset. Could you provide any insights on what might be going wrong? Thanks. @Etherll, you can find my notebook here: https://colab.research.google.com/drive/1V0b2wza_SGx9hs26g-JNov_cufS5HIq6?usp=sharing
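
One thing that may be worth ruling out first is a sample-rate or channel mismatch in the custom clips, since audio interpreted at the wrong rate is shifted in pitch. The sketch below is only a sanity check under assumptions: it assumes the clips are PCM WAV files, that the pipeline expects 24 kHz mono audio (the 24000-sample "1 second of silence" in the first post suggests that rate), and the names describe_wav and find_suspect_clips are hypothetical.

```python
import wave


def describe_wav(path):
    """Return the sample rate, channel count, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration_s = wav_file.getnframes() / sample_rate
    return {"sample_rate": sample_rate, "channels": channels, "duration_s": duration_s}


def find_suspect_clips(paths, expected_rate=24000, max_duration_s=12.0):
    """List clips whose rate, channel count, or length differs from what is assumed.

    expected_rate and max_duration_s are assumptions to adjust for the dataset.
    """
    suspects = []
    for path in paths:
        info = describe_wav(path)
        if (
            info["sample_rate"] != expected_rate
            or info["channels"] != 1
            or info["duration_s"] > max_duration_s
        ):
            suspects.append((path, info))
    return suspects
```

If every clip passes, the cause is more likely elsewhere, for example in the transcripts or in the very short 20-step training run.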

yxk9810 avatar Nov 14 '25 07:11 yxk9810