
[Help]: MaskGCT's results were very strange

Open WhiteNightMo opened this issue 1 year ago • 11 comments

Problem Overview

I modified the file models/tts/maskgct/maskgct_inference.py as follows:

    # inference
    prompt_wav_path = "./models/tts/maskgct/wav/5s.wav"
    save_path = "generated_audio7.wav"
    prompt_text = "想要交友吗?快来SOUL啊"
    target_text = "新用户真的可以享年化利率最低3.6%的优惠"
    # Specify the target duration (in seconds). If target_len = None, we use a simple rule to predict the target duration.
    target_len = None
    maskgct_inference_pipeline = MaskGCT_Inference_Pipeline(
        semantic_model,
        semantic_codec,
        codec_encoder,
        codec_decoder,
        t2s_model,
        s2a_model_1layer,
        s2a_model_full,
        semantic_mean,
        semantic_std,
        device,
    )

    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
        prompt_wav_path, prompt_text, target_text, "zh", "zh", target_len=target_len
    )

    sf.write(save_path, recovered_audio, 24000)

Run command:

python -m models.tts.maskgct.maskgct_inference

The output did not meet my expectations.

My prompt audio: 5s.zip

Output file: generated_audio7.zip

WhiteNightMo, Nov 21 '24

It seems the target length is too long. You can specify an appropriate target_len yourself.

HeCheng0625, Nov 21 '24
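A minimal sketch of one way to pick target_len yourself, assuming the goal is to keep the prompt's speaking rate: divide the prompt duration by the number of spoken units in prompt_text, then multiply by the number of units in target_text. count_speech_units and estimate_target_len are hypothetical helpers, not part of MaskGCT, and counting one unit per Chinese character, digit, or Latin word is a rough heuristic that undercounts symbols read aloud, such as "%".

```python
import re


def count_speech_units(text: str) -> int:
    """Hypothetical heuristic: one unit per CJK character, one per digit,
    and one per run of Latin letters. Punctuation and symbols are ignored."""
    cjk = re.findall(r"[\u4e00-\u9fff]", text)
    digits = re.findall(r"[0-9]", text)
    latin_words = re.findall(r"[A-Za-z]+", text)
    return len(cjk) + len(digits) + len(latin_words)


def estimate_target_len(prompt_seconds: float, prompt_text: str, target_text: str) -> float:
    """Estimate the target duration in seconds by keeping the prompt's
    seconds-per-unit rate constant for the target text."""
    prompt_units = count_speech_units(prompt_text)
    if prompt_units == 0:
        raise ValueError("prompt_text contains no countable speech units")
    seconds_per_unit = prompt_seconds / prompt_units
    return seconds_per_unit * count_speech_units(target_text)


if __name__ == "__main__":
    # Texts from this issue; 5.0 assumes the whole prompt clip is speech.
    est = estimate_target_len(5.0, "想要交友吗?快来SOUL啊", "新用户真的可以享年化利率最低3.6%的优惠")
    print(f"target_len ~ {est:.1f} s")
```

With the texts from this issue and a 5 s prompt, this gives 95/9, about 10.6 s. If the prompt clip contains silence, the real speaking rate is faster than the estimate assumes, so the result comes out too long.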


I tried setting target_len to 8, but the output audio was missing the first three words and was slow overall. With 10, the output read the whole text, but the speech was very slow (attached: 10s.zip).

WhiteNightMo, Nov 21 '24
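Because any rate-based choice of target_len depends on how much of the prompt clip is actual speech, it may help to measure the prompt's voiced duration instead of using the file length. The sketch below is a simple energy-based trim in plain Python under assumed defaults (512-sample frames, a threshold 40 dB below the loudest frame); voiced_duration is a hypothetical helper, and librosa.effects.trim does a similar trim if librosa is installed.

```python
import math


def voiced_duration(samples, sample_rate, frame_len=512, top_db=40.0):
    """Seconds from the first to the last frame whose RMS is within
    top_db dB of the loudest frame (leading/trailing silence is dropped)."""
    rms = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    peak = max(rms, default=0.0)
    if peak == 0.0:
        return 0.0  # empty or fully silent input
    threshold = peak * 10 ** (-top_db / 20)
    voiced = [i for i, r in enumerate(rms) if r >= threshold]
    first, last = voiced[0], voiced[-1]
    end_sample = min((last + 1) * frame_len, len(samples))
    return (end_sample - first * frame_len) / sample_rate
```

For a mono file, something like samples, sr = sf.read("./models/tts/maskgct/wav/5s.wav") followed by voiced_duration(samples.tolist(), sr) would give the voiced seconds; the result is only accurate to about one frame.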

Has this been solved?

decajcd, Nov 22 '24

Not yet. I can't get it to work.

WhiteNightMo, Nov 22 '24

I can't tune it either: the output is either too fast or gibberish.

decajcd, Nov 22 '24

Tough. For me it's either too slow or gibberish.

WhiteNightMo, Nov 22 '24

I also get background noise.

decajcd, Nov 22 '24

Has anyone fixed this?

digitalboy, Nov 24 '24

Your prompt audio and prompt text do not match exactly.

ruby11dog, Nov 25 '24
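One way to check for a prompt/text mismatch like this is to transcribe the prompt audio with an ASR model (for example openai-whisper, via whisper.load_model(...).transcribe(path)["text"]) and diff the transcript against prompt_text. The sketch assumes the transcript is already available as a string; normalize and compare_prompt_text are hypothetical helpers that ignore case, whitespace, and punctuation before comparing with difflib.

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and drop whitespace and punctuation
    so that only the spoken content is compared."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(
        ch for ch in text
        if not (ch.isspace() or unicodedata.category(ch).startswith("P"))
    )


def compare_prompt_text(prompt_text: str, transcript: str):
    """Return (similarity ratio in [0, 1], list of (tag, prompt_span, transcript_span))
    for every non-matching region between prompt_text and the transcript."""
    a, b = normalize(prompt_text), normalize(transcript)
    matcher = difflib.SequenceMatcher(None, a, b)
    diffs = [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), diffs
```

A ratio below 1.0, or any entry in the diff list, points at words in prompt_text that the audio does not actually contain (or vice versa).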

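The speech I synthesize has a lot of background noise; I can only roughly make out the content and the timbre. After a lot of digging, I found that even the semantic codes extracted from the wav have this problem. The original poster's output sounds more normal than mine. Could you post your conda list output so I can compare package versions?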

wjinwei, Dec 05 '24
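As a lighter alternative to pasting a full conda list, a sketch like the following prints the installed versions of a few packages that are plausibly relevant. The package list is an assumption, not MaskGCT's requirements file; edit it to match the environments being compared.

```python
import sys
from importlib import metadata

# Assumed list of packages worth comparing between environments.
PACKAGES = ["torch", "torchaudio", "transformers", "librosa", "soundfile", "numpy"]


def package_versions(names):
    """Map each distribution name to its installed version string, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name, version in package_versions(PACKAGES).items():
        print(f"{name:<14} {version or 'not installed'}")
```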

Has the background noise problem been solved?

QiushiStaff, Apr 17 '25