
[TTS] No sound during inference with Chinese-English mixed streaming speech synthesis

Open jianghuakun opened this issue 3 weeks ago • 2 comments

The YAML file is as follows:

# This is the parameter configuration file for streaming tts server.

#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 0.0.0.0
port: 8090

# The task format in the engin_list is: _
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'websocket'
#engine_list: ['tts_online-onnx']
engine_list: ['tts_online']

#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc support streaming am infer.
    am: 'fastspeech2_mix'
    am_config: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/default.yaml'
    #/root/.paddlespeech/models/fastspeech2_csmsc-zh/1.0/fastspeech2_nosil_baker_ckpt_0.4/default.yaml
    am_ckpt: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/snapshot_iter_99200.pdz'
    am_stat: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/speech_stats.npy'
    phones_dict: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/phone_id_map.txt'
    tones_dict:
    speaker_dict: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/fastspeech2_mix_ckpt_1.2.0/speaker_id_map.txt'
    #spk_id: 175

    # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
    voc: 'hifigan_csmsc'
    voc_config: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/hifigan_csmsc_ckpt_0.1.1/default.yaml'
    voc_ckpt: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/hifigan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz'
    voc_stat: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/data/hifigan_csmsc_ckpt_0.1.1/feats_stats.npy'

    # others
    lang: 'mix'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer,
    # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio
    am_block: 72
    am_pad: 12
    # voc_pad and voc_block voc model to streaming voc infer,
    # when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal
    # when voc model is hifigan_csmsc, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal
    voc_block: 36
    voc_pad: 19

#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer.
    am: 'fastspeech2_csmsc_onnx'
    # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
    #am_config: 'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/'
    am_ckpt: #['/root/.paddlespeech/models/fastspeech2_csmsc_onnx-zh/1.0/fastspeech2_csmsc_onnx_0.2.0/fastspeech2_csmsc.onnx']
    #['/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx',
    #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx',
    #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx']
    #'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz' # list
    am_stat: #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy'
    #'fastspeech2_csmsc_onn2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy'
    #/fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy'
    phones_dict: #'/root/.paddlespeech/models/fastspeech2_csmsc_onnx-zh/1.0/fastspeech2_csmsc_onnx_0.2.0/phone_id_map.txt'
    #'/home/PaddleSpeech/examples/zh_en_tts/tts3/fastspeech2_cnndecoder_csmsc_onnx/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt'
    #'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt'
    tones_dict:
    speaker_dict: #'fastspeech2_csmsc_onnx/fastspeech2_nosil_baker_ckpt_0.4/'
    am_sample_rate: 24000
    am_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 12

    # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
    voc: 'mb_melgan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
    voc_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 12

    # others
    lang: 'zh'
    # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer,
    # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio
    am_block: 72
    am_pad: 12
    # voc_pad and voc_block voc model to streaming voc infer,
    # when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal
    # when voc model is hifigan_csmsc_onnx, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal
    voc_block: 36
    voc_pad: 14
    # voc_upsample should be same as n_shift on voc config.
    voc_upsample: 300
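
The config comments on `voc_block`, `voc_pad`, and `voc_upsample` describe block-wise vocoder inference with padded context on each side of a block. The following is a minimal sketch of that idea in plain Python, under stated assumptions: it is not PaddleSpeech's actual `get_chunks` or `depadding` code, the names `split_with_padding`, `strip_padding`, `toy_vocoder`, and `stream` are hypothetical, and the "vocoder" is a stand-in that just repeats each frame `upsample` times.

```python
import math


def split_with_padding(frames, block, pad):
    """Split frames into blocks of `block` frames, each extended by up to
    `pad` context frames on both sides (clipped at the sequence edges)."""
    chunks = []
    n_chunks = math.ceil(len(frames) / block)
    for i in range(n_chunks):
        start = max(0, i * block - pad)
        end = min((i + 1) * block + pad, len(frames))
        chunks.append(frames[start:end])
    return chunks


def strip_padding(samples, n_chunks, i, block, pad, upsample):
    """Drop the samples that came from context frames of chunk `i`."""
    front = min(i * block, pad) * upsample  # the first chunk has no front context
    if i == n_chunks - 1:
        return samples[front:]              # the last chunk has no back context
    return samples[front:front + block * upsample]


def toy_vocoder(chunk, upsample):
    """Stand-in for a real vocoder: each frame becomes `upsample` samples."""
    return [value for value in chunk for _ in range(upsample)]


def stream(frames, block=36, pad=19, upsample=300):
    """Yield audio piece by piece, as a streaming server would."""
    chunks = split_with_padding(frames, block, pad)
    for i, chunk in enumerate(chunks):
        wav = toy_vocoder(chunk, upsample)
        yield strip_padding(wav, len(chunks), i, block, pad, upsample)
```

With the toy vocoder, joining the streamed pieces reproduces the non-streaming output exactly. A real vocoder has a receptive field, which is presumably why the comments tie "same as non-streaming" to a minimum pad value per model.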

Source added to tts_engine.py for the mix model:

    elif am_dataset == "mix":
        # am
        spk_id = 174
        spk_id = [spk_id]
        mel = self.executor.am_inference(part_phone_ids)
        if first_flag == 1:
            first_am_et = time.time()
            self.first_am_infer = first_am_et - frontend_et

        # voc streaming
        mel_chunks = get_chunks(mel, self.voc_block, self.voc_pad, "voc")
        voc_chunk_num = len(mel_chunks)
        voc_st = time.time()
        for i, mel_chunk in enumerate(mel_chunks):
            sub_wav = self.executor.voc_inference(mel_chunk)
            sub_wav = self.depadding(sub_wav, voc_chunk_num, i,
                                     self.voc_block, self.voc_pad,
                                     self.voc_upsample)
            if first_flag == 1:
                first_voc_et = time.time()
                self.first_voc_infer = first_voc_et - first_am_et
                self.first_response_time = first_voc_et - frontend_st
                first_flag = 0

            yield sub_wav

The other conditional checks were also extended to handle the mix model.
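
When a streaming branch like this produces no audible output, it may help to separate three cases: the generator yields nothing, it yields empty chunks, or it yields chunks whose samples are all zero. The helper below is a hypothetical debugging aid, not part of PaddleSpeech. It assumes each chunk is a flat sequence of numbers, so an array-shaped `sub_wav` would need to be flattened to a list first.

```python
def summarize_chunks(chunks):
    """Return (chunk_count, total_samples, peak_abs) for an iterable of
    flat sample sequences, e.g. the pieces yielded by a streaming loop."""
    count = 0
    total = 0
    peak = 0.0
    for chunk in chunks:
        count += 1
        total += len(chunk)
        for sample in chunk:
            peak = max(peak, abs(float(sample)))
    return count, total, peak
```

A count of zero would point at the branch never being reached, a total of zero at empty chunks, and a peak of zero at silent samples coming out of the models.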

jianghuakun, Jun 13 '24 06:06