
Paraformer Online real mode Error @ chunk+vad

Open dancingonmoon opened this issue 2 years ago • 0 comments

According to 图1 FunASR实时语音听写服务架构图 (Figure 1, the FunASR real-time speech dictation service architecture diagram), the following steps should be done to realize the online streaming mode:

  1. split the audio into 600 ms chunks (with 300 ms overlap); a minimal sketch of this step follows the list;
  2. for each chunk: produce a VAD segment list (in practice, usually only one segment per list, because the 600 ms chunk is so short);
  3. for each segment in the VAD list: a) run ASR (streaming mode) to produce vad_txt; b) run punctuation (punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727) and update vad_txt to punc_txt.
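For reference, a minimal sketch of step 1 as I understand it (the helper name and constants are my own; 16 kHz mono input assumed):

import numpy as np

SR = 16000                    # sample rate assumed throughout
CHUNK_LEN = SR * 600 // 1000  # 600 ms -> 9600 samples
HOP_LEN = SR * 300 // 1000    # 300 ms hop -> 300 ms overlap between chunks

def split_chunks(speech: np.ndarray):
    """Yield (offset, chunk, is_final) over overlapped 600 ms windows."""
    for off in range(0, len(speech), HOP_LEN):
        chunk = speech[off : off + CHUNK_LEN]
        yield off, chunk, off + CHUNK_LEN >= len(speech)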

The above workflow almost always produces the error below:

vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{'text': [[0, 480]]}
vad_0: 0.0->7680.0: stride_vad_clip_result:{'text': '啊就'}
punc_0: 0.0->7680.0: stride_vad_clip_result:{'text': '啊,就'}
vad_segments:{'text': [[0, -1]]}
vad_0: 0.0->-16.0: stride_vad_clip_result:{'text': '对'}
punc_0: 0.0->-16.0: stride_vad_clip_result:{'text': '对'}
vad_segments:{}
vad_segments:{'text': [[-1, 1480]]}
vad_0: -16.0->23680.0: stride_vad_clip_result:{'text': '对'}
punc_0: -16.0->23680.0: stride_vad_clip_result:{'text': '对'}
vad_segments:{}
vad_segments:{}
vad_segments:{'text': [[970, -1]]}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/gradio/routes.py", line 534, in predict
    output = await route_utils.call_process_api(
  File "/opt/conda/lib/python3.8/site-packages/gradio/route_utils.py", line 226, in call_process_api
    output = await app.get_blocks().process_api(
  File "/opt/conda/lib/python3.8/site-packages/gradio/blocks.py", line 1550, in process_api
    result = await self.call_function(
  File "/opt/conda/lib/python3.8/site-packages/gradio/blocks.py", line 1185, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/opt/conda/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/opt/conda/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/opt/conda/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/opt/conda/lib/python3.8/site-packages/gradio/utils.py", line 661, in wrapper
    response = f(*args, **kwargs)
  File "ParaformerOnline.py", line 113, in RUN
    speech_stride_vad_clip_result = inference_pipeline(
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 258, in __call__
    output = self.forward(output, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 511, in forward
    inputs['asr_result'] = self.run_inference(self.cmd, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 586, in run_inference
    asr_result = self.funasr_infer_modelscope(cmd['name_and_type'],
  File "/opt/conda/lib/python3.8/site-packages/funasr/bin/asr_inference_launch.py", line 1336, in _forward
    asr_result = speech2text(cache, raw_inputs, input_lens)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/funasr/bin/asr_infer.py", line 802, in __call__
    if feats.shape[1] != 0:
IndexError: tuple index out of range
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{'text': [[440, 980]]}
vad_0: 7040.0->15680.0: stride_vad_clip_result:{}

The output ends up like: 我是哎那不起你飞机平时是真不是嗯没样工业, with no punctuation inside.
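As far as I understand the model card for punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727, the realtime punctuation pipeline is meant to be fed the successive VAD texts with one shared cache, roughly like this (the sample strings are mine):

param_dict = {"cache": []}
for vad_txt in ["跨境河流是养育沿岸", "人民的生命之源"]:
    rec_result = inference_pipeline_punc(text_in=vad_txt, param_dict=param_dict)
    print(rec_result)

That is what my code below tries to do per VAD segment.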

My code is as follows:

import librosa
import numpy as np


def RUN(audio_stream, speech_txt):
    """
    speech_txt: previously accumulated recognition text (after punctuation)
    """
    samplerate = 16000
    speech, samplerate = librosa.load(
        audio_stream,
        sr=16000,
        mono=True,
        offset=0.0,
        duration=None,
        dtype=np.float32,
    )

    speech_length = speech.shape[0]

    # speech_stride_vad_segments = inference_pipeline_vad(audio_in=speech)

    sample_offset = 0
    # chunk_size = [left, middle, right] in frames:
    # 5 frames of left context, 10 frames of new text (600 ms), 5 frames of right context
    chunk_size = [5, 10, 5]
    stride_size = chunk_size[1] * 960  # why 960? (960 samples = 60 ms @ 16 kHz, so 10 frames = 600 ms)
    param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size}
    param_punc = {"cache": []}
    # final_result = ""
    punc_list = []  # list to feed into punc online mode
    # punc_list.append(speech_txt)  # feed the previously punctuated text as the first VAD item into pipeline_punc()

    for sample_offset in range(
        0, speech_length, min(stride_size, speech_length - sample_offset)
    ):
        # on the last chunk, shrink the stride and flag is_final so the models flush
        if sample_offset + stride_size >= speech_length - 1:
            stride_size = speech_length - sample_offset
            param_dict["is_final"] = True
        else:
            param_dict["is_final"] = False

        speech_stride = speech[sample_offset : sample_offset + stride_size]
        # 1. VAD:
        speech_stride_vad_segments = inference_pipeline_vad(
            audio_in=speech_stride, param_dict=param_dict
        )
        print(f"vad_segments:{speech_stride_vad_segments}")
        # 2. ASR:
        if "text" in speech_stride_vad_segments:  # 检测到音频数据后,每隔600ms进行一次流式模型推理
            for i, segments in enumerate(speech_stride_vad_segments["text"]):
                beg_idx = segments[0] * samplerate / 1000  # ms -> samples
                end_idx = segments[1] * samplerate / 1000  # note: -1 becomes a negative slice index
                speech_stride_vad_clip = speech_stride[int(beg_idx) : int(end_idx)]
                
                speech_stride_vad_clip_result = inference_pipeline(
                    audio_in=speech_stride_vad_clip, param_dict=param_dict
                )
                print(
                    f"vad_{i}: {beg_idx}->{end_idx}: stride_vad_clip_result:{speech_stride_vad_clip_result}"
                )

                # 3. punc (online):
                if "text" in speech_stride_vad_clip_result:  # when ASR returned non-empty text
                    speech_stride_vad_clip_punc_result = inference_pipeline_punc(
                        text_in=speech_stride_vad_clip_result["text"],
                        param_dict=param_punc,
                    )
                    print(
                        f"punc_{i}: {beg_idx}->{end_idx}: stride_vad_clip_result:{speech_stride_vad_clip_punc_result}"
                    )
                else:
                    speech_stride_vad_clip_punc_result = speech_stride_vad_clip_result

                if 'text' in speech_stride_vad_clip_punc_result:
                    speech_txt += speech_stride_vad_clip_punc_result["text"]

    return speech_txt, speech_txt
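
For what it's worth, the failing check "if feats.shape[1] != 0:" suggests the clip fed to ASR was empty. In my logs the VAD segments sometimes contain -1 (e.g. [[0, -1]], [[-1, 1480]], [[970, -1]]), and -1 * samplerate / 1000 becomes a negative slice index, so the clip can come out empty or reversed. The guard below is my own workaround sketch (I am not sure it is the intended handling of the -1 markers):

def clip_indices(segment, stride_len, samplerate=16000):
    # Hypothetical helper: map one VAD segment [beg_ms, end_ms] to sample
    # indices, treating -1 (a segment that stays open across chunk borders)
    # as the chunk boundary instead of a negative index.
    beg_ms, end_ms = segment
    beg = 0 if beg_ms == -1 else int(beg_ms * samplerate / 1000)
    end = stride_len if end_ms == -1 else int(end_ms * samplerate / 1000)
    return beg, min(end, stride_len)

# usage inside the loop, before calling inference_pipeline:
#     beg_idx, end_idx = clip_indices(segments, len(speech_stride))
#     if end_idx <= beg_idx:
#         continue  # skip empty clips; avoids the IndexError on feats.shape[1]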

Could anyone help figure out what this error means, and how to fix it?

OS: linux

Python/C++ Version: 3.8

Package Version: pytorch, torchaudio, modelscope, funasr versions (pip list) as in the ModelScope Notebook

Model:

inference_pipeline_vad = pipeline(
    task=Tasks.voice_activity_detection,
    model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    model_revision=None,
    output_dir=output_dir,
    batch_size=1,
    mode="online",
)
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online",
    model_revision=None,
    # update_model=False,
    mode="paraformer_streaming",
    output_dir=output_dir,
)
inference_pipeline_punc = pipeline(
    task=Tasks.punctuation,
    model="damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727",
    model_revision=None,
)

Command:

Details:

Error log: as above

dancingonmoon · Oct 25 '23 07:10