FunASR
Paraformer online (real-time) mode: error when combining chunking and VAD
According to Figure 1 (the FunASR real-time speech dictation service architecture diagram), the following steps are needed to implement online streaming mode:
- Split the audio into 600 ms chunks (with 300 ms overlap).
- For each chunk, produce a VAD list (in practice the list usually holds a single VAD segment, because a 600 ms chunk is short).
- For each VAD segment in the list: a) run ASR in streaming mode to produce vad_txt; b) run punctuation (punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727) and update vad_txt to punc_txt.
This workflow consistently produces the error below:
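The first step can be sketched in plain NumPy. This is only an illustration of the chunking as worded above, assuming 16 kHz mono audio; `iter_chunks` and its parameters are hypothetical names, not part of FunASR or ModelScope. Setting `hop_ms` equal to `chunk_ms` gives non-overlapping strides of 9600 samples (10 frames of 60 ms each, which is where a factor of 960 samples per frame would come from at 16 kHz).

```python
import numpy as np


def iter_chunks(speech, sample_rate=16000, chunk_ms=600, hop_ms=300):
    """Yield (start_sample, chunk, is_final) windows of chunk_ms, advancing by hop_ms.

    Hypothetical helper for illustration only.
    """
    chunk_len = sample_rate * chunk_ms // 1000  # 9600 samples for 600 ms at 16 kHz
    hop_len = sample_rate * hop_ms // 1000      # 4800 samples for 300 ms at 16 kHz
    total = len(speech)
    start = 0
    while start < total:
        end = min(start + chunk_len, total)
        is_final = end >= total  # the last window reaches the end of the audio
        yield start, speech[start:end], is_final
        if is_final:
            break
        start += hop_len


# One second of silence stands in for real audio.
chunks = list(iter_chunks(np.zeros(16000, dtype=np.float32)))
starts = [start for start, _, _ in chunks]
```

Each yielded chunk would then go through VAD, ASR, and punctuation in turn, with `is_final` passed along so the streaming models can flush their caches on the last chunk.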
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{'text': [[0, 480]]}
vad_0: 0.0->7680.0: stride_vad_clip_result:{'text': '啊就'}
punc_0: 0.0->7680.0: stride_vad_clip_result:{'text': '啊,就'}
vad_segments:{'text': [[0, -1]]}
vad_0: 0.0->-16.0: stride_vad_clip_result:{'text': '对'}
punc_0: 0.0->-16.0: stride_vad_clip_result:{'text': '对'}
vad_segments:{}
vad_segments:{'text': [[-1, 1480]]}
vad_0: -16.0->23680.0: stride_vad_clip_result:{'text': '对'}
punc_0: -16.0->23680.0: stride_vad_clip_result:{'text': '对'}
vad_segments:{}
vad_segments:{}
vad_segments:{'text': [[970, -1]]}
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/gradio/routes.py", line 534, in predict
output = await route_utils.call_process_api(
File "/opt/conda/lib/python3.8/site-packages/gradio/route_utils.py", line 226, in call_process_api
output = await app.get_blocks().process_api(
File "/opt/conda/lib/python3.8/site-packages/gradio/blocks.py", line 1550, in process_api
result = await self.call_function(
File "/opt/conda/lib/python3.8/site-packages/gradio/blocks.py", line 1185, in call_function
prediction = await anyio.to_thread.run_sync(
File "/opt/conda/lib/python3.8/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/opt/conda/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/opt/conda/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/opt/conda/lib/python3.8/site-packages/gradio/utils.py", line 661, in wrapper
response = f(*args, **kwargs)
File "ParaformerOnline.py", line 113, in RUN
speech_stride_vad_clip_result = inference_pipeline(
File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 258, in __call__
output = self.forward(output, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 511, in forward
inputs['asr_result'] = self.run_inference(self.cmd, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 586, in run_inference
asr_result = self.funasr_infer_modelscope(cmd['name_and_type'],
File "/opt/conda/lib/python3.8/site-packages/funasr/bin/asr_inference_launch.py", line 1336, in _forward
asr_result = speech2text(cache, raw_inputs, input_lens)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/funasr/bin/asr_infer.py", line 802, in __call__
if feats.shape[1] != 0:
IndexError: tuple index out of range
vad_segments:{}
vad_segments:{}
vad_segments:{}
vad_segments:{'text': [[440, 980]]}
vad_0: 7040.0->15680.0: stride_vad_clip_result:{}
The output ends up like this: 我是哎那不起你飞机平时是真不是嗯没样工业, with no punctuation in it.
My code is below:
import librosa
import numpy as np


def RUN(audio_stream, speech_txt):
    """
    speech_txt: the text recognized so far, accumulated after punctuation
    """
    samplerate = 16000
    speech = np.empty(shape=None, dtype=np.float32)
    speech, samplerate = librosa.load(
        audio_stream,
        sr=16000,
        mono=True,
        offset=0.0,
        duration=None,
        dtype=np.float32,
    )
    speech_length = speech.shape[0]
    # speech_stride_vad_segments = inference_pipeline_vad(audio_in=speech)
    sample_offset = 0
    # first 5: look back 5 frames; 10: text_n of 10 frames, i.e. 600 ms; second 5: look ahead 5 frames
    chunk_size = [5, 10, 5]
    # stride_size = chunk_size[1] * 960  # why 960?
    stride_size = chunk_size[1] * 960
    param_dict = {"cache": dict(), "is_final": False, "chunk_size": chunk_size}
    param_punc = {"cache": []}
    # final_result = ""
    punc_list = []  # list to feed into punc online mode
    # punc_list.append(speech_txt)  # use the previously punctuated text as the first item after VAD, then feed it to pipeline_punc()
    for sample_offset in range(
        0, speech_length, min(stride_size, speech_length - sample_offset)
    ):
        if sample_offset + stride_size >= speech_length - 1:
            stride_size = speech_length - sample_offset
            param_dict["is_final"] = True
        else:
            param_dict["is_final"] = False
        speech_stride = speech[sample_offset : sample_offset + stride_size]
        # 1. VAD:
        speech_stride_vad_segments = inference_pipeline_vad(
            audio_in=speech_stride, param_dict=param_dict
        )
        print(f"vad_segments:{speech_stride_vad_segments}")
        # 2. ASR:
        # once audio is detected, run streaming inference every 600 ms
        if "text" in speech_stride_vad_segments:
            for i, segments in enumerate(speech_stride_vad_segments["text"]):
                beg_idx = segments[0] * samplerate / 1000
                end_idx = segments[1] * samplerate / 1000
                speech_stride_vad_clip = speech_stride[int(beg_idx) : int(end_idx)]
                speech_stride_vad_clip_result = inference_pipeline(
                    audio_in=speech_stride_vad_clip, param_dict=param_dict
                )
                print(
                    f"vad_{i}: {beg_idx}->{end_idx}: stride_vad_clip_result:{speech_stride_vad_clip_result}"
                )
                # 3. punc
                # when ASR returns non-empty text
                if "text" in speech_stride_vad_clip_result:
                    speech_stride_vad_clip_punc_result = inference_pipeline_punc(
                        text_in=speech_stride_vad_clip_result["text"],
                        param_dict=param_punc,
                    )
                    print(
                        f"punc_{i}: {beg_idx}->{end_idx}: stride_vad_clip_result:{speech_stride_vad_clip_punc_result}"
                    )
                else:
                    speech_stride_vad_clip_punc_result = speech_stride_vad_clip_result
                # 3. punc online:
                if "text" in speech_stride_vad_clip_punc_result:
                    speech_txt += speech_stride_vad_clip_punc_result["text"]
    return speech_txt, speech_txt
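A possible reading of the log, offered as an assumption rather than a confirmed diagnosis: the online VAD appears to report timestamps in milliseconds relative to the start of the stream (values such as 1480 exceed the 600 ms chunk) and to use -1 for a segment boundary it has not seen yet. Slicing the current chunk with those values, as for [[970, -1]], gives speech_stride[15520:-16], which is empty for a 9600-sample chunk, and an empty input would explain the IndexError on feats.shape[1]. The sketch below keeps a running buffer and maps the segments onto it instead. `StreamSegmenter` and its methods are hypothetical names for illustration, not part of FunASR or ModelScope.

```python
import numpy as np


class StreamSegmenter:
    """Map online-VAD output onto buffered audio.

    Assumes segments are [begin_ms, end_ms] relative to the stream start,
    with -1 marking a boundary that has not been detected yet.
    """

    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.buffer = np.zeros(0, dtype=np.float32)  # all audio seen so far (unbounded here)
        self.cursor = None  # next unsent sample of the open segment, or None if no segment is open

    def _to_samples(self, ms):
        return int(ms * self.sample_rate / 1000)

    def push(self, chunk, segments):
        """Append a chunk and return a list of (clip, segment_ended) pairs for ASR."""
        self.buffer = np.concatenate([self.buffer, chunk])
        clips = []
        # An empty VAD result while a segment is open is treated as "speech continues".
        if not segments and self.cursor is not None:
            segments = [[-1, -1]]
        for beg_ms, end_ms in segments:
            if beg_ms != -1:
                self.cursor = self._to_samples(beg_ms)
            if self.cursor is None:
                continue  # an end without a known start: nothing to send
            ended = end_ms != -1
            stop = min(self._to_samples(end_ms), len(self.buffer)) if ended else len(self.buffer)
            clip = self.buffer[self.cursor:stop]
            if clip.size > 0:  # never hand an empty array to the recognizer
                clips.append((clip, ended))
            self.cursor = None if ended else stop
        return clips


# Replay a pattern like the one in the log, using 600 ms chunks of dummy audio.
chunk = np.ones(9600, dtype=np.float32)
segmenter = StreamSegmenter()
out_silence = segmenter.push(chunk, [])
out_start = segmenter.push(chunk, [[970, -1]])
out_middle = segmenter.push(chunk, [])
out_end = segmenter.push(chunk, [[-1, 1900]])
naive_clip = chunk[int(970 * 16):int(-1 * 16)]  # the per-chunk slice used in the code above
```

Each returned clip could then be passed to inference_pipeline, with param_dict["is_final"] set from the `ended` flag so the streaming cache is flushed when a segment closes. Whether that matches the intended use of is_final is also an assumption.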
Could anyone help explain what this error means and how to fix it?
OS: Linux
Python/C++ Version: 3.8
Package Version: pytorch, torchaudio, modelscope, funasr versions (pip list): same as the ModelScope Notebook
Model:
inference_pipeline_vad = pipeline(
    task=Tasks.voice_activity_detection,
    model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    model_revision=None,
    output_dir=output_dir,
    batch_size=1,
    mode="online",
)
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model="damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online",
    model_revision=None,
    # update_model=False,
    mode="paraformer_streaming",
    output_dir=output_dir,
)
inference_pipeline_punc = pipeline(
    task=Tasks.punctuation,
    model="damo/punc_ct-transformer_zh-cn-common-vad_realtime-vocab272727",
    model_revision=None,
)
Command:
Details:
Error log: as above