FunASR
Bug in `load_audio_text_image_video` when handling audio that is 1 sample long
🐛 Bug
I am running real-time (streaming) speech recognition on a batch of audio files with the paraformer-zh-streaming model. Here is my code, following the template recommended on ModelScope:
```python
# From https://www.modelscope.cn/models/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-online
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] = 600ms, [0, 8, 4] = 480ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming", model_revision="v2.0.4")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms at 16 kHz

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride : (i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
```
It works fine on most audio files, but on some it fails with:
```
File "/lib/python3.8/site-packages/funasr/models/paraformer_streaming/model.py", line 600, in inference
    audio_sample = torch.cat((cache["prev_samples"], audio_sample_list[0]))
RuntimeError: zero-dimensional tensor (at position 1) cannot be concatenated
```
The failing 16 kHz mono audio has 67201 samples. The main script splits it into 8 chunks: the first 7 chunks are 9600 samples long, and the 8th is 1 sample long.
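The slicing arithmetic can be checked independently of the model. This is a sketch that reuses the same `chunk_stride` as the script above (no FunASR code is needed):

```python
# Reproduce the chunking arithmetic from the main script for a
# 67201-sample input with chunk_stride = 9600.
n_samples = 67201
chunk_stride = 9600

total_chunk_num = int((n_samples - 1) / chunk_stride + 1)  # ceiling division
chunk_lens = [
    min((i + 1) * chunk_stride, n_samples) - i * chunk_stride
    for i in range(total_chunk_num)
]
print(total_chunk_num)  # 8
print(chunk_lens)       # [9600, 9600, 9600, 9600, 9600, 9600, 9600, 1]
```

Any sample count of the form `7 * 9600 + 1` would produce the same degenerate final chunk.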
Each chunk is then loaded as a tensor by the `load_audio_text_image_video` function in `funasr/utils/load_utils.py`:

```python
data_or_path_or_list = torch.from_numpy(data_or_path_or_list).squeeze() # [n_samples,]
```

This line works as expected in most cases, but when the input `data_or_path_or_list` has length 1, `squeeze()` removes the only dimension of the resulting tensor, which triggers the error above.
For this audio, the 8 calls to `load_audio_text_image_video` return 7 tensors of shape `torch.Size([9600])` followed by one zero-dimensional tensor of shape `torch.Size([])`.
- Test audio with 67200 or 67202 samples does not trigger the problem.
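The dimension loss can be demonstrated in isolation. In this sketch, `prev` is a stand-in for `cache["prev_samples"]`; only `numpy` and `torch` are needed:

```python
import numpy as np
import torch

# A 1-sample chunk, like the 8th chunk of the failing audio.
chunk = np.zeros(1, dtype=np.float32)

# squeeze() removes the only axis, leaving a 0-d tensor.
t = torch.from_numpy(chunk).squeeze()
print(t.shape, t.dim())  # torch.Size([]) 0

# torch.cat requires tensors with at least one dimension, so this
# reproduces the RuntimeError raised inside paraformer_streaming/model.py.
prev = torch.zeros(100)
raised = False
try:
    torch.cat((prev, t))
except RuntimeError as e:
    raised = True
    print(e)
print(raised)  # True
```

By contrast, a 9600-sample chunk keeps its shape `[9600]` after `squeeze()`, which is why only length-1 chunks fail.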
To Reproduce
Versions that may help with reproduction: torch 2.3.1, funasr 1.1.4, numpy 1.24.4.
The script, model, and third-party dependencies are all given above. To reproduce, additionally: 1) prepare a 16 kHz mono WAV file with exactly 67201 samples;
```shell
# 67202 samples
sox -n -b 16 -r 16000 output.wav synth 4.2001 sine 400
# 67201 samples
sox -n -b 16 -r 16000 output.wav synth 4.20005 sine 400
# 67200 samples
sox -n -b 16 -r 16000 output.wav synth 4.200005 sine 400
```
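If `sox` is not available, an equivalent 67201-sample file can be generated from Python alone. This sketch uses only the standard library; the 400 Hz sine and 16-bit PCM depth mirror the sox call above:

```python
import math
import struct
import wave

n_samples, sr, freq = 67201, 16000, 400

# 16-bit little-endian PCM frames of a 400 Hz sine at half amplitude.
frames = b"".join(
    struct.pack("<h", int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / sr)))
    for i in range(n_samples)
)

with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 16-bit
    f.setframerate(sr)  # 16 kHz
    f.writeframes(frames)

# Verify the sample count.
with wave.open("output.wav", "rb") as f:
    n_written = f.getnframes()
print(n_written)  # 67201
```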
2) run streaming recognition with the main script above.
For this test case, the following change makes it pass:
```diff
- data_or_path_or_list = torch.from_numpy(data_or_path_or_list).squeeze() # [n_samples,]
+ data_or_path_or_list = torch.from_numpy(data_or_path_or_list) # [n_samples,]
```
Since my understanding of funasr is not comprehensive enough to write test cases that fully cover `load_audio_text_image_video`, I have not opened a PR.
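A more conservative variant, sketched below, would keep `squeeze()` for multi-dimensional inputs while guaranteeing at least one dimension via `torch.atleast_1d`. Note that `load_as_1d` is a hypothetical helper for illustration, not the project's actual code:

```python
import numpy as np
import torch

def load_as_1d(arr: np.ndarray) -> torch.Tensor:
    # Squeeze away extra axes, then restore one axis if everything
    # was squeezed away (the length-1 case that triggers the bug).
    return torch.atleast_1d(torch.from_numpy(arr).squeeze())

print(load_as_1d(np.zeros(1, dtype=np.float32)).shape)           # torch.Size([1])
print(load_as_1d(np.zeros((1, 9600), dtype=np.float32)).shape)   # torch.Size([9600])
```

Unlike dropping `squeeze()` entirely, this keeps the existing behaviour for inputs such as `(1, n_samples)` arrays while fixing the zero-dimensional case.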