
ONNX VAD results differ significantly from pipeline VAD results

Open fatmop opened this issue 2 years ago • 0 comments

OS: mac

Python/C++ Version: Python 3.7.0

Package Version: torch 1.13.1, torchaudio 0.13.1, modelscope 1.6.1, funasr 0.7.6

Model: damo/speech_fsmn_vad_zh-cn-16k-common-pytorch, damo/speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404

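Command: 音频和代码.zip (audio and code). In short, for the same audio clip I run VAD twice, once with the pipeline VAD and once with the ONNX VAD, and then run ONNX ASR on the segments from each VAD result. Because the two VAD results differ, some silent stretches end up being recognized as Chinese text by the ASR.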

```python
import time

import numpy as np

# vad_pipeline, vad_onnx, asr_onnx, resample, speech, filename and hotwords
# are assumed to be defined elsewhere in the attached script.

# Resample the original 8 kHz PCM to 16 kHz PCM
speech = resample(fs=8000, audio_in=speech)

# Pipeline VAD
vad_result = vad_pipeline(audio_in=speech)

# ONNX VAD (int16 PCM bytes converted to a float32 array)
data_ndarray = np.frombuffer(speech, dtype=np.int16).astype(np.float32)
vad_onnx_result = vad_onnx(audio_in=data_ndarray)

print('filename ' + filename)
print('vad pipeline result ' + str(vad_result))
print('vad onnx result ' + str(vad_onnx_result))

# Not used in the snippet below; hotwords are passed to asr_onnx directly
param_dict = dict()
param_dict['hotword'] = hotwords


def asr_on_segments(segments):
    """Run ONNX ASR on each [start_ms, end_ms] VAD segment and return the texts."""
    sentences = []
    for start_ms, end_ms in segments:
        # 16 kHz int16 PCM: 16 samples/ms * 2 bytes/sample = 32 bytes/ms
        data = speech[start_ms * 32: end_ms * 32]
        data_ndarray = np.frombuffer(data, dtype=np.int16).astype(np.float32)
        rec_result = asr_onnx(wav_content=data_ndarray, hotwords=hotwords)
        if len(rec_result) > 0 and 'preds' in rec_result[0] and len(rec_result[0]['preds']) > 0:
            text = rec_result[0]['preds'][0]
        else:
            text = ''
        if text:
            sentences.append(text)
    return sentences


# ASR on the pipeline VAD segments
segments = vad_result['text'] if 'text' in vad_result else []
start_time = time.time()
sentences = asr_on_segments(segments)
end_time = time.time()
print(f'pipeline vad to asr time:{end_time - start_time} {"".join(sentences)}')

# ASR on the ONNX VAD segments
segments = vad_onnx_result[0] if len(vad_onnx_result) > 0 else []
start_time = time.time()
sentences = asr_on_segments(segments)
end_time = time.time()
print(f'onnx vad to asr time:{end_time - start_time} {"".join(sentences)}')
```
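For anyone reading the snippet above, the `* 32` factor is what maps the VAD timestamps (milliseconds) onto byte offsets in the raw PCM buffer. The sketch below spells that out, assuming 16 kHz, 16-bit, single-channel PCM as in the resampled audio here. The helper names `ms_to_byte_range` and `slice_segments` are hypothetical and are not part of FunASR.

```python
def ms_to_byte_range(start_ms, end_ms, sample_rate=16000, bytes_per_sample=2):
    """Convert a [start_ms, end_ms] VAD segment into byte offsets in raw PCM."""
    # 16000 samples/s * 2 bytes/sample / 1000 ms/s = 32 bytes per millisecond
    bytes_per_ms = sample_rate * bytes_per_sample // 1000
    return start_ms * bytes_per_ms, end_ms * bytes_per_ms


def slice_segments(pcm, segments, sample_rate=16000, bytes_per_sample=2):
    """Cut the raw PCM bytes into one chunk per VAD segment."""
    chunks = []
    for start_ms, end_ms in segments:
        lo, hi = ms_to_byte_range(start_ms, end_ms, sample_rate, bytes_per_sample)
        chunks.append(pcm[lo:hi])
    return chunks


if __name__ == "__main__":
    # Two seconds of silence at 16 kHz / 16-bit, sliced with a sample segment
    silence = bytes(2 * 16000 * 2)
    print([len(c) for c in slice_segments(silence, [[0, 510]])])
```

Each chunk can then be passed through `np.frombuffer(chunk, dtype=np.int16)` exactly as in the original code.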

Output:

```
filename 20230809153849_1.wav
vad pipeline result {'text': [[1820, 3040]]}
vad onnx result [[[0, 510], [1840, 3040]]]
pipeline vad to asr time:0.26085901260375977 汉华风电
onnx vad to asr time:0.6346349716186523 这个这个这个这个汉华风电

filename 20230809142249_1.wav
vad pipeline result {'text': [[690, 3040]]}
vad onnx result [[[0, 450], [730, 2200], [2480, 3040]]]
pipeline vad to asr time:0.32900404930114746 河寨运维站河寨
onnx vad to asr time:0.7952888011932373 这个的这个这个河寨运维站第二遍我们

filename 20230809172715_1.wav
vad pipeline result {'text': [[1790, 2380]]}
vad onnx result [[[0, 450], [1700, 2380]]]
pipeline vad to asr time:0.2934410572052002 黄岩
onnx vad to asr time:0.5049970149993896 这个这这个这个黄园

filename 20230809173125_1.wav
vad pipeline result {'text': [[1320, 2490]]}
vad onnx result [[[0, 540], [1310, 2420]]]
pipeline vad to asr time:0.48233509063720703 黄陵变
onnx vad to asr time:0.5249941349029541 是的是这知这的黄陵变

filename 20230809173224_1.wav
vad pipeline result {'text': [[940, 6290], [6890, 9080]]}
vad onnx result [[[0, 570], [940, 6260], [6900, 9160]]]
pipeline vad to asr time:0.6725819110870361 黄元变瑞良变庄头运维班庄头无人站渭南变电运维班
onnx vad to asr time:0.8834211826324463 是的黄元变瑞良变庄头运维班庄头无人站渭南变电运维班

filename 20230809160524_1.wav
vad pipeline result {'text': [[1340, 1940]]}
vad onnx result [[[0, 3050]]]
pipeline vad to asr time:0.2459859848022461 现在是
onnx vad to asr time:0.3040289878845215 先改回搬迁再试再看看呗

filename 20230809172843_1.wav
vad pipeline result {'text': [[940, 8450]]}
vad onnx result [[[0, 640], [940, 8450]]]
pipeline vad to asr time:0.44739603996276855 黄陵变王源变瑞良变泰陵变万泉变煤化变渭南监控潼关变
onnx vad to asr time:0.6608140468597412 好的好的我的我的好个黄陵变王源变瑞良变泰陵变万泉变煤化变渭南监控潼关变

filename 20230809173143_1.wav
vad pipeline result {'text': [[640, 7620]]}
vad onnx result [[[0, 540], [820, 7570]]]
pipeline vad to asr time:0.41110992431640625 万泉变煤化变渭南监控西庄无人站桢州无人站潼关变
onnx vad to asr time:0.628960132598877 这个一个这的一c万泉变煤化变渭南监控西庄无人站桢州无人站潼关变

filename 20230809173321_1.wav
vad pipeline result {'text': [[1530, 8460]]}
vad onnx result [[[0, 420], [1470, 8420]]]
pipeline vad to asr time:0.40081286430358887 咸阳监控班大杨运维班池阳变无人站大杨变无人站王源变
onnx vad to asr time:0.6393678188323975 这个这个这个咸阳监控班大杨运维班池阳变无人站大杨变无人站王源变

filename 20230809164759_1.wav
vad pipeline result {'text': [[1160, 6200], [6530, 11470]]}
vad onnx result [[[300, 870], [1160, 6180], [6540, 11500], [12560, 13110]]]
pipeline vad to asr time:0.7299599647521973 王源变瑞良变庄头运维班庄头变无人站后基变无人站渭南变电运维班渭南变无人站
onnx vad to asr time:1.17242431640625 啊啊啊啊啊一啊王源变瑞良变庄头运维班庄头变无人站后基变无人站渭南变电运维班渭南变无人站好的的好的

filename 20230809153921_1.wav
vad pipeline result {'text': [[1740, 3050]]}
vad onnx result [[[0, 550], [1840, 3050]]]
pipeline vad to asr time:0.2721059322357178 汉华风电
onnx vad to asr time:0.4795188903808594 这个意个这个东西汉华风电

filename 20230809160202_1.wav
vad pipeline result {'text': [[1660, 3050]]}
vad onnx result [[[0, 650], [1670, 3050]]]
pipeline vad to asr time:0.26795387268066406 南郊变
onnx vad to asr time:0.4871838092803955 这个这个这这个周台南郊变

filename 20230809141426_1.wav
vad pipeline result {'text': [[290, 3030]]}
vad onnx result [[[0, 3030]]]
pipeline vad to asr time:0.30504918098449707 白土岭巨亭水电西花
onnx vad to asr time:0.30647802352905273 白蒲岭巨亭水电西花水

filename 20230809142224_1.wav
vad pipeline result {'text': [[690, 3040]]}
vad onnx result [[[0, 3040]]]
pipeline vad to asr time:0.28829526901245117 南郊变训善变
onnx vad to asr time:0.2992360591888428 南郊变迅善店

filename 20230809161925_1.wav
vad pipeline result {'text': [[710, 8730]]}
vad onnx result [[[0, 7250], [7530, 8730]]]
pipeline vad to asr time:0.45101428031921387 泰陵变澄县变电运维班万泉变煤化变渭南监控韩城变电
onnx vad to asr time:0.6924371719360352 泰陵变澄县变电运维班万泉变煤化变渭南监控韩城变变为

filename 20230809170806_1.wav
vad pipeline result {'text': [[560, 11410]]}
vad onnx result [[[0, 11400]]]
pipeline vad to asr time:0.5799620151519775 房元变瑞良变庄头运维班庄头变无人站后稷变无人站渭南变电运维班渭南变无人站
onnx vad to asr time:0.567924976348877 黄元变瑞良变庄头运维班庄头变无人站后稷变无人站渭南变变运维班渭南变无人站
```
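To make the discrepancy in the logs above easier to quantify, here is a minimal sketch that lists the ONNX VAD segments with no overlapping pipeline VAD segment, which is where the extra recognized text seems to come from. The function names `overlaps` and `unmatched_segments` are hypothetical helpers, not FunASR APIs; the sample values are copied from the first file in the output.

```python
def overlaps(a, b):
    """Return True if the [start_ms, end_ms] segments a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]


def unmatched_segments(candidate, reference):
    """Segments in `candidate` that overlap no segment in `reference`."""
    return [seg for seg in candidate
            if not any(overlaps(seg, ref) for ref in reference)]


if __name__ == "__main__":
    # Values from 20230809153849_1.wav in the output above
    pipeline_segments = [[1820, 3040]]
    onnx_segments = [[0, 510], [1840, 3040]]
    print(unmatched_segments(onnx_segments, pipeline_segments))  # [[0, 510]]
```

This only flags segments that are entirely extra. Cases where the ONNX VAD returns one long segment starting at 0 (for example `[[0, 3050]]` against `[[1340, 1940]]`) overlap the pipeline segment and would need a boundary comparison instead.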

fatmop, Sep 12 '23 23:09