Performance Concern: VAD Processing Time Long for 22-minute Mono Audio
OS: Linux
Python/C++ Version: Python 3.7
Package Version:
Model: damo/speech_fsmn_vad_zh-cn-16k-common-pytorch
Details:
```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    batch_size=64,
)

audio_in = '/home/FunASR/test_audio/20230807-142709-8026-018529420615-1691389629.668644.wav'
rec_result = inference_pipeline(audio_in=audio_in)
print(rec_result)
```
While using the VAD model, I've observed that it takes approximately 11 seconds to process a 22-minute mono audio file. Below are the timing metrics from my run:
- Time taken by VAD: 11.138062238693237 seconds (Batch (VAD): 117)
- Time taken by ASR: 0.9648573398590088 seconds (Batch (ASR): 6)
- Time taken by another ASR step: 0.3074047565460205 seconds
- Time taken for punctuation: 0.3600809574127197 seconds

Compared to the other processing steps, the time taken by VAD is noticeably longer. I'm looking to understand if there are ways to optimize this or if there might be an issue with the way I'm using it.
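For reference, this is a minimal sketch of how I isolate and time the VAD stage, assuming the standalone ModelScope VAD pipeline (`Tasks.voice_activity_detection`) accepts the same model ID and audio path; the variable names here are just for illustration:

```python
import time

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

# Run only the FSMN VAD model on the same file to check whether
# VAD alone accounts for the ~11 s observed in the full pipeline.
vad_pipeline = pipeline(
    task=Tasks.voice_activity_detection,
    model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
)

audio_in = '/home/FunASR/test_audio/20230807-142709-8026-018529420615-1691389629.668644.wav'

start = time.perf_counter()
segments = vad_pipeline(audio_in=audio_in)  # detected speech segment timestamps
elapsed = time.perf_counter() - start

print(f'VAD-only time: {elapsed:.2f} s')
print(segments)
```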