Performance Concern: VAD Processing Time Long for 22-minute Mono Audio

Open JaysonGeng opened this issue 1 year ago • 0 comments

OS: Linux

Python/C++ Version: Python 3.7

Package Version:

Model: damo/speech_fsmn_vad_zh-cn-16k-common-pytorch

Details:

```python
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    batch_size=64,
)
audio_in = '/home/FunASR/test_audio/20230807-142709-8026-018529420615-1691389629.668644.wav'
rec_result = inference_pipeline(audio_in=audio_in)
print(rec_result)
```

While using the VAD model, I've observed that it takes approximately 11 seconds to process a 22-minute mono audio file. Below are the timings for each processing step:

Time taken by VAD: 11.138062238693237 seconds

Batch (VAD): 117

Time taken by ASR: 0.9648573398590088 seconds

Batch (ASR): 6

Time taken by another ASR step: 0.3074047565460205 seconds

Time taken for punctuation: 0.3600809574127197 seconds

Compared to the other processing steps, the time taken by VAD is noticeably longer. I'm looking to understand if there are ways to optimize this or if there might be an issue with the way I'm using it.
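To put the numbers above in perspective, here is a rough sketch that turns them into a real-time factor (processing time divided by audio duration) and a share of total time per stage. It assumes the audio is exactly 22 minutes (1320 seconds), since the precise duration isn't stated; the stage names and the `real_time_factor` helper are just illustrative labels, not anything from FunASR or ModelScope.

```python
# Timings copied from the measurements reported above.
# Assumption: the audio is exactly 22 minutes; the true duration may differ slightly.
audio_seconds = 22 * 60

stage_seconds = {
    "vad": 11.138062238693237,
    "asr": 0.9648573398590088,
    "asr_second_step": 0.3074047565460205,
    "punctuation": 0.3600809574127197,
}


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; lower means faster."""
    return processing_seconds / audio_seconds


total_seconds = sum(stage_seconds.values())

for stage, seconds in stage_seconds.items():
    rtf = real_time_factor(seconds, audio_seconds)
    share = seconds / total_seconds
    print(f"{stage:16s} {seconds:8.3f} s  RTF={rtf:.5f}  share={share:.1%}")
```

Under that assumption, VAD is well below real time in absolute terms, but it accounts for most of the measured pipeline time, which is what prompts the question.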

JaysonGeng, Sep 06 '23 10:09