FunASR
Bugs in the VAD and speaker models for English audio
I use the following model composition for English speech recognition with speaker diarization:
```python
funasr_model = AutoModel(
    model="iic/speech_paraformer_asr-en-16k-vocab4199-pytorch",
    vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
)
```
I ran into the following issues:
- The audio contains only 2 speakers, but the model outputs 5 (spk0, spk1, spk2, spk3, and spk5)
- Note that among those 5 speakers there is no spk4, but there is a spk5
- The VAD model often puts several speakers into a single chunk
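As a post-processing workaround for the non-contiguous speaker ids (spk5 appearing without spk4), the labels can be remapped to contiguous indices after inference. This is a minimal sketch, not part of FunASR; the `segments` structure (a list of dicts with a `spk` key) is a hypothetical simplification of the diarization output:

```python
def remap_speakers(segments):
    """Remap non-contiguous speaker labels (e.g. spk0..spk3, spk5)
    to contiguous ones (spk0..spk4), preserving relative order.

    segments: list of dicts, each with a 'spk' key (hypothetical schema).
    """
    # Collect the distinct labels in sorted order, then assign new
    # contiguous indices in that order.
    labels = sorted({seg["spk"] for seg in segments})
    mapping = {old: f"spk{i}" for i, old in enumerate(labels)}
    return [{**seg, "spk": mapping[seg["spk"]]} for seg in segments]

# Example with a gap in the labels (no spk4, but a spk5):
segs = [
    {"spk": "spk0", "text": "hello"},
    {"spk": "spk5", "text": "hi there"},
    {"spk": "spk3", "text": "okay"},
]
print(remap_speakers(segs))
# → spk0 stays spk0, spk3 becomes spk1, spk5 becomes spk2
```

This does not fix the over-segmentation itself (the model still predicts too many speakers); it only normalizes the label space, which helps when downstream code assumes ids are contiguous.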
- OS (e.g., Linux): CentOS 7
- FunASR Version (e.g., 1.0.0): 1.1.6
- ModelScope Version (e.g., 1.11.0): 1.18.1
- PyTorch Version (e.g., 2.0.0): 2.4.1
- How you installed funasr (pip, source): pip
- Python version:
- GPU (e.g., V100M32): A40
- CUDA/cuDNN version (e.g., cuda11.7): cuda12.2
- Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
- Any other relevant information: