Bugs of VAD and speaker model for English Audio

Open ruifengma opened this issue 1 year ago • 0 comments

I am using the following model combination for English speech recognition with speaker classification:

from funasr import AutoModel

# English ASR model combined with the zh-cn VAD, punctuation and speaker models
funasr_model = AutoModel(
    model="iic/speech_paraformer_asr-en-16k-vocab4199-pytorch",
    vad_model="damo/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    punc_model="damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch",
    spk_model="damo/speech_campplus_sv_zh-cn_16k-common",
)

I ran into the following problems:

  1. The audio contains only 2 speakers, but the model outputs 5 speakers (spk0, spk1, spk2, spk3 and spk5).
  2. The speaker labels are not contiguous: there are 5 speakers, yet spk4 is missing and spk5 is present.
  3. The VAD model always puts several speakers into one chunk.
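
A minimal sketch for quantifying problems 1 and 2 is given below. It assumes each recognized sentence is available as a dict with an integer `spk` field; the sample data and both helper names (`summarize_speakers`, `relabel_contiguous`) are hypothetical and only illustrate the check, they are not part of FunASR.

```python
from collections import Counter


def summarize_speakers(sentences):
    """Count sentences per speaker label and list unused labels below the maximum."""
    counts = Counter(s["spk"] for s in sentences)
    if not counts:
        return counts, []
    missing = [i for i in range(max(counts) + 1) if i not in counts]
    return counts, missing


def relabel_contiguous(sentences):
    """Return copies of the sentences with labels remapped to 0..N-1
    in order of first appearance (cosmetic only, clusters are unchanged)."""
    mapping = {}
    relabeled = []
    for s in sentences:
        # len(mapping) is evaluated before insertion, so new labels get the next free id
        new_label = mapping.setdefault(s["spk"], len(mapping))
        relabeled.append({**s, "spk": new_label})
    return relabeled


# Hypothetical sample mimicking the labels reported above (spk0-spk3 and spk5)
sample = [
    {"text": "hello", "spk": 0},
    {"text": "hi there", "spk": 1},
    {"text": "how are you", "spk": 0},
    {"text": "fine", "spk": 5},
    {"text": "good", "spk": 3},
    {"text": "ok", "spk": 2},
]

counts, missing = summarize_speakers(sample)
print("sentences per speaker:", dict(counts))
print("missing labels:", missing)
print("relabeled:", [s["spk"] for s in relabel_contiguous(sample)])
```

Relabeling only hides the gap in the numbering. It does not merge the extra speakers, so the over-segmentation in problem 1 would still need to be addressed in the speaker clustering itself.
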
  • OS (e.g., Linux): Cent OS7
  • FunASR Version (e.g., 1.0.0): 1.1.6
  • ModelScope Version (e.g., 1.11.0): 1.18.1
  • PyTorch Version (e.g., 2.0.0): 2.4.1
  • How you installed funasr (pip, source): pip
  • Python version:
  • GPU (e.g., V100M32): A40
  • CUDA/cuDNN version (e.g., cuda11.7): cuda12.2
  • Docker version (e.g., funasr-runtime-sdk-cpu-0.4.1)
  • Any other relevant information:

ruifengma, Sep 27 '24 07:09