
How can FSMN VAD precisely locate sentence endpoints?

Open · bigmisspanda opened this issue 1 year ago · 1 comment

In my audio, the gap between the teacher's question and the student's answer is about one second. I reduced max_end_silence_time to 500 ms to try to pin down sentence endpoints precisely, but it had no effect: the teacher's and the student's speech still cannot be separated cleanly. What other settings could I try? The vad_model configuration is as follows:

```yaml
frontend: WavFrontendOnline
frontend_conf:
    fs: 16000
    window: hamming
    n_mels: 80
    frame_length: 25
    frame_shift: 10
    dither: 0.0
    lfr_m: 5
    lfr_n: 1

model: FsmnVADStreaming
model_conf:
    sample_rate: 16000
    detect_mode: 1
    snr_mode: 0
    max_end_silence_time: 500
    max_start_silence_time: 3000
    do_start_point_detection: True
    do_end_point_detection: True
    window_size_ms: 200
    sil_to_speech_time_thres: 150
    speech_to_sil_time_thres: 150
    speech_2_noise_ratio: 1.0
    do_extend: 1
    lookback_time_start_point: 200
    lookahead_time_end_point: 100
    max_single_segment_time: 60000
    snr_thres: -100.0
    noise_frame_num_used_for_snr: 100
    decibel_thres: -100.0
    speech_noise_thres: 0.6
    fe_prior_thres: 0.0001
    silence_pdf_num: 1
    sil_pdf_ids: [0]
    speech_noise_thresh_low: -0.1
    speech_noise_thresh_high: 0.3
    output_frame_probs: False
    frame_in_ms: 10
    frame_length_ms: 25

encoder: FSMN
encoder_conf:
    input_dim: 400
    input_affine_dim: 140
    fsmn_layers: 4
    linear_dim: 250
    proj_dim: 128
    lorder: 20
    rorder: 0
    lstride: 1
    rstride: 0
    output_affine_dim: 140
    output_dim: 248
```
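One way to check whether the lowered threshold actually takes effect is to run the VAD model by itself and inspect the segment boundaries it emits. Below is a minimal sketch using FunASR's AutoModel; passing max_end_silence_time as a keyword override is an assumption (if it is ignored, edit the value in the model's downloaded config.yaml instead), and the local wav path is the one from the question:

```python
from funasr import AutoModel

# Load only the FSMN VAD model. max_end_silence_time here is an
# assumed config override; fall back to editing config.yaml if ignored.
vad = AutoModel(
    model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    model_revision="v2.0.4",
    max_end_silence_time=500,
)

res = vad.generate(input="/home/STAna/stana/file/c28ab37273fc9a91b7722b963b320aff.wav")
# res[0]["value"] is a list of [start_ms, end_ms] segments; a ~1 s pause
# between teacher and student should now yield separate segments.
print(res[0]["value"])
```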

The code is as follows:

```python
import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1,"
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

if __name__ == '__main__':
    # Public demo audio (unused below) and the local recording under test
    audio_in = 'https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/asr_speaker_demo.wav'
    audio_in2 = '/home/STAna/stana/file/c28ab37273fc9a91b7722b963b320aff.wav'
    output_dir = "./results"
    # ASR pipeline with VAD, punctuation, and speaker (cam++) models
    inference_pipeline = pipeline(
        task=Tasks.auto_speech_recognition,
        model='iic/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn',
        model_revision='v2.0.4',
        vad_model='iic/speech_fsmn_vad_zh-cn-16k-common-pytorch',
        vad_model_revision='v2.0.4',
        punc_model='iic/punc_ct-transformer_cn-en-common-vocab471067-large',
        punc_model_revision='v2.0.4',
        spk_model='iic/speech_campplus_sv_zh-cn_16k-common',
        spk_model_revision='v2.0.2',
        output_dir=output_dir,
    )
    rec_result = inference_pipeline(audio_in2, batch_size_s=300, batch_size_token_threshold_s=40)
    print(rec_result)
```
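Since the pipeline loads a speaker-verification model, each recognized sentence should carry a speaker label even when VAD does not split on every pause. Below is a hedged sketch of separating teacher and student by that label, assuming the result follows FunASR's sentence_info layout (keys text, start, end, spk); field names can vary across versions:

```python
# Normalize: some versions return a list of result dicts
result = rec_result[0] if isinstance(rec_result, list) else rec_result

# Print each sentence with its speaker label so teacher/student turns
# can be told apart even when VAD merges them into one segment.
for sent in result.get("sentence_info", []):
    print(f"[spk {sent.get('spk')}] {sent.get('start')}-{sent.get('end')} ms: {sent.get('text')}")
```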



— bigmisspanda, Mar 21 '24

Try setting `max_end_silence_time=100`.

— LauraGPT, Mar 22 '24
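For later readers: one way to apply this suggestion is through FunASR's AutoModel, whose documented vad_kwargs dict is forwarded to the VAD model. Whether max_end_silence_time is honored there is an assumption; if not, change the value in the VAD model's downloaded config.yaml. A sketch:

```python
from funasr import AutoModel

# vad_kwargs is a documented AutoModel argument; passing
# max_end_silence_time through it is an assumption here.
model = AutoModel(
    model="iic/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn",
    vad_model="iic/speech_fsmn_vad_zh-cn-16k-common-pytorch",
    vad_kwargs={"max_end_silence_time": 100},
    punc_model="iic/punc_ct-transformer_cn-en-common-vocab471067-large",
    spk_model="iic/speech_campplus_sv_zh-cn_16k-common",
)
res = model.generate(input="/home/STAna/stana/file/c28ab37273fc9a91b7722b963b320aff.wav")
print(res)
```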