
Add pyannote VAD (segmentation) model

thewh1teagle opened this issue on Jul 31, 2024 · 6 comments

I would like to use sherpa-onnx for speaker diarization. However, the current VAD model (silero) doesn't work well and doesn't detect speech correctly. I tried another ONNX model, pengzhendong/pyannote-onnx, and it detects speech much better. It's also ONNX-based. Can we add this model to sherpa-onnx?

thewh1teagle avatar Jul 31 '24 15:07 thewh1teagle

Would you like to contribute?

csukuangfj avatar Aug 01 '24 02:08 csukuangfj

> Would you like to contribute?

Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it would work similarly to the silero VAD implementation?

thewh1teagle avatar Aug 01 '24 11:08 thewh1teagle

> Would you like to contribute?
>
> Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it would work similarly to the silero VAD implementation?

Ok, we can take a look, but not this week. It may take some time to add it.

csukuangfj avatar Aug 02 '24 04:08 csukuangfj

> Ok, we can take a look, but not this week. It may take some time to add it.

Meanwhile, I created a basic implementation in Python. It looks accurate:

# python3 -m venv venv 
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py

import onnxruntime as ort
import librosa
import numpy as np

def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess

def read_wav(path: str):
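    # librosa.load returns mono float32 samples resampled to 16 kHz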
    return librosa.load(path, sr=16000)

if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')
    
    # Frame geometry from the model's SincNet front-end (Conv1d / MaxPool1d stack):
    # https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html
    # https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
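    # (These constants appear to follow from the receptive-field math of the
    # conv/pool stack above: a hop of 270 samples (~17 ms at 16 kHz) between
    # output frames, with the first frame offset 721 samples into the signal.)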
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10 # 10s

    # State and offset
    is_speeching = False
    offset = frame_start
    start_offset = 0

    # Pad end with silence for full last segment
    samples = np.pad(samples, (0, window_size), 'constant') 

    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
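        # The model expects input of shape (batch, channel, samples);
        # [None, None, :] adds the two leading singleton dimensions.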
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            predicted_id = np.argmax(probs)
            if predicted_id != 0:
                if not is_speeching:
                    start_offset = offset
                    is_speeching = True
            elif is_speeching:
                start_sec = round(start_offset / sample_rate, 3)
                end_sec = round(offset / sample_rate, 3)
                print(f'{start_sec}s - {end_sec}s')
                is_speeching = False
            offset += frame_size

thewh1teagle avatar Aug 02 '24 19:08 thewh1teagle

@thewh1teagle, how accurate is it? Could you do me a favor and test the sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav file with your code and paste all the "start-end" pairs here? I'm comparing syllable-segmentation tools in this issue: https://github.com/k2-fsa/sherpa-onnx/issues/920#issuecomment-2306056642

wget https://github.com/k2-fsa/sherpa-onnx/releases/download/kws-models/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2

diyism avatar Sep 13 '24 15:09 diyism

Sorry, I misunderstood. The segmentation-3.0.onnx is not a syllable-level VAD; it can only detect the beginning and end of a sentence: [screenshot]

The thetaOscillator syllable segmentation is better (but still not good enough): https://github.com/k2-fsa/sherpa-onnx/issues/920#issuecomment-2300415649

diyism avatar Sep 14 '24 08:09 diyism

Fixed in the latest master

csukuangfj avatar Oct 09 '24 09:10 csukuangfj

@thewh1teagle @csukuangfj I was wrong: the pyannote segmentation-3.0.onnx can indeed segment syllables (Mandarin pinyin). For sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav, it segments the first 7 syllables well, but the last 5 syllables are not as accurate: https://github.com/diyism/pyannote_segment_syllables

$ git clone https://github.com/diyism/pyannote_segment_syllables
$ cd pyannote_segment_syllables/
$ python main.py sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav
Found 12 syllables:
0.560s - 0.742s
0.742s - 1.066s
1.066s - 1.298s
1.645s - 1.920s
2.035s - 2.203s
2.203s - 2.470s
2.555s - 2.725s
2.725s - 2.960s
3.150s - 3.250s
3.250s - 3.475s
3.550s - 3.760s
3.760s - 3.975s
Saved syllable 001: 0.560s - 0.742s (duration: 0.182s)
Saved syllable 002: 0.742s - 1.066s (duration: 0.324s)
Saved syllable 003: 1.066s - 1.298s (duration: 0.232s)
Saved syllable 004: 1.645s - 1.920s (duration: 0.275s)
Saved syllable 005: 2.035s - 2.203s (duration: 0.167s)
Saved syllable 006: 2.203s - 2.470s (duration: 0.267s)
Saved syllable 007: 2.555s - 2.725s (duration: 0.170s)
Saved syllable 008: 2.725s - 2.960s (duration: 0.235s)
Saved syllable 009: 3.150s - 3.250s (duration: 0.100s)
Saved syllable 010: 3.250s - 3.475s (duration: 0.225s)
Saved syllable 011: 3.550s - 3.760s (duration: 0.210s)
Saved syllable 012: 3.760s - 3.975s (duration: 0.215s)

$ aplay syllables/001.wav
$ aplay syllables/002.wav
$ aplay syllables/003.wav

Maybe you can help me improve it.

I suspect that since segmentation-3.0.onnx can segment syllables (Mandarin pinyin), a very small model could recognize all ~1300 mono-syllable pinyins; segmentation-3.0.onnx itself is only 5.8 MB.

And if this tool gets improved, maybe it can help build a streamlined custom training process for sherpa-onnx-kws (https://github.com/k2-fsa/sherpa-onnx/issues/1371), so that users only need to record their own voices (covering all ~1300 pinyins) to train a custom model.

diyism avatar Nov 03 '24 19:11 diyism

@diyism I recommend running your tests with the pyannote-audio library. If the issues persist, it's likely a problem with the model itself; at the very least you can open an issue there or investigate it further.
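
For reference, a minimal sketch of such a test, following the usage shown on the pyannote/segmentation-3.0 model card (the Hugging Face token, file name, and hyperparameter values are placeholders):

# pip install pyannote.audio
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

# YOUR_HF_TOKEN is a placeholder; the model requires accepting its
# license on Hugging Face first.
model = Model.from_pretrained('pyannote/segmentation-3.0', use_auth_token='YOUR_HF_TOKEN')

pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({
    'min_duration_on': 0.0,   # drop speech regions shorter than this (seconds)
    'min_duration_off': 0.0,  # fill non-speech gaps shorter than this (seconds)
})

vad = pipeline('sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav')
for segment in vad.get_timeline().support():
    print(f'{segment.start:.3f}s - {segment.end:.3f}s')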

thewh1teagle avatar Nov 03 '24 19:11 thewh1teagle