sherpa-onnx
Add pyannote vad (segmentation) model
I would like to use sherpa-onnx for speaker diarization. However, the current VAD model (silero) doesn't work well and doesn't detect speech correctly. I tried another ONNX model from the pengzhendong/pyannote-onnx project and it detects speech much better. Could this model be added to sherpa-onnx?
Would you like to contribute?
Unfortunately, I haven't worked with onnxruntime before, so I'm not sure how to implement it. I assume it should work similarly to the implementation of silero vad?
Ok, we can take a look, but not this week. It may take some time to add it.
Meanwhile I created a basic implementation in Python. It looks accurate:
# python3 -m venv venv
# source venv/bin/activate
# pip3 install onnxruntime numpy librosa
# wget https://github.com/pengzhendong/pyannote-onnx/raw/master/pyannote_onnx/segmentation-3.0.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/releases/download/v0.1.0/motivation.wav -O test.wav
# python3 main.py
import onnxruntime as ort
import librosa
import numpy as np


def init_session(model_path):
    opts = ort.SessionOptions()
    opts.inter_op_num_threads = 1
    opts.intra_op_num_threads = 1
    opts.log_severity_level = 3
    sess = ort.InferenceSession(model_path, sess_options=opts)
    return sess


def read_wav(path: str):
    return librosa.load(path, sr=16000)


if __name__ == '__main__':
    session = init_session('segmentation-3.0.onnx')
    samples, sample_rate = read_wav('test.wav')

    # Frame hop and receptive-field offset of the SincNet front end.
    # Conv1d & MaxPool1d & SincNet:
    # https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
    # https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html
    # https://github.com/pyannote/pyannote-audio/blob/develop/pyannote/audio/models/blocks/sincnet.py#L50-L71
    frame_size = 270
    frame_start = 721
    window_size = sample_rate * 10  # 10 s

    # State and offset
    is_speaking = False
    offset = frame_start
    start_offset = 0

    # Pad the end with silence so the last window is full
    samples = np.pad(samples, (0, window_size), 'constant')
    for start in range(0, len(samples), window_size):
        window = samples[start:start + window_size]
        ort_outs: np.ndarray = session.run(None, {'input': window[None, None, :]})[0][0]
        for probs in ort_outs:
            # Class 0 is the non-speech class; anything else counts as speech.
            predicted_id = np.argmax(probs)
            if predicted_id != 0:
                if not is_speaking:
                    start_offset = offset
                    is_speaking = True
            elif is_speaking:
                seg_start = round(start_offset / sample_rate, 3)
                seg_end = round(offset / sample_rate, 3)
                print(f'{seg_start}s - {seg_end}s')
                is_speaking = False
            offset += frame_size
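For context on the two constants above, here is a minimal sketch (my own, not part of the original snippet) of how frame_size = 270 appears to fall out of the SincNet front end linked in the code comment, assuming its layers are a sinc conv (kernel 251, stride 10) followed by alternating max-pools (kernel/stride 3) and convs (kernel 5, stride 1); the resulting receptive field of 991 samples is also consistent with frame_start = 991 - 270 = 721:

# Sketch: derive the output hop and receptive field of the assumed SincNet stack.
# The (kernel_size, stride) layout is an assumption based on pyannote-audio's
# sincnet.py, not taken from the post above.
layers = [(251, 10), (3, 3), (5, 1), (3, 3), (5, 1), (3, 3)]

hop, receptive_field = 1, 1
for kernel, stride in layers:
    receptive_field += (kernel - 1) * hop
    hop *= stride

print(hop)              # 270 samples per output frame (~16.9 ms at 16 kHz)
print(receptive_field)  # 991 samples (~62 ms)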
@thewh1teagle, how accurate is it? Could you do me a favor and test the sherpa-onnx/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01/test_wavs/4.wav file with your code and paste all the "start - end" segments here? I'm comparing syllable segmentation tools in this issue: https://github.com/k2-fsa/sherpa-onnx/issues/920#issuecomment-2306056642
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/kws-models/sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01.tar.bz2
Sorry, I misunderstood: segmentation-3.0.onnx is not a syllable-level VAD; it can only detect the beginning and end of a sentence:
The thetaOscillator-syllable-segmentation is better (but still not good enough): https://github.com/k2-fsa/sherpa-onnx/issues/920#issuecomment-2300415649
Fixed in the latest master
@thewh1teagle @csukuangfj I was wrong: the pyannote segmentation-3.0.onnx can indeed segment syllables (Mandarin pinyin). For sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav it segments the first 7 syllables correctly, but the last 5 syllables are not so accurate: https://github.com/diyism/pyannote_segment_syllables
$ git clone https://github.com/diyism/pyannote_segment_syllables
$ cd pyannote_segment_syllables/
$ python main.py sherpa-onnx-kws-zipformer-wenetspeech-3.3M-2024-01-01_test_wavs_4.wav
Found 12 syllables:
0.560s - 0.742s
0.742s - 1.066s
1.066s - 1.298s
1.645s - 1.920s
2.035s - 2.203s
2.203s - 2.470s
2.555s - 2.725s
2.725s - 2.960s
3.150s - 3.250s
3.250s - 3.475s
3.550s - 3.760s
3.760s - 3.975s
Saved syllable 001: 0.560s - 0.742s (duration: 0.182s)
Saved syllable 002: 0.742s - 1.066s (duration: 0.324s)
Saved syllable 003: 1.066s - 1.298s (duration: 0.232s)
Saved syllable 004: 1.645s - 1.920s (duration: 0.275s)
Saved syllable 005: 2.035s - 2.203s (duration: 0.167s)
Saved syllable 006: 2.203s - 2.470s (duration: 0.267s)
Saved syllable 007: 2.555s - 2.725s (duration: 0.170s)
Saved syllable 008: 2.725s - 2.960s (duration: 0.235s)
Saved syllable 009: 3.150s - 3.250s (duration: 0.100s)
Saved syllable 010: 3.250s - 3.475s (duration: 0.225s)
Saved syllable 011: 3.550s - 3.760s (duration: 0.210s)
Saved syllable 012: 3.760s - 3.975s (duration: 0.215s)
$ aplay syllables/001.wav
$ aplay syllables/002.wav
$ aplay syllables/003.wav
Maybe you can help me improve it.
My guess is that since segmentation-3.0.onnx can segment syllables (Mandarin pinyin), a very small model could recognize all 1,300 mono-syllable pinyins; after all, segmentation-3.0.onnx is only 5.8 MB.
And if this tool gets improved, maybe it can help build a streamlined custom training process for sherpa-onnx-kws (https://github.com/k2-fsa/sherpa-onnx/issues/1371), so that users would only need to record their own voices (covering all 1,300 pinyins) to train a custom model.
@diyism I recommend running your tests with the pyannote-audio library. If the issues persist, it's likely a problem with the model itself, or at least you can open an issue there or investigate it further.
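For example, here is a rough sketch of such a cross-check with pyannote.audio (this assumes pyannote.audio is installed, you have a Hugging Face token with access to pyannote/segmentation-3.0, and the min_duration_on/off hyperparameters of the VoiceActivityDetection pipeline; adjust as needed):

# Sketch: run the same audio through pyannote.audio's VAD pipeline for comparison.
# pip install pyannote.audio
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token="YOUR_HF_TOKEN")
pipeline = VoiceActivityDetection(segmentation=model)
# How short a speech segment / silence gap may be before it is removed or filled
pipeline.instantiate({"min_duration_on": 0.0, "min_duration_off": 0.0})

vad = pipeline("test.wav")
for segment in vad.get_timeline().support():
    print(f"{segment.start:.3f}s - {segment.end:.3f}s")

If the segment boundaries from pyannote.audio look similar to those from the ONNX export, the behaviour is inherent to the model rather than to the export or the ONNX inference code.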