sherpa-onnx
Mismatched diarization results between pyannote/speaker-diarization-3.0 and k2-fsa/speaker-diarization
I attempted to diarize an audio clip using the same models in both pipelines, but I obtained different results. Is this a known issue related to the ONNX format, or did I make a mistake in my process?
I have checked the pipeline of pyannote/speaker-diarization-3.0 and selected the same models as those provided in sherpa-onnx.
How to reproduce
pyannote/speaker-diarization-3.0
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="change_to_your_huggingface_token")
diarization = pipeline("ck-interview-mono.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
Output
start=0.0s stop=5.2s speaker_SPEAKER_00
start=6.0s stop=23.0s speaker_SPEAKER_00
start=23.8s stop=33.2s speaker_SPEAKER_00
start=33.3s stop=41.4s speaker_SPEAKER_00
start=42.2s stop=43.0s speaker_SPEAKER_00
start=43.0s stop=48.0s speaker_SPEAKER_01
start=48.7s stop=50.2s speaker_SPEAKER_01
start=50.5s stop=61.9s speaker_SPEAKER_01
start=62.2s stop=71.3s speaker_SPEAKER_01
start=71.5s stop=72.0s speaker_SPEAKER_00
start=71.9s stop=72.7s speaker_SPEAKER_01
start=73.5s stop=74.6s speaker_SPEAKER_00
k2-fsa/speaker-diarization
Ran on https://huggingface.co/spaces/k2-fsa/speaker-diarization with the following settings:

- Speaker embedding model: wespeaker_en_voxceleb_resnet34_LM.onnx (26 MB)
- Speaker segmentation model: pyannote/segmentation-3.0
- Number of speakers: 2
Output
0.031 -- 5.228 speaker_00
6.038 -- 23.048 speaker_00
23.825 -- 32.971 speaker_00
33.562 -- 41.375 speaker_00
42.151 -- 47.990 speaker_00
48.732 -- 72.728 speaker_00
73.522 -- 74.602 speaker_00
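(Editorial note for anyone reproducing this locally rather than on the Space: the sketch below uses sherpa-onnx's offline speaker diarization Python API with the same segmentation and embedding models. The local model paths and the soundfile-based loading are assumptions, not part of the original report.)

import sherpa_onnx
import soundfile as sf

# Paths are placeholders; download the ONNX models locally first.
config = sherpa_onnx.OfflineSpeakerDiarizationConfig(
    segmentation=sherpa_onnx.OfflineSpeakerSegmentationModelConfig(
        pyannote=sherpa_onnx.OfflineSpeakerSegmentationPyannoteModelConfig(
            model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx"
        ),
    ),
    embedding=sherpa_onnx.SpeakerEmbeddingExtractorConfig(
        model="./wespeaker_en_voxceleb_resnet34_LM.onnx"
    ),
    clustering=sherpa_onnx.FastClusteringConfig(num_clusters=2),
)
sd = sherpa_onnx.OfflineSpeakerDiarization(config)

# The diarizer expects mono float32 samples at sd.sample_rate (16 kHz).
audio, sample_rate = sf.read("ck-interview-mono.wav", dtype="float32", always_2d=True)
audio = audio[:, 0]
assert sample_rate == sd.sample_rate

for r in sd.process(audio).sort_by_start_time():
    print(f"{r.start:.3f} -- {r.end:.3f} speaker_{r.speaker:02}")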
Additionally, I compared the embedding models using cosine similarity. The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.
from pyannote.audio import Model, Inference
import numpy as np
import sherpa_onnx
from scipy.spatial.distance import cdist

audio_fp = "change_to_your_audio_filepath"

# Embedding from pyannote.audio
model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
inference = Inference(model, window="whole")
embedding_pyannote = inference(audio_fp)

# Embedding from sherpa-onnx, using the same WeSpeaker model in ONNX format
config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(
    model="/Users/kridtaphadsae-khow/.cache/huggingface/hub/models--csukuangfj--speaker-embedding-models/snapshots/0743f301363dec56491a490f6d6cbc9d67f9a3bf/wespeaker_en_voxceleb_resnet34_LM.onnx",
    num_threads=1, debug=True, provider="cpu")
extractor = sherpa_onnx.SpeakerEmbeddingExtractor(config)
audio, sample_rate = read_wave(audio_fp)
stream = extractor.create_stream()
stream.accept_waveform(sample_rate=sample_rate, waveform=audio)
stream.input_finished()  # signal end of input before computing the embedding
embedding_sherpa = np.asarray(extractor.compute(stream))

# Cosine distance between the two embeddings
distance = cdist(np.expand_dims(embedding_pyannote, axis=0),
                 np.expand_dims(embedding_sherpa, axis=0), metric="cosine")
print(distance)
>> array([[0.82130009]])
> The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.
If it is nearly 1, then you cannot say they are almost orthogonal.
Can you share ck-interview-mono.wav?
> > The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.
>
> If it is nearly 0, then you can consider them almost orthogonal.
> If it is nearly 1, then you cannot say they are almost orthogonal.

In the context of the scipy implementation, a value of 1 indicates orthogonality, while 0 signifies parallel vectors: cdist with metric="cosine" returns the cosine distance, not the cosine similarity.
I see what you mean now:

cosine_distance = 1 - similarity_score
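(Editorial note: to make this convention concrete, here is a minimal check, not from the original thread, of how scipy's cosine distance behaves for orthogonal and parallel vectors.)

import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])  # orthogonal to a
c = np.array([2.0, 0.0])  # parallel to a

print(cosine(a, b))  # 1.0 -> similarity 0, i.e. orthogonal
print(cosine(a, c))  # 0.0 -> similarity 1, i.e. parallel

So the 0.821 reported above is a distance; the corresponding similarity is about 1 - 0.821 = 0.179, meaning the two embeddings really do differ substantially.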
> audio, sample_rate = read_wave(audio_fp)

Please show the complete code. What is read_wave?
> > audio, sample_rate = read_wave(audio_fp)
>
> Please show the complete code. What is read_wave?
I used the read_wave function you provided in https://huggingface.co/spaces/k2-fsa/speaker-diarization/blob/main/model.py#L26-L48
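(Editorial note for readers who don't want to follow the link: the helper there is essentially the standard sherpa-onnx example loader, along the lines of the sketch below; it is not copied verbatim from the Space.)

import wave
import numpy as np

def read_wave(wave_filename):
    """Read a 16-bit mono WAV file and return (samples, sample_rate),
    where samples is a 1-D float32 array normalized to [-1, 1]."""
    with wave.open(wave_filename) as f:
        assert f.getnchannels() == 1, f.getnchannels()
        assert f.getsampwidth() == 2, f.getsampwidth()  # 2 bytes = 16 bit
        num_samples = f.getnframes()
        samples = f.readframes(num_samples)
        samples_int16 = np.frombuffer(samples, dtype=np.int16)
        samples_float32 = samples_int16.astype(np.float32) / 32768
        return samples_float32, f.getframerate()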
@csukuangfj any update?
Sorry, will check it after the Chinese New Year.
Any progress so far? I've been testing speaker-diarization with crosstalk audio and it's not working well either
> Any progress so far? I've been testing speaker-diarization with crosstalk audio and it's not working well either
They released new models and claim better handling of overlapping speech. I converted their segmentation model to ONNX, but I can't manage to convert the identification (embedding) model: https://huggingface.co/pyannote/speaker-diarization-community-1. FYI: https://github.com/pyannote/pyannote-audio/discussions/1929
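(Editorial note: exporting a pyannote segmentation checkpoint is typically a plain torch.onnx.export of the underlying PyTorch module. The sketch below is an illustration only; the checkpoint name, tensor names, shapes, and opset are assumptions, not an official export script.)

import torch
from pyannote.audio import Model

# Hypothetical export sketch; swap in the community-1 checkpoint as needed.
model = Model.from_pretrained("pyannote/segmentation-3.0")
model.eval()

# (batch, channel, samples): 10 s of 16 kHz mono audio as a dummy input
dummy = torch.rand(1, 1, 160000)

torch.onnx.export(
    model,
    dummy,
    "segmentation.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "N", 2: "T"}, "y": {0: "N", 1: "T"}},
    opset_version=13,
)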