
Mismatched diarization results between pyannote/speaker-diarization-3.0 and k2-fsa/speaker-diarization

Open takipipo opened this issue 10 months ago • 12 comments

I attempted to diarize the same audio clip with the same models but obtained different results. Is this a known issue related to the ONNX format, or did I make a mistake in my process?

I have checked the pipeline of pyannote/speaker-diarization-3.0 and selected the same models as those provided in sherpa-onnx.

How to reproduce

pyannote/speaker-diarization-3.0

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.0",
  use_auth_token="change_to_your_huggingface_token")

diarization = pipeline("ck-interview-mono.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

Output

start=0.0s stop=5.2s speaker_SPEAKER_00
start=6.0s stop=23.0s speaker_SPEAKER_00
start=23.8s stop=33.2s speaker_SPEAKER_00
start=33.3s stop=41.4s speaker_SPEAKER_00
start=42.2s stop=43.0s speaker_SPEAKER_00
start=43.0s stop=48.0s speaker_SPEAKER_01
start=48.7s stop=50.2s speaker_SPEAKER_01
start=50.5s stop=61.9s speaker_SPEAKER_01
start=62.2s stop=71.3s speaker_SPEAKER_01
start=71.5s stop=72.0s speaker_SPEAKER_00
start=71.9s stop=72.7s speaker_SPEAKER_01
start=73.5s stop=74.6s speaker_SPEAKER_00

k2-fsa/speaker-diarization

Ran on https://huggingface.co/spaces/k2-fsa/speaker-diarization

  1. speaker embedding model: wespeaker_en_voxceleb_resnet34_LM.onnx (26 MB)
  2. speaker segmentation model: pyannote/segmentation-3.0
  3. Number of speakers: 2
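For completeness, the same setup can also be run locally with the sherpa-onnx Python API instead of the Hugging Face space. The sketch below follows sherpa-onnx's offline speaker-diarization example; the model paths are placeholders, and the exact config class names should be checked against the installed sherpa-onnx version:

```python
import sherpa_onnx

# Assumed paths: download the pyannote segmentation-3.0 ONNX model and the
# wespeaker embedding model first, then point these fields at them.
config = sherpa_onnx.OfflineSpeakerDiarizationConfig(
    segmentation=sherpa_onnx.OfflineSpeakerSegmentationModelConfig(
        pyannote=sherpa_onnx.OfflineSpeakerSegmentationPyannoteModelConfig(
            model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx",  # placeholder
        ),
    ),
    embedding=sherpa_onnx.SpeakerEmbeddingExtractorConfig(
        model="./wespeaker_en_voxceleb_resnet34_LM.onnx",  # placeholder
    ),
    clustering=sherpa_onnx.FastClusteringConfig(num_clusters=2),  # known speaker count
)

sd = sherpa_onnx.OfflineSpeakerDiarization(config)
# audio: 1-D float32 numpy array sampled at sd.sample_rate
# for seg in sd.process(audio).sort_by_start_time():
#     print(f"{seg.start:.3f} -- {seg.end:.3f} speaker_{seg.speaker:02d}")
```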

Output

0.031 -- 5.228 speaker_00
6.038 -- 23.048 speaker_00
23.825 -- 32.971 speaker_00
33.562 -- 41.375 speaker_00
42.151 -- 47.990 speaker_00
48.732 -- 72.728 speaker_00
73.522 -- 74.602 speaker_00

takipipo avatar Jan 14 '25 08:01 takipipo

Additionally, I conducted a comparison of the embedding models using cosine similarity. The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

Cosine Similarity Calculation

from pyannote.audio import Model
from pyannote.audio import Inference
from scipy.spatial.distance import cdist
import numpy as np
import sherpa_onnx

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
inference = Inference(model, window="whole")
audio_fp = "change_to_your_audio_filepath"

embedding_pyannote = inference(audio_fp)

config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(
    model="/Users/kridtaphadsae-khow/.cache/huggingface/hub/models--csukuangfj--speaker-embedding-models/snapshots/0743f301363dec56491a490f6d6cbc9d67f9a3bf/wespeaker_en_voxceleb_resnet34_LM.onnx",
    num_threads=1,
    debug=True,
    provider="cpu",
)
extractor = sherpa_onnx.SpeakerEmbeddingExtractor(config)

audio, sample_rate = read_wave(audio_fp)
stream = extractor.create_stream()
stream.accept_waveform(sample_rate=sample_rate, waveform=audio)
embedding_sherpa = np.asarray(extractor.compute(stream))

distance = cdist(
    np.expand_dims(embedding_pyannote, axis=0),
    np.expand_dims(embedding_sherpa, axis=0),
    metric="cosine",
)
print(distance)
>> array([[0.82130009]])

takipipo avatar Jan 15 '25 06:01 takipipo

The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.

If it is nearly 1, then you cannot say they are almost orthogonal.

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

Can you share ck-interview-mono.wav ?

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

Can you share ck-interview-mono.wav ?

audio clip

takipipo avatar Jan 15 '25 09:01 takipipo

The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.

If it is nearly 1, then you cannot say they are almost orthogonal.

In the context of the scipy implementation, 1 indicates orthogonality, while 0 signifies parallelism.

(screenshot of the SciPy cosine-distance documentation)

takipipo avatar Jan 15 '25 09:01 takipipo

I see what you mean now

cosine_distance = 1 - similarity_score
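A quick numeric check of this relationship, using SciPy's `cdist` as in the snippet earlier in the thread (the toy vectors are my own illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

# SciPy's "cosine" metric returns a *distance*: 1 - cosine similarity
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])   # orthogonal to a
c = np.array([[3.0, 0.0]])   # parallel to a (different magnitude)

print(cdist(a, b, metric="cosine"))  # [[1.]] -> orthogonal
print(cdist(a, c, metric="cosine"))  # [[0.]] -> parallel

# The reported distance of ~0.8213 therefore corresponds to a cosine
# similarity of ~0.1787, i.e. the two embeddings are far from parallel.
similarity = 1.0 - 0.82130009
print(round(similarity, 4))  # 0.1787
```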

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

audio, sample_rate = read_wave(audio_fp)

Please show the complete code.

what is read_wave?

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

audio, sample_rate = read_wave(audio_fp)

Please show the complete code.

what is read_wave?

I used the read_wave as you provided in the https://huggingface.co/spaces/k2-fsa/speaker-diarization/blob/main/model.py#L26-L48
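For reference, a minimal sketch of what that helper does (my paraphrase of the linked model.py, not a verbatim copy): it reads a 16-bit mono WAV with the stdlib wave module and scales samples to float32 in [-1, 1], which is the range accept_waveform() expects.

```python
import wave

import numpy as np


def read_wave(wave_filename: str):
    """Read a 16-bit mono WAV file; return (samples_float32, sample_rate)."""
    with wave.open(wave_filename) as f:
        assert f.getnchannels() == 1, "expect a single-channel wav"
        assert f.getsampwidth() == 2, "expect 16-bit samples"
        num_samples = f.getnframes()
        raw = f.readframes(num_samples)
        samples_int16 = np.frombuffer(raw, dtype=np.int16)
        # normalize int16 -> float32 in [-1, 1]
        samples_float32 = samples_int16.astype(np.float32) / 32768.0
        return samples_float32, f.getframerate()
```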

takipipo avatar Jan 15 '25 09:01 takipipo

@csukuangfj any update?

takipipo avatar Jan 23 '25 02:01 takipipo

Sorry, will check it after the Chinese New Year.

csukuangfj avatar Jan 23 '25 14:01 csukuangfj

Any progress so far? I've been testing speaker-diarization with crosstalk audio and it's not working well either

ITCZhuxy avatar Sep 15 '25 09:09 ITCZhuxy

Any progress so far? I've been testing speaker-diarization with crosstalk audio and it's not working well either

They released new models and claim better handling of overlapped speech. I converted their segmentation model to ONNX, but I can't manage to convert the identification model: https://huggingface.co/pyannote/speaker-diarization-community-1 FYI: https://github.com/pyannote/pyannote-audio/discussions/1929

altunenes avatar Oct 01 '25 08:10 altunenes