
Mismatched diarization results between pyannote/speaker-diarization-3.0 and k2-fsa/speaker-diarization

Open takipipo opened this issue 10 months ago • 12 comments

I attempted to diarize the same audio clip with the same models but obtained different results. Is this a known issue related to the ONNX format, or did I make a mistake in my process?

I have checked the pipeline of pyannote/speaker-diarization-3.0 and selected the same models as those provided in sherpa-onnx.

How to reproduce

pyannote/speaker-diarization-3.0

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
  "pyannote/speaker-diarization-3.0",
  use_auth_token="change_to_your_huggingface_token")

diarization = pipeline("ck-interview-mono.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

Output

start=0.0s stop=5.2s speaker_SPEAKER_00
start=6.0s stop=23.0s speaker_SPEAKER_00
start=23.8s stop=33.2s speaker_SPEAKER_00
start=33.3s stop=41.4s speaker_SPEAKER_00
start=42.2s stop=43.0s speaker_SPEAKER_00
start=43.0s stop=48.0s speaker_SPEAKER_01
start=48.7s stop=50.2s speaker_SPEAKER_01
start=50.5s stop=61.9s speaker_SPEAKER_01
start=62.2s stop=71.3s speaker_SPEAKER_01
start=71.5s stop=72.0s speaker_SPEAKER_00
start=71.9s stop=72.7s speaker_SPEAKER_01
start=73.5s stop=74.6s speaker_SPEAKER_00

k2-fsa/speaker-diarization

Ran on https://huggingface.co/spaces/k2-fsa/speaker-diarization

  1. speaker embedding model: wespeaker_en_voxceleb_resnet34_LM.onnx (26 MB)
  2. speaker segmentation model: pyannote/segmentation-3.0
  3. Number of speakers: 2
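For completeness, the same setup can also be run locally with the sherpa-onnx Python API instead of the Hugging Face space. The sketch below follows sherpa-onnx's offline speaker-diarization example; the model paths are placeholders, and the exact config class names should be checked against the installed sherpa-onnx version:

```python
import sherpa_onnx

# Assumed paths: download the pyannote segmentation-3.0 ONNX model and the
# wespeaker embedding model first, then point these fields at them.
config = sherpa_onnx.OfflineSpeakerDiarizationConfig(
    segmentation=sherpa_onnx.OfflineSpeakerSegmentationModelConfig(
        pyannote=sherpa_onnx.OfflineSpeakerSegmentationPyannoteModelConfig(
            model="./sherpa-onnx-pyannote-segmentation-3-0/model.onnx",  # placeholder
        ),
    ),
    embedding=sherpa_onnx.SpeakerEmbeddingExtractorConfig(
        model="./wespeaker_en_voxceleb_resnet34_LM.onnx",  # placeholder
    ),
    clustering=sherpa_onnx.FastClusteringConfig(num_clusters=2),  # known speaker count
)

sd = sherpa_onnx.OfflineSpeakerDiarization(config)
# audio: 1-D float32 numpy array sampled at sd.sample_rate
# for seg in sd.process(audio).sort_by_start_time():
#     print(f"{seg.start:.3f} -- {seg.end:.3f} speaker_{seg.speaker:02d}")
```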

Output

0.031 -- 5.228 speaker_00
6.038 -- 23.048 speaker_00
23.825 -- 32.971 speaker_00
33.562 -- 41.375 speaker_00
42.151 -- 47.990 speaker_00
48.732 -- 72.728 speaker_00
73.522 -- 74.602 speaker_00

takipipo avatar Jan 14 '25 08:01 takipipo

Additionally, I conducted a comparison of the embedding models using cosine similarity. The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

Cosine Similarity Calculation

from pyannote.audio import Model
from pyannote.audio import Inference
from scipy.spatial.distance import cdist
import numpy as np
import sherpa_onnx

model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")
inference = Inference(model, window="whole")
audio_fp = "change_to_your_audio_filepath"

embedding_pyannote = inference(audio_fp)

config = sherpa_onnx.SpeakerEmbeddingExtractorConfig(
    model="/Users/kridtaphadsae-khow/.cache/huggingface/hub/models--csukuangfj--speaker-embedding-models/snapshots/0743f301363dec56491a490f6d6cbc9d67f9a3bf/wespeaker_en_voxceleb_resnet34_LM.onnx",
    num_threads=1,
    debug=True,
    provider="cpu",
)
extractor = sherpa_onnx.SpeakerEmbeddingExtractor(config)

audio, sample_rate = read_wave(audio_fp)
stream = extractor.create_stream()
stream.accept_waveform(sample_rate=sample_rate, waveform=audio)
embedding_sherpa = np.asarray(extractor.compute(stream))

distance = cdist(
    np.expand_dims(embedding_pyannote, axis=0),
    np.expand_dims(embedding_sherpa, axis=0),
    metric="cosine",
)
print(distance)
>> array([[0.82130009]])

takipipo avatar Jan 15 '25 06:01 takipipo

The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.

If it is nearly 1, then you cannot say they are almost orthogonal.

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

Can you share ck-interview-mono.wav ?

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

Can you share ck-interview-mono.wav ?

audio clip

takipipo avatar Jan 15 '25 09:01 takipipo

The similarity score was nearly 1, indicating that the embeddings generated by both models were almost orthogonal.

If it is nearly 0, then you can consider them almost orthogonal.

If it is nearly 1, then you cannot say they are almost orthogonal.

In the context of the scipy implementation, 1 indicates orthogonality, while 0 signifies parallelism.

(screenshot of the SciPy cosine-distance documentation)

takipipo avatar Jan 15 '25 09:01 takipipo

I see what you mean now

cosine_distance = 1 - similarity_score
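A quick numeric check of this relationship, using SciPy's `cdist` as in the snippet earlier in the thread (the toy vectors are my own illustration):

```python
import numpy as np
from scipy.spatial.distance import cdist

# SciPy's "cosine" metric returns a *distance*: 1 - cosine similarity
a = np.array([[1.0, 0.0]])
b = np.array([[0.0, 1.0]])   # orthogonal to a
c = np.array([[3.0, 0.0]])   # parallel to a (different magnitude)

print(cdist(a, b, metric="cosine"))  # [[1.]] -> orthogonal
print(cdist(a, c, metric="cosine"))  # [[0.]] -> parallel

# The reported distance of ~0.8213 therefore corresponds to a cosine
# similarity of ~0.1787, i.e. the two embeddings are far from parallel.
similarity = 1.0 - 0.82130009
print(round(similarity, 4))  # 0.1787
```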

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

audio, sample_rate = read_wave(audio_fp)

Please show the complete code.

what is read_wave?

csukuangfj avatar Jan 15 '25 09:01 csukuangfj

audio, sample_rate = read_wave(audio_fp)

Please show the complete code.

what is read_wave?

I used the read_wave as you provided in the https://huggingface.co/spaces/k2-fsa/speaker-diarization/blob/main/model.py#L26-L48
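For reference, a minimal sketch of what that helper does (my paraphrase of the linked model.py, not a verbatim copy): it reads a 16-bit mono WAV with the stdlib wave module and scales samples to float32 in [-1, 1], which is the range accept_waveform() expects.

```python
import wave

import numpy as np


def read_wave(wave_filename: str):
    """Read a 16-bit mono WAV file; return (samples_float32, sample_rate)."""
    with wave.open(wave_filename) as f:
        assert f.getnchannels() == 1, "expect a single-channel wav"
        assert f.getsampwidth() == 2, "expect 16-bit samples"
        num_samples = f.getnframes()
        raw = f.readframes(num_samples)
        samples_int16 = np.frombuffer(raw, dtype=np.int16)
        # normalize int16 -> float32 in [-1, 1]
        samples_float32 = samples_int16.astype(np.float32) / 32768.0
        return samples_float32, f.getframerate()
```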

takipipo avatar Jan 15 '25 09:01 takipipo

@csukuangfj any update?

takipipo avatar Jan 23 '25 02:01 takipipo

Sorry, will check it after the Chinese New Year.

csukuangfj avatar Jan 23 '25 14:01 csukuangfj

Any progress so far? I've been testing speaker-diarization with crosstalk audio and it's not working well either

ITCZhuxy avatar Sep 15 '25 09:09 ITCZhuxy

Any progress so far? I've been testing speaker-diarization with crosstalk audio and it's not working well either

They released new models and claim better handling of overlapped speech. I converted their segmentation model to ONNX, but I can't manage to convert the identification model: https://huggingface.co/pyannote/speaker-diarization-community-1 FYI: https://github.com/pyannote/pyannote-audio/discussions/1929

altunenes avatar Oct 01 '25 08:10 altunenes