whisper.cpp
Diarization
Some unsuccessful experiments with audio embedding clustering
Tried to apply fuzzy C-means clustering to:
- embeddings after the initial convolution in the encoder
- self KV embeddings from each encoder layer
- KQV embeddings from each encoder layer
- embeddings from the last encoder layer
- cross KV embeddings of each decoder layer
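For reference, a minimal NumPy sketch of fuzzy C-means, the clustering step applied to each of the embedding types above. This is not the code used in the experiments: the function name `fuzzy_c_means`, the fuzzifier `m = 2`, the random initialization and the synthetic `frames` array (standing in for per-frame embeddings) are all assumptions made for illustration.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means on the rows of X (n_samples, n_features).

    Returns (centers, W), where W[i, k] is the soft membership of
    sample i in cluster k and every row of W sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Start from random memberships, normalized per sample.
    W = rng.random((X.shape[0], n_clusters))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Cluster centers: means of the samples weighted by W**m.
        Wm = W ** m
        centers = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # Squared Euclidean distance from every sample to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # avoid division by zero
        # Membership update: W[i, k] proportional to d[i, k]**(-2 / (m - 1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        W_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(W_new - W).max() < tol
        W = W_new
        if converged:
            break
    return centers, W

# Synthetic stand-in for frame embeddings: two well-separated groups.
rng = np.random.default_rng(1)
frames = np.vstack([rng.normal(0.0, 0.3, (50, 8)),
                    rng.normal(3.0, 0.3, (50, 8))])
centers, W = fuzzy_c_means(frames, n_clusters=2)
labels = W.argmax(axis=1)  # hard speaker label per frame
```

Unlike k-means, every frame gets a membership weight for every cluster, so frames with overlapping speakers could in principle be spotted by memberships that stay close to each other.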
Instead of clustering the full embedding dimensions, first reduce dimensionality using SVD:
- decompose the embeddings: E = USV
- compute singular vectors: U' = US
- project E on U' and take the top few coordinates
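The three steps above could look roughly like this in NumPy. This is a sketch under assumptions, not the experiment's code: it assumes `E` holds one embedding per column (shape `n_dims x n_frames`), reads "project E on U'" as the product `U'^T E`, and uses the hypothetical names `svd_reduce` and `k`. Note that `numpy.linalg.svd` returns the last factor already transposed (`Vh`), so that `E = U @ diag(S) @ Vh`.

```python
import numpy as np

def svd_reduce(E, k):
    """Reduce frame embeddings to k coordinates per frame.

    E: array of shape (n_dims, n_frames), one column per frame
       (assumed layout).
    Returns an array of shape (n_frames, k).
    """
    # Decompose the embeddings: E = U @ diag(S) @ Vh.
    U, S, Vh = np.linalg.svd(E, full_matrices=False)
    # U' = US: broadcasting scales column j of U by S[j].
    U_scaled = U * S
    # Project E on U'; row j holds the coordinate of every frame
    # along the j-th scaled singular vector.
    coords = U_scaled.T @ E
    # Singular values come sorted in descending order, so the first
    # k rows are the top coordinates.
    return coords[:k].T

rng = np.random.default_rng(0)
E = rng.normal(size=(16, 40))  # 16-dim embeddings for 40 frames
Z = svd_reduce(E, k=3)
print(Z.shape)  # (40, 3)
```

The rows of `Z` can then be passed to the clustering step in place of the full embeddings.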