qdrant-web-ui
tSNE vector compression optimised using wasm-bhtsne
#108
The computation with wasm-bhtsne seems very fast, and tSNE appears to be a solid choice! However, did you consider any other algorithms, or do you have resources on why (bh)tSNE might be superior to e.g. UMAP or PCA?
Some resources:
- Nature Article 2022 comparing algorithms
- Scikit Learn Examples, mainly clustering 2D points but some algorithms work for high-dimensional data too
- Tensorflow Embedding Projector using tSNE, UMAP and PCA
- Reddit Discussion
I'm just curious which algorithm might perform best overall.
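For what it's worth, PCA is the one baseline from that list that is cheap enough to sketch inline. Below is a pure-Python illustration (my own sketch, not anything the web UI uses): mean-centre the data, find the top two principal components by power iteration with deflation, and project onto them.

```python
import math
import random

def pca_2d(vectors):
    """Project vectors to 2-D with PCA (pure-Python illustration).

    A minimal baseline sketch, not the wasm implementation: mean-centre
    the data, find the top two covariance eigenvectors by power
    iteration with deflation, and project onto them.
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    X = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    def cov_times(w):
        # (X^T X) w without materialising the d x d covariance matrix
        s = [sum(X[i][j] * w[j] for j in range(d)) for i in range(n)]
        return [sum(s[i] * X[i][j] for i in range(n)) for j in range(d)]

    def top_component(deflate=None, iters=200):
        w = [rng.random() for _ in range(d)]
        for _ in range(iters):
            if deflate is not None:
                # Project out the first component so we converge to the second.
                dot = sum(a * b for a, b in zip(w, deflate))
                w = [a - dot * b for a, b in zip(w, deflate)]
            w = cov_times(w)
            norm = math.sqrt(sum(a * a for a in w)) or 1.0
            w = [a / norm for a in w]
        return w

    rng = random.Random(0)
    pc1 = top_component()
    pc2 = top_component(deflate=pc1)
    return [(sum(a * b for a, b in zip(row, pc1)),
             sum(a * b for a, b in zip(row, pc2))) for row in X]
```

Obviously PCA is linear and will not separate clusters the way tSNE/UMAP do, but it is deterministic and effectively free, which makes it a useful sanity-check view.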
Hi @do-me, it looks like the main bottleneck is not the algorithm but the implementation and its ability to use multiple threads for computation. If there is a solid implementation of any of those algorithms, we are happy to make it configurable.
@kartik-gupta-ij if I saw correctly, the wasm-bhtsne implementation has all parameters hard-coded apart from the iteration count. E.g. due to the perplexity default it crashes with fewer than 60 vectors (I wrote a dirty fix here that adds random vectors to the calculation so it works in either case). Also, depending on the matrix size, other defaults might be better suited.
Is anyone interested in forking and adding the missing params to the wasm version? Unfortunately I'm not experienced in Rust.
Apart from the missing parameters it's fairly performant. I tested it e.g. with 1566 vectors of size 384 at 1000 iterations; it takes 20s on my i7 laptop.
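The "dirty fix" mentioned above can be sketched as a small wrapper (pure-Python illustration; `run_tsne` and `min_points=60` are placeholders for whatever tSNE entry point and threshold you actually use, not wasm-bhtsne's real API): pad small inputs with random vectors before running tSNE, then discard the padded points' embeddings.

```python
import random

def tsne_with_padding(vectors, run_tsne, min_points=60, seed=42):
    """Work around tSNE implementations that crash on small inputs.

    Hypothetical sketch of the 'dirty fix' above: if fewer than
    `min_points` vectors are given, pad the input with random vectors
    drawn from the data's value range, run tSNE on the padded matrix,
    and return only the embeddings of the original points.
    `run_tsne` is passed in by the caller; nothing here assumes a
    particular tSNE binding.
    """
    n = len(vectors)
    if n >= min_points:
        return run_tsne(vectors)
    dim = len(vectors[0])
    rng = random.Random(seed)
    lo = min(min(v) for v in vectors)
    hi = max(max(v) for v in vectors)
    padding = [[rng.uniform(lo, hi) for _ in range(dim)]
               for _ in range(min_points - n)]
    embedded = run_tsne(vectors + padding)
    return embedded[:n]  # drop the padded points' embeddings
```

A proper fix would of course expose perplexity (and the other hard-coded parameters) instead, since Barnes-Hut tSNE implementations typically require the sample count to exceed a multiple of the perplexity.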
@generall This demonstrates a method for dimensionality reduction based on an HNSW index. It's very fast compared to tSNE. If Qdrant provided a way to expose such results to the frontend based on the collection's HNSW index, then (in my testing) visualizing even tens of thousands of nodes would take mere seconds.
https://github.com/jean-pierreBoth/annembed/blob/master/src/embedder.rs
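To illustrate the general idea (this is NOT annembed's actual algorithm, just a toy sketch of why graph-based layouts are fast): instead of working with all pairwise distances, you lay points out using only the neighbour graph, which an HNSW index already maintains, so the cost scales with the number of edges rather than N².

```python
import math
import random

def knn_graph_layout(vectors, k=3, iters=100, lr=0.05, seed=0):
    """Toy 2-D layout of a k-nearest-neighbour graph.

    Illustrative sketch only: attraction along kNN edges plus weak
    repulsion between random pairs. Real graph embedders (annembed,
    UMAP, ...) use principled objectives; this just shows the
    edge-driven structure of the computation.
    """
    n = len(vectors)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Brute-force kNN here; in the real setting the HNSW index supplies this.
    edges = []
    for i in range(n):
        nbrs = sorted((dist(vectors[i], vectors[j]), j)
                      for j in range(n) if j != i)
        edges.extend((i, j) for _, j in nbrs[:k])

    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]

    for _ in range(iters):
        # Attraction along graph edges pulls neighbours together.
        for i, j in edges:
            for c in range(2):
                delta = pos[j][c] - pos[i][c]
                pos[i][c] += lr * delta
                pos[j][c] -= lr * delta
        # Weak repulsion between random pairs keeps clusters apart.
        for _ in range(n):
            i, j = rng.randrange(n), rng.randrange(n)
            if i == j:
                continue
            d = max(dist(pos[i], pos[j]), 0.1)
            for c in range(2):
                push = (pos[i][c] - pos[j][c]) * lr * 0.1 / (d * d)
                pos[i][c] += push
                pos[j][c] -= push
    return pos
```

The per-iteration work is O(edges + n) rather than O(n²), which is the property that makes reusing the collection's HNSW graph attractive.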