star-clustering
distance metric, word vectors and references
AFAICT, you are using Euclidean distances; however, these are not expected to work well for high-dimensional problems. For word vectors in particular, one should use cosine distance. This is equivalent to Euclidean distance for normalized vectors, but AFAICT you are not using normalized vectors, rather the raw fastText vectors, which are not normalized. I would suggest: (1) use normalized vectors for the word vectors example; (2) allow using a custom metric. Also: (3) is there any writeup explaining the star-clustering method? (4) do you have references to other methods doing similar clustering on word vectors? It would be interesting to see a comparison for this specific use case you are highlighting (and which is very interesting). Thanks!
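For context, a minimal sketch of the equivalence mentioned above (the random `vectors` array stands in for raw fastText vectors; it is an illustrative assumption, not code from this repo): for unit-norm vectors, squared Euclidean distance equals 2·(1 − cosine similarity), so normalizing first makes the two metrics rank pairs identically.

```python
import numpy as np

# Hypothetical word-vector matrix; in practice these would be raw fastText vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 300))

# L2-normalize each row so cosine and Euclidean distances agree in ranking.
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

u, v = normed[0], normed[1]
cosine_sim = u @ v
euclidean_sq = np.sum((u - v) ** 2)

# For unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v))
assert np.isclose(euclidean_sq, 2 * (1 - cosine_sim))
```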
Yeah, using cosine distance for the word embeddings is a really good idea and usually gives superior results to Euclidean distance for this particular use case. If you add me as a collaborator, I'll see if I can include that as well in the branch with the upper threshold.
Haven't pushed the code yet, but cosine distance and an adjusted limit constant gave some very nice clusters with clear, well-defined themes for the word vectors.
https://github.com/josephius/star-clustering/blob/feature/upper-threshold/basic_english_limit-0p618_cosine.txt
Commit https://github.com/josephius/star-clustering/commit/8a1d776de9fe9d7dddd8d145835b4954cf7c0017
adds a new angular distance metric class (https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity) in a new distances.py file, which should allow fairly hassle-free extension with custom distances should one be so inclined.
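For reference, a minimal sketch of what an angular distance computation looks like, following the definition linked above (the function name and interface here are illustrative, not the actual distances.py API):

```python
import numpy as np

def angular_distance(u, v):
    """Angular distance between two vectors, scaled to [0, 1].

    Defined as the angle between the vectors (arccos of cosine
    similarity) divided by pi. Unlike raw cosine distance, this is a
    proper metric that satisfies the triangle inequality.
    """
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi
```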
Seth added some support for custom metrics. As for the writeup and similar methods, there is the Reddit comment thread here: https://www.reddit.com/r/MachineLearning/comments/gsu3zm/p_star_clustering_a_clustering_algorithm_that/
I'm still working on a possible paper/manual/pseudocode writeup.
I would suggest using scipy.spatial.distance.pdist instead of your distance module. This will give you access to a large collection of distances (euclidean, minkowski, cityblock, cosine, correlation, hamming, jaccard, mahalanobis...) ;)
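For illustration, pdist computes all pairwise distances over an observation matrix in one call, returning a condensed vector that squareform can expand into the full distance matrix (the array here is a made-up example):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 4 observations in 3 dimensions.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])

# Condensed vector of the 4*3/2 = 6 pairwise cosine distances.
condensed = pdist(X, metric="cosine")

# Expand into the full symmetric 4x4 distance matrix.
D = squareform(condensed)
print(D)
```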
I tried using the cosine distance from scipy.spatial at first, but switched to custom distance code when it appeared that the scipy distance function could only compare one vector against one other vector.
As the number of items in the dataset grows, doing matrix/matrix or vector/matrix multiplications instead of individual vector/vector multiplications for each distance becomes much more efficient. Are you aware of any way to pass matrices instead of vectors to the scipy.spatial distance methods to enable these more efficient operations?
@shy1 It looks like cosine from scipy.spatial.distance is indeed for 1-D arrays, but pdist, which @rfezzani suggested, operates on a whole 2-D observation matrix (m observations by n dimensions) at once, so that should work for matrices.
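To make the batching point concrete, a small comparison on made-up data (not code from this repo): pdist replaces an explicit double loop of vector/vector cosine calls with a single vectorized computation over the matrix, which is what the efficiency question above is after.

```python
import numpy as np
from scipy.spatial.distance import cosine, pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))  # e.g. 100 word vectors of dimension 300

# Slow path: one vector/vector call per pair.
n = X.shape[0]
D_loop = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D_loop[i, j] = D_loop[j, i] = cosine(X[i], X[j])

# Fast path: all pairs in one vectorized call over the matrix.
D_batch = squareform(pdist(X, metric="cosine"))

assert np.allclose(D_loop, D_batch)
```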