star-clustering
distance metric, word vectors and references
AFAICT, you are using Euclidean distances; however, these are not expected to work well for high-dimensional problems. For word vectors in particular, one should use cosine distance. This is equivalent to Euclidean distance for normalized vectors, but AFAICT you are not using normalized vectors, rather the raw fastText vectors, which are not normalized. I would suggest: (1) use normalized vectors for the word vectors example; (2) allow using a custom metric. Also: (3) is there any writeup explaining the star-clustering method? (4) do you have references to other methods doing similar clustering on word vectors? It would be interesting to see a comparison for this specific use case you are highlighting (and which is very interesting). Thanks!
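For context, a minimal sketch of the equivalence mentioned above (the random `vectors` array stands in for raw fastText vectors; it is an illustrative assumption, not code from this repo): for unit-norm vectors, squared Euclidean distance equals 2·(1 − cosine similarity), so normalizing first makes the two metrics rank pairs identically.

```python
import numpy as np

# Hypothetical word-vector matrix; in practice these would be raw fastText vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(5, 300))

# L2-normalize each row so cosine and Euclidean distances agree in ranking.
normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

u, v = normed[0], normed[1]
cosine_sim = u @ v
euclidean_sq = np.sum((u - v) ** 2)

# For unit vectors: ||u - v||^2 = 2 * (1 - cos(u, v))
assert np.isclose(euclidean_sq, 2 * (1 - cosine_sim))
```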
Yeah, using cosine distance for the word embeddings is a really good idea and usually gives superior results to Euclidean distance for this particular use case. If you add me as a collaborator, I'll see if I can include that as well in the branch with the upper threshold.
Haven't pushed the code yet, but cosine distance and an adjusted limit constant gave some very nice clusters with clear, well-defined themes for the word vectors.
https://github.com/josephius/star-clustering/blob/feature/upper-threshold/basic_english_limit-0p618_cosine.txt
Commit https://github.com/josephius/star-clustering/commit/8a1d776de9fe9d7dddd8d145835b4954cf7c0017
adds a new angular distance metric class (https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity) in a new distances.py file, which should allow fairly hassle-free extension with custom distances should one be so inclined.
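For reference, a minimal sketch of what an angular distance computation looks like, following the definition linked above (the function name and interface here are illustrative, not the actual distances.py API):

```python
import numpy as np

def angular_distance(u, v):
    """Angular distance between two vectors, scaled to [0, 1].

    Defined as the angle between the vectors (arccos of cosine
    similarity) divided by pi. Unlike raw cosine distance, this is a
    proper metric that satisfies the triangle inequality.
    """
    cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi
```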
Seth added some support for custom metrics. As for the writeup and similar methods, there is the Reddit comment thread here: https://www.reddit.com/r/MachineLearning/comments/gsu3zm/p_star_clustering_a_clustering_algorithm_that/
I'm still working on a possible paper/manual/pseudocode writeup.
I would suggest using scipy.spatial.distance.pdist instead of your distance module. This will give you access to a large collection of distances (euclidean, minkowski, cityblock, cosine, correlation, hamming, jaccard, mahalanobis...) ;)
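For illustration, pdist computes all pairwise distances over an observation matrix in one call, returning a condensed vector that squareform can expand into the full distance matrix (the array here is a made-up example):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 4 observations in 3 dimensions.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])

# Condensed vector of the 4*3/2 = 6 pairwise cosine distances.
condensed = pdist(X, metric="cosine")

# Expand into the full symmetric 4x4 distance matrix.
D = squareform(condensed)
print(D)
```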
I tried using the cosine distance from scipy.spatial at first, but switched to custom distance code when it appeared that the scipy distance function could only compare one vector against one other vector.
As the number of items in the dataset grows, doing matrix/matrix or vector/matrix multiplications instead of individual vector/vector multiplications for each distance becomes much more efficient. Are you aware of any way to pass matrices instead of vectors to the scipy.spatial distance methods to enable these more efficient operations?
@shy1 It looks like cosine from scipy.spatial.distance is indeed for 1-D arrays, but pdist, which @rfezzani suggested, operates on a whole 2-D observation matrix (m observations by n dimensions) at once, so that should work for matrices.
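To make the batching point concrete, a small comparison on made-up data (not code from this repo): pdist replaces an explicit double loop of vector/vector cosine calls with a single vectorized computation over the matrix, which is what the efficiency question above is after.

```python
import numpy as np
from scipy.spatial.distance import cosine, pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))  # e.g. 100 word vectors of dimension 300

# Slow path: one vector/vector call per pair.
n = X.shape[0]
D_loop = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D_loop[i, j] = D_loop[j, i] = cosine(X[i], X[j])

# Fast path: all pairs in one vectorized call over the matrix.
D_batch = squareform(pdist(X, metric="cosine"))

assert np.allclose(D_loop, D_batch)
```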