hdbscan
hdbscan copied to clipboard
Is it possible to use HDBSCAN with the Levenshtein distance?
Is it possible to use HDBSCAN with the Levenshtein distance? My dataset is too large to make a full distance matrix to feed into it.
The Levenshtein distance satisfies the triangle inequality which I understand is one requirement. Is this possible?
In the current implementation it needs to be a distance supported by sklearn's BallTree or KDTree structures, which Levenshtein is not. You can, however, use sparse precomputed distance matrices (using the scipy.sparse format). Thus you could compute distances to k nearest neighbors (for a sufficiently large value of k) and use that.
How could you work out what the k nearest neighbors are?
I think you would want to use an approximate nearest neighbor library than supports levenshtein distance. I believe hnsw in nmslib meets those criteria.