hdbscan icon indicating copy to clipboard operation
hdbscan copied to clipboard

Is it possible to use HDBSCAN with the Levenshtein distance?

Open lesshaste opened this issue 4 years ago • 3 comments

Is it possible to use HDBSCAN with the Levenshtein distance? My dataset is too large to make a full distance matrix to feed into it.

The Levenshtein distance satisfies the triangle inequality which I understand is one requirement. Is this possible?

lesshaste avatar Mar 15 '21 11:03 lesshaste

In the current implementation it needs to be a distance supported by sklearn's BallTree or KDTree structures, which Levenshtein is not. You can, however, use sparse precomputed distance matrices (using the scipy.sparse format). Thus you could compute distances to k nearest neighbors (for a sufficiently large value of k) and use that.

lmcinnes avatar Mar 15 '21 15:03 lmcinnes

How could you work out what the k nearest neighbors are?

lesshaste avatar Mar 15 '21 16:03 lesshaste

I think you would want to use an approximate nearest neighbor library than supports levenshtein distance. I believe hnsw in nmslib meets those criteria.

lmcinnes avatar Mar 16 '21 01:03 lmcinnes