TransmogrifAI icon indicating copy to clipboard operation
TransmogrifAI copied to clipboard

Find good JVM implementation of DBSCAN and wrap it into TransmogrifAI

Open leahmcguire opened this issue 5 years ago • 2 comments

Problem We would like to offer DBSCAN as an algorithm in TransmogrifAI but there is no spark implementation.

Solution Find an implementation of DBSCAN (https://en.wikipedia.org/wiki/DBSCAN) in the JVM with good support and edge case coverage and wrap it into TransmogrifAI. See https://docs.transmogrif.ai/en/stable/developer-guide/index.html#wrapping-a-non-serializable-external-library and https://docs.transmogrif.ai/en/stable/developer-guide/index.html#writing-your-own-estimator

Alternatives If you feel you can cover all the edge cases you could also write a new implementation of DBSCAN

leahmcguire avatar Aug 29 '18 18:08 leahmcguire

I started by following this discussion and then reading the paper revisited DBSCAN. It seems that we should aim for ρ-approximate DBSCAN instead, since that one is the only feasible one complexity wise.

Here is a good survey on existing libraries implementing DBSCAN algorithm (most of them only support d <= 3) and don't run fast at all.

tovbinm avatar Sep 01 '18 04:09 tovbinm

I found a newer Java implementation of DBSCAN in SMILE library. We can try and port or to Spark I presume.

tovbinm avatar May 23 '19 04:05 tovbinm