TransmogrifAI
TransmogrifAI copied to clipboard
Find good JVM implementation of DBSCAN and wrap it into TransmogrifAI
Problem We would like to offer DBSCAN as an algorithm in TransmogrifAI but there is no spark implementation.
Solution Find an implementation of DBSCAN (https://en.wikipedia.org/wiki/DBSCAN) in the JVM with good support and edge case coverage and wrap it into TransmogrifAI. See https://docs.transmogrif.ai/en/stable/developer-guide/index.html#wrapping-a-non-serializable-external-library and https://docs.transmogrif.ai/en/stable/developer-guide/index.html#writing-your-own-estimator
Alternatives If you feel you can cover all the edge cases you could also write a new implementation of DBSCAN
I started by following this discussion and then reading the paper revisited DBSCAN. It seems that we should aim for ρ-approximate DBSCAN instead, since that one is the only feasible one complexity wise.
Here is a good survey on existing libraries implementing DBSCAN algorithm (most of them only support d <= 3) and don't run fast at all.
I found a newer Java implementation of DBSCAN in SMILE library. We can try and port or to Spark I presume.