linfa icon indicating copy to clipboard operation
linfa copied to clipboard

Reactivate appx DBSCAN impl with new disjoint set dependency

Open jogru0 opened this issue 2 years ago • 4 comments

Hi everyone!

I saw the issues with the partitions dependency described in #295 and #299, and finally the removal of the appx DBSCAN in #298.

Coincidentally, I recently was looking for a disjoint set crate to use for myself, and when I wasn't satisfied with what I found on crates.io (all existing ones seem to be not maintained and/or not documented/not tested), I decided to publish my own: https://crates.io/crates/disjoint.

So it turned out there were just some small additional features necessary in my crate to enable it as a possible replacement for partitions for you. I implemented these now, and therefore can now provide this pull request to reactivate your appx DBSCAN.

I split the implementation in two commits:

  1. Reverted #298. This just reintroduces the appx DBSCAN impl without any further changes.
  2. Replaced dependencies: partitions -> disjoint.

If you just look at the second commit, you can see that it's really almost a drop-in replacement. Only a few lines change due to the slightly different interface.

jogru0 avatar Apr 19 '23 07:04 jogru0

Any chance you're willing to run/post some benchmarks? I believe part of the rationale for removing Appx. DBSCAN, beyond the dependency issue, was that it didn't really add any new capability since the implementation itself was actually slower than just standard DBSCAN, leaving little practical reason why someone would choose to use it. @YuhanLiin, any thoughts?

quietlychris avatar Apr 24 '23 12:04 quietlychris

So, I benchmarked my disjoint set implementation itself, as I wanted it to be not slower than existing alternatives, and I wanted to choose implementation details to optimize for speed.

For that, I looked at how fast Kruskal's algorithm runs on an example graph with 1 million vertices and ~8.8 million edges, already sorted by weight, with measures including creating the disjoint set data structure and allocating memory for the result Vec on demand.

The imo most fair comparison to the partition crate is to use DisjointSetVec<()> for that, even though you could use DisjointSet directly (which turns out to not even be faster as Rust handles Vec<T> with size_of<T>() == 0 very efficiently), and to not use the Quick-Union variant of the algorithm, as that's not possible with partition.

With these restrictions, on my local AMD Ryzen 5 3600, I get the following results:

  • partition: 290ms +/- 10ms
  • disjoint: 57ms +/- 1ms

That being said, of course this is a very specific benchmark, and I assume Appx. DBSCAN is a very different algorithm, so this doesn't necessarily mean anything for the performance of that. I don't know anything about what Appx. DBSCAN even is, to be honest (I just drop-in replaced the old dependency with a new one), but my gut feeling would be that the performance of the disjoint set data structure doesn't even matter that much here, as there is so much potentially slow other stuff happening. But I really don't know and probably someone with more knowledge about clusting algorithms should do benchmarks for that, if you are interested in that.

jogru0 avatar Apr 25 '23 07:04 jogru0

For a quick comparison can you run the normal DBSCAN and Approx DBSCAN benchmarks and post the results? If Approx is slower than we should just get rid of it.

YuhanLiin avatar Apr 26 '23 01:04 YuhanLiin

Codecov Report

Patch coverage: 72.72% and project coverage change: +1.41 :tada:

Comparison is base (4adece5) 37.08% compared to head (8091015) 38.50%.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #304      +/-   ##
==========================================
+ Coverage   37.08%   38.50%   +1.41%     
==========================================
  Files          94       94              
  Lines        6196     6218      +22     
==========================================
+ Hits         2298     2394      +96     
+ Misses       3898     3824      -74     
Impacted Files Coverage Δ
...infa-clustering/src/appx_dbscan/cells_grid/cell.rs 45.90% <ø> (+45.90%) :arrow_up:
...linfa-clustering/src/appx_dbscan/cells_grid/mod.rs 55.35% <60.00%> (+55.35%) :arrow_up:
...linfa-clustering/src/appx_dbscan/clustering/mod.rs 48.48% <83.33%> (+48.48%) :arrow_up:

... and 50 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

codecov-commenter avatar Apr 26 '23 02:04 codecov-commenter