
More efficient union-find implementation using GADTs and inline records

mmottl opened this issue 9 years ago

This branch implements changes to the union-find implementation to make it more efficient by removing indirections, thus eliminating extra heap blocks and pointer dereferencing. This is achieved by combining GADTs and inline records. This trick can likely be used in many other contexts.
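A minimal sketch of the shape of this encoding (the names and exact layout here are illustrative, not the patch itself). The inline records put value, rank, and parent directly in the constructor's block, and the single-constructor wrapper is unboxed away:

```ocaml
(* [Root] and [Inner] carry their fields as inline records, so no
   separate record block sits behind the constructor. *)
type ('a, _) tree =
  | Root : { mutable value : 'a; mutable rank : int } -> ('a, [ `root ]) tree
  | Inner : { mutable parent : 'a t } -> ('a, [ `inner ]) tree

(* The existential wrapper hides whether a node is currently a root or
   an inner node; [@@unboxed] (the single-tag unboxing mentioned below,
   available from OCaml 4.04) removes its heap block entirely. *)
and 'a t = { mutable node : 'a node }
and 'a node = Node : ('a, _) tree -> 'a node [@@unboxed]

let create value = { node = Node (Root { value; rank = 0 }) }

(* Find the root wrapper, compressing paths on the way back up. *)
let rec find_root (t : 'a t) : 'a t =
  match t.node with
  | Node (Root _) -> t
  | Node (Inner inner) ->
    let root = find_root inner.parent in
    inner.parent <- root;
    root

let get t =
  match (find_root t).node with
  | Node (Root r) -> r.value
  | Node (Inner _) -> assert false

let same_class t1 t2 = find_root t1 == find_root t2

(* Union by rank: attach the shallower tree under the deeper root. *)
let union t1 t2 =
  let r1 = find_root t1 and r2 = find_root t2 in
  if r1 != r2 then
    match r1.node, r2.node with
    | Node (Root root1), Node (Root root2) ->
      if root1.rank < root2.rank then r1.node <- Node (Inner { parent = r2 })
      else begin
        if root1.rank = root2.rank then root1.rank <- root1.rank + 1;
        r2.node <- Node (Inner { parent = r1 })
      end
    | _ -> assert false
```

In this sketch a node is one block for the wrapper-free constructor plus its inline fields, rather than a constructor block pointing at a separate record.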

There are no API changes. The changes have been tested, but should also be tested with the internal Jane Street test suite.

There is currently a small bug in beta 2 of OCaml 4.04 that prevents me from writing the type definitions in the most straightforward way; the workaround using a type variable is only mildly more complicated. Note that OCaml 4.04, which can unbox single-tag constructors, is required to see the efficiency gains.

The compression loop also no longer uses tail recursion, which avoids expensive list allocations. The worst-case recursion depth cannot realistically blow the stack anyway, due to the O(log N) bound.
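To illustrate the point, here is the idea on a plain parent-pointer representation (a hypothetical stand-in, not the PR's types):

```ocaml
(* A root is its own parent. *)
type node = { mutable parent : node }

let create () =
  let rec n = { parent = n } in
  n

(* Direct recursion: each link costs one stack frame instead of one list
   cell.  A tail-recursive version would have to collect the visited
   nodes in a list on the way down and relink them in a second pass,
   allocating a cell per link.  With union by rank, chain length is
   O(log n), so the stack depth is harmless. *)
let rec find n =
  if n.parent == n then n
  else begin
    let root = find n.parent in
    n.parent <- root;  (* path compression on the way back up *)
    root
  end
```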

mmottl, Oct 04 '16 19:10

Testing against the test suite was a little messy without support for that, but the patch I just pushed should help. I hope the public Core_kernel will soon support building and running its test suites.

mmottl, Oct 06 '16 16:10

Ping. This also seems like a good improvement that we should upstream. CC @bcc32 @rnml @ceastlund

yminsky, Oct 10 '20 12:10

I implemented this on our most recent version of [Core.Union_find] and benchmarked the current version against the proposed one. In particular, our benchmarks call each function in the interface on single nodes, as well as on nodes in unions of 50 or 100 nodes, to get some variation in path compression complexity.

In general, it seems the new implementation performs better on small cases and cases which don't require traversing parents (~20%-40% speed improvement), while performing worse in cases which do require traversing parents (~50%-80% speed penalty). The new implementation does reduce allocations by 2 words per node (7 -> 5), but I don't think the improvements are worth the penalties.
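For readers outside Jane Street, a rough stdlib-only sketch of the kind of benchmark described (a plain parent-pointer union-find as a stand-in; the actual comparison used internal benchmarks against [Core.Union_find]):

```ocaml
(* Hypothetical stand-in union-find with union by rank. *)
type node = { mutable parent : node; mutable rank : int }

let create () =
  let rec n = { parent = n; rank = 0 } in
  n

let rec find n =
  if n.parent == n then n
  else begin
    let root = find n.parent in
    n.parent <- root;
    root
  end

let union a b =
  let ra = find a and rb = find b in
  if ra != rb then
    if ra.rank < rb.rank then ra.parent <- rb
    else begin
      if ra.rank = rb.rank then ra.rank <- ra.rank + 1;
      rb.parent <- ra
    end

(* Build one class of 50 nodes, then time repeated [find] calls.
   Caveat: the first [find] flattens its path, so this loop measures the
   already-compressed case; measuring compression cost itself requires
   fresh chains per iteration, hence the 50- and 100-node variations. *)
let nodes = Array.init 50 (fun _ -> create ())
let () = Array.iter (fun n -> union nodes.(0) n) nodes

let () =
  let t0 = Sys.time () in
  for _ = 1 to 1_000_000 do
    ignore (find nodes.(49))
  done;
  Printf.printf "1M finds in one 50-node class: %.3fs\n" (Sys.time () -. t0)
```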

Do you have your own benchmarks which suggest this implementation will have better performance in cases we may have missed?

tdelvecchio-jsc, Apr 14 '23 15:04

It has been a long time since I opened this PR, and the compiler and runtime have gone through significant changes since then. As usual, your most frequent use cases will determine which implementation works better for you. I'm a little surprised that traversing parents would be slower with this representation; I don't see any obvious reason why it should be, but I haven't looked into this in years. It may be worthwhile investigating why that case appears slower. Perhaps a small modification of the code could improve performance there; sometimes even simple things, like changing the order of variants in the type or of match cases, make a difference. Otherwise the representation seems close to optimal to me.

mmottl, Apr 14 '23 15:04