
concurrency bug in path compression for hierarchical heaps

Open shwestrick opened this issue 1 month ago • 8 comments

(I've known about this for a while but forgot to raise an issue. Trying to be better about bug tracking...)

Our implementation of path compression in the hierarchical heap union-find tree is not safe for concurrency. See Concurrent Disjoint Set Union by Jayanti and Tarjan for discussion of correct implementations. It shouldn't be too difficult to update our code by directly implementing their algorithm.
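For intuition, here's a minimal sketch of a find with CAS-based path splitting in the style of Jayanti and Tarjan. The names are all hypothetical (struct UFNode, the representative field, and uf_find are not MaPLe's actual identifiers); the key property is that each compression step is a single compare-and-swap, so concurrent finds can never install a stale pointer:

    #include <stdatomic.h>
    #include <stddef.h>

    struct UFNode {
      /* NULL when this node is a root of the union-find tree */
      _Atomic(struct UFNode *) representative;
    };

    struct UFNode *uf_find(struct UFNode *u) {
      for (;;) {
        struct UFNode *v = atomic_load(&u->representative);
        if (v == NULL) return u;               /* u is a root */
        struct UFNode *w = atomic_load(&v->representative);
        if (w == NULL) return v;               /* v is a root */
        /* Path splitting: try to swing u past v, directly to w. A failed
           CAS just means another thread compressed first; either way,
           advancing to w is safe because w remains an ancestor of u. */
        atomic_compare_exchange_weak(&u->representative, &v, w);
        u = w;
      }
    }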

Our (buggy) implementation is here. To understand what's going on, I recommend reading this (a rare case of actual MaPLe documentation!)

One might wonder why this bug isn't causing terrible things to happen. My intuition: (1) the union-find heap representative structures in MaPLe are almost always fully compressed, and (2) concurrent heap queries within the same heap are fairly rare. On the first point, note (for example) that at every LGC, when new heaps are constructed, the fresh heaps are initialized with compressed paths. It would be helpful to get a better grasp on this with actual measurements. Perhaps we could contrive a stress test that actually triggers the bug.
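For example, a contrived stress test (reusing the hypothetical struct UFNode and uf_find from the sketch above; to exercise the real code, the loop body would call HM_getLevelHeadPathCompress on chunks of a shared heap instead) could build one long representative chain and compress it from many threads at once, under TSAN:

    #include <pthread.h>
    #include <stddef.h>

    #define CHAIN 100000
    #define THREADS 8

    static struct UFNode nodes[CHAIN];

    static void *hammer(void *arg) {
      /* each thread runs compressing finds over its own slice of nodes */
      for (size_t i = (size_t)arg; i < CHAIN; i += THREADS)
        uf_find(&nodes[i]);
      return NULL;
    }

    int main(void) {
      /* nodes[0] is the root; every other node points one step leftward */
      for (size_t i = 1; i < CHAIN; i++)
        atomic_store(&nodes[i].representative, &nodes[i - 1]);

      pthread_t t[THREADS];
      for (size_t i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, hammer, (void *)i);
      for (size_t i = 0; i < THREADS; i++)
        pthread_join(&t[i], NULL);
      return 0;
    }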

shwestrick avatar Nov 21 '25 22:11 shwestrick

Maybe the simplest thing here would be to recreate the issue in a standalone unit test of this file?

Which operations are required to be safe for concurrent execution? Is it something like: each HM_chunk is owned by one thread, but the HM_HierarchicalHeap is shared? Would it be safe if there were only calls to HM_getLevelHead and not HM_getLevelHeadPathCompress?

pscollins avatar Nov 25 '25 22:11 pscollins

It would be really useful to try to actually trigger the bug!

Which operations are required to be safe for concurrent execution? Is it something like: each HM_chunk is owned by one thread, but the HM_HierarchicalHeap is shared?

The ownership of the HM_chunk and HM_HierarchicalHeap objects is a bit nuanced; ownership changes depending on the current state of the program.

There's a figure in my thesis which is helpful:

[Figure from the thesis showing the parent/child structure of hierarchical heap objects.]

The parent/child relationships determine who owns which hierarchical heap object, and which thread is permitted to access which chunk. These parent/child relationships change over time due to scheduler actions (stealing tasks, etc.)

Would it be safe if there were only calls to HM_getLevelHead and not HM_getLevelHeadPathCompress?

Yes, I believe so, but potentially with a performance overhead, since heap-membership queries would traverse longer (uncompressed) paths.
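In terms of the hypothetical sketch above, a read-only find just chases pointers without writing anything, so concurrent callers only need atomic loads:

    /* Read-only find, in the spirit of HM_getLevelHead: no compression,
       so uncompressed paths stay long for every subsequent caller. */
    struct UFNode *uf_find_readonly(struct UFNode *u) {
      struct UFNode *v;
      while ((v = atomic_load(&u->representative)) != NULL)
        u = v;
      return u;
    }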

One more important function to be aware of: linkInto performs the actual union-find linking, specifically this line: https://github.com/MPLLang/mpl/blob/a71659b091661ec6da14d6d155ea01f0ed4df336/runtime/gc/hierarchical-heap.c#L1379

So linking always goes in a single direction: right into left. This differs from typical union-find, where the linking direction is chosen according to size or rank to keep paths short.

Perhaps this is another reason why this bug hasn't caused everything to crash and burn... always linking in the same direction would avoid the issue of accidentally creating a cycle in the union-find structure.

(My understanding with link-by-size or link-by-rank is that concurrent linkings can create cycles if you are not careful.)
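To make that concrete, in terms of the earlier hypothetical sketch (not the actual linkInto code), a single-direction link is just one store, and since every edge points the same way under a fixed left-to-right order on nodes, no interleaving of concurrent links can close a cycle:

    /* Single-direction link: the right root always becomes a child of the
       left root, regardless of size or rank. With link-by-size or
       link-by-rank, two concurrent links may pick opposite directions and
       form a two-node cycle unless a CAS-based protocol prevents it. */
    void uf_link_right_into_left(struct UFNode *left, struct UFNode *right) {
      /* assumes both arguments are currently roots */
      atomic_store(&right->representative, left);
    }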

shwestrick avatar Nov 26 '25 14:11 shwestrick

This raises another question: whether we currently have a performance problem due to the linking strategy.

shwestrick avatar Nov 26 '25 14:11 shwestrick

(If you are interested in exploring this, a good place to start might be to add instrumentation to HM_HH_getLevelHead and HM_HH_getLevelHeadPathCompress, e.g., to count how many calls there are and how many edges in the union-find structure are being traversed.)
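A minimal sketch of such instrumentation (the counter and helper names here are made up; relaxed atomics keep the counters themselves race-free):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic uint64_t levelHeadCalls;
    static _Atomic uint64_t levelHeadEdgesTraversed;

    /* call once at the top of HM_HH_getLevelHead(PathCompress) */
    static inline void countLevelHeadCall(void) {
      atomic_fetch_add_explicit(&levelHeadCalls, 1, memory_order_relaxed);
    }

    /* call once per pointer chase inside the find loop; the ratio of
       edges to calls estimates the average path length traversed */
    static inline void countLevelHeadEdge(void) {
      atomic_fetch_add_explicit(&levelHeadEdgesTraversed, 1,
                                memory_order_relaxed);
    }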

shwestrick avatar Nov 26 '25 14:11 shwestrick

I was hoping that I could flag this with TSAN (and so mostly-mechanically surface all of the concurrent accesses that need to be synchronized), since IIUC, somewhat-distinctly from the algorithmic issue, there's a data race in:

    HM_UnionFindNode representative = cursor->representative;
    cursor->representative = topNode;

because we have multiple threads reading+writing this representative field in parallel without atomics or a mutex, and so it's undefined behavior. I think I was able to get things to build with TSAN enabled with a local hack:

diff --git a/runtime/Makefile b/runtime/Makefile
index bef2658d9..2d2985956 100644
--- a/runtime/Makefile
+++ b/runtime/Makefile
@@ -64,7 +64,7 @@ XMFLAGS :=
 XCPPFLAGS :=
 XCFLAGS := -fno-common -pedantic -Wall -Wextra
 OPTXCFLAGS := -Wdisabled-optimization -O2
-DBGXCFLAGS := -g -DASSERT=1 -Wno-uninitialized -O0
+DBGXCFLAGS := -g -DASSERT=1 -Wno-uninitialized -O0 -fsanitize=thread
 TRACEXCFLAGS := -DENABLE_TRACING $(OPTXCFLAGS)
 DETECTXCFLAGS := -DDETECT_ENTANGLEMENT $(OPTXCFLAGS)
 DBGDETECTXCFLAGS := -DDETECT_ENTANGLEMENT $(DBGXCFLAGS)
diff --git a/runtime/gc/decheck.c b/runtime/gc/decheck.c
index 4dff5dfe3..1b0f7d1d3 100644
--- a/runtime/gc/decheck.c
+++ b/runtime/gc/decheck.c
@@ -34,12 +34,15 @@ bool GC_HH_decheckMaxDepth(ARG_USED_FOR_DETECT_ENTANGLEMENT objptr resultRef) {
 #ifdef DETECT_ENTANGLEMENT
 void decheckInit(GC_state s) {
 #if ASSERT
-  if (mmap(SYNCH_DEPTHS_BASE, SYNCH_DEPTHS_LEN, PROT_WRITE,
-          MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, 0, 0) == MAP_FAILED) {
-      perror("mmap error");
-      exit(-1);
-  }
-  memset(synch_depths, 0, MAX_PATHS * sizeof(uint32_t));
+  /* if (mmap(0, SYNCH_DEPTHS_LEN, PROT_WRITE, */
+  /*          MAP_PRIVATE | MAP_ANONYMOUS, 0, 0) == MAP_FAILED) { */
+  /*     perror("mmap error"); */
+  /*     exit(-1); */
+  /* } */
+  /* memset(synch_depths, 0, MAX_PATHS * sizeof(uint32_t)); */
+  synch_depths = calloc(MAX_PATHS, sizeof(uint32_t));
   synch_depths[1] = 0;
 #endif

(because the MAP_FIXED mmap interacts badly with TSAN, presumably by colliding with TSAN's reserved shadow-memory region), and then compiling some of the examples with e.g.:

$ mkdir tmp_output
$ cd tmp_output
$ ../build/bin/mpl -debug true -debug-runtime true -keep g -cc clang -cc-opt '-fsanitize=thread -g2' -link-opt '-fsanitize=thread' -default-type int64 -default-type word64 -verbose 3  -output out.bin ../examples/src/ray/fib.mlb &> out.log && ./out.bin &> /tmp/run.out

which runs but fails with:

ERROR  [P00|gc/thread.c:427]: GC_HH_getNextPromotionTokenPolicy failed!

and no TSAN-reported errors.

It looks like this requires a more involved understanding of the runtime than I have for now -- let me try to take a look at https://github.com/MPLLang/mpl/issues/226 first and then circle back to this.

pscollins avatar Nov 26 '25 16:11 pscollins

I was hoping that I could flag this with TSAN (and so mostly-mechanically surface all of the concurrent accesses that need to be synchronized), since IIUC, somewhat-distinctly from the algorithmic issue, there's a data race in:

    HM_UnionFindNode representative = cursor->representative;
    cursor->representative = topNode;

because we have multiple threads reading+writing this representative field in parallel without atomics or a mutex, and so it's undefined behavior.

Yes, I agree, there's a data race here.

We've historically been a bit sloppy about this in the implementation of the run-time system. (I've been meaning to get around to cleaning it up... if only there was more time in the day.)
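For reference, a minimal sketch of making that particular pair of accesses well-defined under the C11 memory model (assuming the field can be declared _Atomic; this removes the language-level undefined behavior, though whether a plain store or a CAS is the right operation here is exactly the algorithmic question above):

    #include <stdatomic.h>

    /* hypothetical field declaration, making the pointer atomic: */
    /*   _Atomic(HM_UnionFindNode) representative; */

    HM_UnionFindNode representative =
        atomic_load_explicit(&cursor->representative, memory_order_acquire);
    atomic_store_explicit(&cursor->representative, topNode,
                          memory_order_release);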

I think I was able to get things to build with TSAN enabled with a local hack ...

I tried reproducing this, and actually with a slight change I was able to get TSAN to run. When building the program I see you're passing -cc clang, and it occurred to me that perhaps we need to use exactly the same C compiler when building the run-time system as well?

This seemed to work for me on a fresh checkout of mpl main:

$ ... # apply your diff
$ make CC=clang  # make sure build scripts use clang to build the runtime
$ build/bin/mpl -debug true -debug-runtime true -keep g -cc clang -cc-opt '-fsanitize=thread -g2' -link-opt '-fsanitize=thread' -default-type int64 -default-type word64 -verbose 3  -output out.bin examples/src/fib/sources.mlb
$ ./out.bin @mpl procs 2 --
... lots of TSAN output ...

shwestrick avatar Nov 26 '25 21:11 shwestrick

Neat! I also see it finish normally without @mpl procs 2, so I guess the GC_HH_getNextPromotionTokenPolicy failed! error was caused by the mix of gcc and clang.

It flags a bunch of small things that look like they just need the C equivalent of std::atomic (presumably _Atomic from C11's <stdatomic.h>), e.g.:

==================
WARNING: ThreadSanitizer: data race (pid=406562)
  Write of size 1 at 0x555556acc270 by main thread:
    #0 Proc_signalInitialization /home/patrick/code/mpl/runtime/./gc/processor.c:35:18 (out.bin+0x1577ae) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #1 Parallel_init /home/patrick/code/mpl/runtime/./gc/parallel.c:10:5 (out.bin+0x17b438) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #2 Chunk_2 /home/patrick/code/mpl/examples/temp_build/out.bin.0.c:38041:2 (out.bin+0x12ab55) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #3 MLton_trampoline /home/patrick/code/mpl/build/lib/mlton/include/c-main.h:38:29 (out.bin+0xf1334) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #4 MLton_threadFunc /home/patrick/code/mpl/examples/temp_build/out.bin.1.c:4069:1 (out.bin+0xf1334)
    #5 MLton_main /home/patrick/code/mpl/examples/temp_build/out.bin.1.c:4069:1 (out.bin+0xf1de9) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #6 main /home/patrick/code/mpl/examples/temp_build/out.bin.1.c:4070:45 (out.bin+0xf1fab) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)

  Previous read of size 1 at 0x555556acc270 by thread T1:
    #0 Proc_waitForInitialization /home/patrick/code/mpl/runtime/./gc/processor.c:22:11 (out.bin+0x157689) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #1 MLton_threadFunc /home/patrick/code/mpl/examples/temp_build/out.bin.1.c:4069:1 (out.bin+0xf12de) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)

  Location is global 'Proc_beginInit' of size 1 at 0x555556acc270 (out.bin+0x1578270)
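For a race like this one, a hypothetical fix sketch using <stdatomic.h> (the actual declarations in processor.c likely differ, e.g. these functions may take a GC_state argument):

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_bool Proc_beginInit = false;

    void Proc_signalInitialization(void) {
      atomic_store_explicit(&Proc_beginInit, true, memory_order_release);
    }

    void Proc_waitForInitialization(void) {
      /* spin until the main thread signals that initialization is done */
      while (!atomic_load_explicit(&Proc_beginInit, memory_order_acquire)) { }
    }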

There are also some things that look like TSAN is confused, e.g. when compiling the runtime:

hdir-dbgdetect-pie.o basis/Posix/FileSys/chdir.c                                                                                                                                              
In file included from gc.c:111:                                                                                                                                                               
gc/switch-thread.c: In function ‘switchToThread’:                                                                                                                                             
gc/switch-thread.c:26:3: warning: ‘atomic_thread_fence’ is not supported with ‘-fsanitize=thread’ [-Wtsan]                                                                                    
   26 |   __atomic_thread_fence(__ATOMIC_SEQ_CST);                                                                                                                                            
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~        

and it looks like it doesn't believe that whatever this helper does is actually atomic:

    #0 atomicLoadS32 /home/patrick/code/mpl/runtime/./platform/atomics-gcc-lt48.h:49:1 (out.bin+0x1837c3) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)

I've only ever used it with the standard C++ atomics so I don't know if that's a TSAN limitation or a true positive in terms of the C memory model. It also flags some of the HM_getLevelHeadPathCompress calls, as desired:

==================
WARNING: ThreadSanitizer: data race (pid=406562)
  Read of size 8 at 0x7ffff4b91000 by thread T1:
    #0 HM_getLevelHead /home/patrick/code/mpl/runtime/./gc/chunk.c:508:3 (out.bin+0x13fcd8) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #1 HM_getLevelHeadPathCompress /home/patrick/code/mpl/runtime/./gc/chunk.c:519:35 (out.bin+0x1488c8) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
...

  Previous write of size 8 at 0x7ffff4b91000 by main thread:
    #0 allocateInSuperBlock /home/patrick/code/mpl/runtime/./gc/block-allocator.c:272:23 (out.bin+0x18971a) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #1 tryAllocateAndAdjustSuperBlocks /home/patrick/code/mpl/runtime/./gc/block-allocator.c:350:19 (out.bin+0x143647) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #2 allocateBlocksWithPurpose /home/patrick/code/mpl/runtime/./gc/block-allocator.c:554:19 (out.bin+0x142669) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
    #3 HM_getFreeChunkWithPurpose /home/patrick/code/mpl/runtime/./gc/chunk.c:126:18 (out.bin+0x146b7f) (BuildId: 6cfac21f545d85c44eb2dd7afdfcb2a3f5802f9e)
...
SUMMARY: ThreadSanitizer: data race /home/patrick/code/mpl/runtime/./gc/chunk.c:508:3 in HM_getLevelHead
==================

It looks like it's not a ton of work to go through and resolve all of the warnings -- I'd probably start with that just so that I can rely on it to catch cases I'm not aware of. I don't see it flagging any read-after-write races between concurrent calls of HM_getLevelHeadPathCompress, for example, which is what I was expecting from looking at the function in isolation.

pscollins avatar Nov 27 '25 15:11 pscollins

Sounds great!

Years back, there was an initial effort to add atomics within the runtime: see e.g. atomics-gcc-gte48.h, included here. I'm not sure if this is the "right way" to define atomics in C. It would be great to replace these with definitions that play well with tools like TSAN (and then consistently use the new definitions).
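One possible direction (a sketch under assumptions, not a vetted design; the helper name atomicLoadS32 mirrors the TSAN output above, but the real signatures in atomics-gcc-*.h may differ) would be to define the helpers on top of C11 <stdatomic.h>, which GCC and Clang lower to the same __atomic builtins that TSAN knows how to instrument:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    static inline int32_t atomicLoadS32(_Atomic int32_t *p) {
      return atomic_load_explicit(p, memory_order_seq_cst);
    }

    static inline void atomicStoreS32(_Atomic int32_t *p, int32_t v) {
      atomic_store_explicit(p, v, memory_order_seq_cst);
    }

    static inline bool atomicCasS32(_Atomic int32_t *p,
                                    int32_t expected, int32_t desired) {
      return atomic_compare_exchange_strong_explicit(
          p, &expected, desired, memory_order_seq_cst, memory_order_seq_cst);
    }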

shwestrick avatar Nov 28 '25 16:11 shwestrick