thread sanitizer: data race in ch4 progress
I get a data-race warning in MPIDI_progress_test (ch4_progress.h:154) when running under ThreadSanitizer with MPIR_CVAR_ASYNC_PROGRESS=1 on a MacBook M2. For future tracking of the issue, here is the error reported by the sanitizer:
```
[0] ==================
[0] WARNING: ThreadSanitizer: data race (pid=92906)
[0] Write of size 8 at 0x0001113e2c00 by thread T3 (mutexes: write M0):
[0] #0 MPIDI_progress_test ch4_progress.h:154 (libpmpi.0.dylib:arm64+0xb386f0) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #1 progress_fn init_async.c:91 (libpmpi.0.dylib:arm64+0xb3705c) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #2 MPLI_thread_start <null>:143658688 (libpmpi.0.dylib:arm64+0xfb7b0c) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0]
[0] Previous read of size 8 at 0x0001113e2c00 by main thread:
[0] #0 MPIDI_progress_test ch4_progress.h:154 (libpmpi.0.dylib:arm64+0xbe8e1c) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #1 MPIR_Test_state request_impl.c:312 (libpmpi.0.dylib:arm64+0xbd8074) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #2 MPIR_Test request_impl.c:352 (libpmpi.0.dylib:arm64+0xbd8af4) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #3 MPI_Test test.c:125 (libmpi.0.dylib:arm64+0x3c8f08) (BuildId: 3ec3c1f6855330b093421a16ab718e4a32000000200000000100000000000c00)
[0] #4 main start_pready.c:212 (start_pready:arm64+0x100002e70) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0]
[0] Location is global 'MPIDI_global' at 0x0001113e1840 (libpmpi.0.dylib+0x529ac00)
[0]
[0] Mutex M0 (0x0001113e2bc0) created at:
[0] #0 pthread_mutex_init <null>:145757920 (libclang_rt.tsan_osx_dynamic.dylib:arm64+0x89c8) (BuildId: 0e75567ae34738aa8f5083868a152d0f32000000200000000100000000000b00)
[0] #1 MPID_Init ch4_init.c:474 (libpmpi.0.dylib:arm64+0xcd945c) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #2 MPII_Init_thread mpir_init.c:233 (libpmpi.0.dylib:arm64+0xb332c4) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #3 MPIR_Init_thread_impl mpir_init.c:349 (libpmpi.0.dylib:arm64+0xb3443c) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #4 MPI_Init_thread init_thread.c:109 (libmpi.0.dylib:arm64+0xf6f9c) (BuildId: 3ec3c1f6855330b093421a16ab718e4a32000000200000000100000000000c00)
[0] #5 MTest_Init_thread <null>:143658816 (start_pready:arm64+0x100003c04) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0] #6 MTest_Init <null>:143658816 (start_pready:arm64+0x1000041a8) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0] #7 main start_pready.c:154 (start_pready:arm64+0x1000026ec) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0]
[0] Thread T3 (tid=1506127, running) created by main thread at:
[0] #0 pthread_create <null>:145757920 (libclang_rt.tsan_osx_dynamic.dylib:arm64+0x747c) (BuildId: 0e75567ae34738aa8f5083868a152d0f32000000200000000100000000000b00)
[0] #1 MPL_thread_create <null>:143658688 (libpmpi.0.dylib:arm64+0xfb7a88) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #2 MPIR_Start_progress_thread_impl init_async.c:273 (libpmpi.0.dylib:arm64+0xb36180) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #3 MPII_init_async init_async.c:328 (libpmpi.0.dylib:arm64+0xb375e0) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #4 MPII_Init_thread mpir_init.c:336 (libpmpi.0.dylib:arm64+0xb336b4) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #5 MPIR_Init_thread_impl mpir_init.c:349 (libpmpi.0.dylib:arm64+0xb3443c) (BuildId: 064b16d3179035cebeaa4a1ed671735b32000000200000000100000000000c00)
[0] #6 MPI_Init_thread init_thread.c:109 (libmpi.0.dylib:arm64+0xf6f9c) (BuildId: 3ec3c1f6855330b093421a16ab718e4a32000000200000000100000000000c00)
[0] #7 MTest_Init_thread <null>:143658816 (start_pready:arm64+0x100003c04) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0] #8 MTest_Init <null>:143658816 (start_pready:arm64+0x1000041a8) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0] #9 main start_pready.c:154 (start_pready:arm64+0x1000026ec) (BuildId: f31015d5435f3b0e856d6068bce8419032000000200000000100000000000c00)
[0]
[0] SUMMARY: ThreadSanitizer: data race ch4_progress.h:154 in MPIDI_progress_test
[0] ==================
```
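The failing test is MPICH's start_pready, but I would expect any MPI_Test polling loop to hit the same path while the async progress thread is running. A minimal reproducer sketch (not the actual test source; run with MPIR_CVAR_ASYNC_PROGRESS=1 under TSan):

```c
#include <mpi.h>

/* Poll a nonblocking operation with MPI_Test while the async progress
 * thread is spinning; both threads then exercise the same
 * MPIDI_progress_test path that TSan flags above. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request req;
    MPI_Ibarrier(MPI_COMM_WORLD, &req);

    int done = 0;
    while (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```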
I suspect this is the recursive mutex check: the mutex owner is set/reset under the lock, but it is read with a plain load, without holding the lock, in MPIDUI_THREAD_CS_ENTER.
https://github.com/pmodels/mpich/blob/959b11dc24f9853b88725a43dcbc9ce8c716622d/src/mpid/common/thread/mpidu_thread_fallback.h#L139-L161
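For reference, here is a minimal sketch of the pattern I mean, with hypothetical names (the real code is at the link above). TSan pairs up exactly this kind of pre-lock plain read with the locked write:

```c
#include <pthread.h>

/* Hypothetical sketch of a recursive-lock check, not the actual MPICH
 * code. 'owner' and 'count' are written while holding the mutex but
 * read with plain loads before acquiring it. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_t owner;            /* set under the mutex */
    int count;                  /* recursion depth, updated under the mutex */
} rec_lock_t;

static rec_lock_t lock = { PTHREAD_MUTEX_INITIALIZER };  /* owner/count zeroed */

static void cs_enter(rec_lock_t *l)
{
    /* Plain reads of l->count and l->owner: these race with the writes
     * below whenever another thread holds the mutex. */
    if (l->count > 0 && pthread_equal(l->owner, pthread_self())) {
        l->count++;                     /* we already own it: just recurse */
    } else {
        pthread_mutex_lock(&l->mutex);
        l->owner = pthread_self();      /* written under the lock */
        l->count = 1;
    }
}

static void cs_exit(rec_lock_t *l)
{
    if (--l->count == 0)
        pthread_mutex_unlock(&l->mutex);    /* count was reset under the lock */
}
```

With async progress enabled, the progress thread and the main thread both run this path concurrently, which matches the write/read pair in the report above.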
Well, this line from the log probably invalidates my theory 🤷
```
[0] Location is global 'MPIDI_global' at 0x0001113e1840 (libpmpi.0.dylib+0x529ac00)
```
Wait, that is where the mutexes (and owners) are stored. So it could still be the case 😅.
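If the recursive check were to be kept, I think the plain load could be made well-defined by holding the owner in an atomic; again a hypothetical sketch, not a proposed patch:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    _Atomic uintptr_t owner;    /* 0 = unowned; assumes pthread_t converts
                                   to an integer, true on common platforms */
    int count;                  /* only touched by the current owner */
} rec_lock_t;

static void cs_enter(rec_lock_t *l)
{
    uintptr_t self = (uintptr_t) pthread_self();
    /* Relaxed is enough: a thread can only read its own id here while it
     * is still the owner, since it resets 'owner' before unlocking. */
    if (atomic_load_explicit(&l->owner, memory_order_relaxed) == self) {
        l->count++;
    } else {
        pthread_mutex_lock(&l->mutex);
        atomic_store_explicit(&l->owner, self, memory_order_relaxed);
        l->count = 1;
    }
}

static void cs_exit(rec_lock_t *l)
{
    if (--l->count == 0) {
        atomic_store_explicit(&l->owner, 0, memory_order_relaxed);
        pthread_mutex_unlock(&l->mutex);
    }
}
```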
@thomasgillis, since we removed the recursive locking, could you try it and see if this warning is fixed?