
PyCBC threshold code is broken on Intel Mac with OpenMP

Open spxiwh opened this issue 1 year ago • 6 comments

See discussion here https://github.com/conda-forge/pycbc-feedstock/pull/84. There is some issue with the OpenMP code used for the threshold code, which appears to manifest on old Intel Macs. This is covered (thankfully) by our unit tests, but it is possible that whatever is causing this will appear in other contexts in the future. We should understand and fix this ... But we need to be able to reproduce it first!
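
For reference, the routine under discussion scans a complex array and returns the locations and values of all points whose magnitude exceeds a threshold. A minimal serial sketch of that behaviour (illustrative only, not the actual SIMD/OpenMP implementation) is:

    // Illustrative serial sketch only -- NOT the PyCBC SIMD/OpenMP implementation.
    // Collect the index and value of every point whose magnitude exceeds v.
    #include <complex>
    #include <cstdint>
    #include <vector>

    void serial_threshold(int64_t N, const std::complex<float>* arr, float v,
                          std::vector<uint32_t>& locs,
                          std::vector<std::complex<float>>& vals)
    {
        locs.clear();
        vals.clear();
        for (int64_t i = 0; i < N; i++) {
            if (std::abs(arr[i]) > v) {
                locs.push_back(static_cast<uint32_t>(i));
                vals.push_back(arr[i]);
            }
        }
    }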

spxiwh avatar Jul 29 '24 21:07 spxiwh

I unfortunately don't have time to really look into this, but I will record here my initial suggestion, especially as I think we now know that this only occurs with OpenMP enabled.

If someone can reproduce this, I would recommend seeing if we can change these lines to instead read:

            #pragma omp ordered
            {
                t+=c;
            }
            #pragma omp ordered
            {
                memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
                memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);
            }

That is, ensure that the memmoves are performed inside a barrier that enforces that they happen sequentially. It's just a suggestion, but if anyone ends up able to reproduce the problem, it should be easy to test whether that fixes it. But note that first you also have to ensure that the total number of all points above threshold (t) has been found.

josh-willis avatar Jul 29 '24 21:07 josh-willis

Let me ping @duncanmmacleod on this as well, as this will affect conda builds until this is resolved.

spxiwh avatar Aug 02 '24 10:08 spxiwh

I've been trying to reproduce this on my ARM Mac, using its built-in x86_64 emulation. I could not reproduce the failure.

One question: @duncanmmacleod, in your conda patch you only include the OpenMP header files in the compilation. You do not add -fopenmp to the linking stage back in setup.py, which I think is needed to actually use OpenMP (at least, the .so files produced are not linked to libomp unless I add this ... but I don't see the failures in any case).

Noting that this is compiled for x86_64:

(intel_env) [iwharry@harrymac events]$ file simd_threshold_cython.cpython-312-darwin.so 
simd_threshold_cython.cpython-312-darwin.so: Mach-O 64-bit bundle x86_64

spxiwh avatar Aug 02 '24 11:08 spxiwh

@duncanmmacleod This is possibly version-dependent (OpenMP version, compiler version, etc.). You said you had reproduced this on an old Mac; is there anything I can use from that to reproduce here? (Is this a fully isolated conda env, for example? Could I have the package list ... If using compilers installed system-wide, what versions are they?)

spxiwh avatar Aug 02 '24 11:08 spxiwh

@duncanmmacleod Just a poke to ask again if you have any more details on how you reproduced this? I've been unable to test Josh's proposed fix without being able to reproduce the error!

spxiwh avatar Aug 23 '24 11:08 spxiwh

I have an Intel Mac, and I am testing to see if I can reproduce the problem. I am having trouble compiling/linking against OpenMP, however:

clang: error: unsupported option '-fopenmp'

and I seem to have

$ clang --version
Apple clang version 11.0.0 (clang-1100.0.33.16)
Target: x86_64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

[some minutes later]

Ok, by installing a local version of clang like so:

conda install clang_osx-64 clangxx_osx-64

I got PyCBC to compile and fail the test!

$ clang --version              
clang version 14.0.6
Target: x86_64-apple-darwin23.6.0
Thread model: posix
InstalledDir: /Users/tito/miniconda3/envs/pycbc-dev/bin
$ python test/test_threshold.py 
========================================================================
Running CPU unit tests for Threshold:
test_threshold (__main__.TestThreshold.test_threshold) ... 14215 14215
FAIL

======================================================================
FAIL: test_threshold (__main__.TestThreshold.test_threshold)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/tito/Documents/projects/pycbc/pycbc/test/test_threshold.py", line 56, in test_threshold
    self.assertTrue((vals == self.vals).all())
AssertionError: False is not true

----------------------------------------------------------------------
Ran 1 test in 0.054s

FAILED (failures=1)

[some minutes later]

After some extra prints I find that locs is identical, but vals has 4281 differences out of 14026. The differences happen because the SIMD version returns vals = 0 when loc happens to fall in one of two ranges, [65545, 131031] and [589840, 655289] (though I imagine the change actually happens at power-of-2 boundaries).

After some more investigation with several runs (the random seed changes each time), it seems that the discrepancies happen within ranges of loc of length 65536, and they only happen for the "odd blocks":

loc 65442 equal, loc 65549 different: 65536 = 65536 * 1 lies in between
loc 131064 different, loc 131248 equal: 131072 = 65536 * 2 lies in between
loc 196546 equal, loc 196645 different: 196608 = 65536 * 3 lies in between
loc 261930 different, loc 262146 equal: 262144 = 65536 * 4 lies in between
loc 327564 equal, loc 327847 different: 327680 = 65536 * 5 lies in between
loc 393192 different, loc 393341 equal: 393216 = 65536 * 6 lies in between
loc 458694 equal, loc 459019 different: 458752 = 65536 * 7 lies in between
loc 524226 different, loc 524342 equal: 524288 = 65536 * 8 lies in between
loc 589774 equal, loc 589870 different: 589824 = 65536 * 9 lies in between
loc 655345 different, loc 655370 equal: 655360 = 65536 * 10 lies in between

Not sure what this is telling me without going through the code.

Perhaps this way of visualizing it is a bit clearer:

locs in [0, 65536) are all equal (961 values)
locs in [65536, 131072) are all different (876 values)
locs in [131072, 196608) are all equal (855 values)
locs in [196608, 262144) are all different (848 values)
locs in [262144, 327680) are all equal (920 values)
locs in [327680, 393216) are all equal (830 values)
locs in [393216, 458752) are all equal (849 values)
locs in [458752, 524288) are all different (926 values)
locs in [524288, 589824) are all equal (862 values)
locs in [589824, 655360) are all different (847 values)
locs in [655360, 720896) are all equal (885 values)
locs in [720896, 786432) are all equal (853 values)
locs in [786432, 851968) are all equal (908 values)
locs in [851968, 917504) are all different (839 values)
locs in [917504, 983040) are all equal (833 values)
locs in [983040, 1048576) are all equal (889 values)

and apparently the odd/even alternation is not repeatable; here is another run:

locs in [0, 65536) are all equal (859 values)
locs in [65536, 131072) are all different (861 values)
locs in [131072, 196608) are all equal (872 values)
locs in [196608, 262144) are all different (904 values)
locs in [262144, 327680) are all equal (869 values)
locs in [327680, 393216) are all different (881 values)
locs in [393216, 458752) are all equal (929 values)
locs in [458752, 524288) are all equal (909 values)
locs in [524288, 589824) are all equal (865 values)
locs in [589824, 655360) are all equal (876 values)
locs in [655360, 720896) are all equal (913 values)
locs in [720896, 786432) are all different (887 values)
locs in [786432, 851968) are all equal (883 values)
locs in [851968, 917504) are all different (892 values)
locs in [917504, 983040) are all equal (810 values)
locs in [983040, 1048576) are all equal (862 values)

So it seems that the discrepancy is confined to ranges in locs that have length 65536, and inside each range either all are consistent or all are different, but whether a given block is consistent or different is random.

I am going to stop here for today, but I can do more tests if needed.

titodalcanton avatar Aug 23 '24 11:08 titodalcanton

@titodalcanton Sorry, I missed that you had managed to reproduce this! (Strange that my ARM Mac, forced to run x86 instructions through Rosetta, did not manage to reproduce this.)

Can you test if Josh's proposed fix above resolves the issue?

spxiwh avatar Nov 11 '24 10:11 spxiwh

(Ah, GitHub doesn't send notifications on edits; that's why I missed this.)

spxiwh avatar Nov 11 '24 10:11 spxiwh

Sorry, apparently I missed that @josh-willis had a suggestion.

Why do we need two separate #pragma directives? The compiler does not seem to like it:

      In file included from pycbc/events/simd_threshold_cython.cpp:1286:
      pycbc/events/simd_threshold_ccode.cpp:95:13: error: exactly one 'ordered' directive must appear in the loop body of an enclosing directive
                  #pragma omp ordered
                  ^
      pycbc/events/simd_threshold_ccode.cpp:91:13: note: previous 'ordered' directive used here
                  #pragma omp ordered

However if I do this:

diff --git a/pycbc/events/simd_threshold_ccode.cpp b/pycbc/events/simd_threshold_ccode.cpp
index ae5910e5e..51ea2a931 100644
--- a/pycbc/events/simd_threshold_ccode.cpp
+++ b/pycbc/events/simd_threshold_ccode.cpp
@@ -91,10 +91,9 @@ void _parallel_threshold(int64_t N, std::complex<float> * __restrict arr,
             #pragma omp ordered
             {
                 t+=c;
+                memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
+                memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);
             }
-            memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
-            memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);
-
         }
 
         count[0] = t;

Then indeed the test passes:

$ python test/test_threshold.py                    
========================================================================
Running CPU unit tests for Threshold:
test_threshold (__main__.TestThreshold.test_threshold) ... Reference: 14327 locs, 14327 vals
Test: 14327 locs (0 different), 14327 vals (0 different)
locs in [0, 65536) are all equal (916 values)
locs in [65536, 131072) are all equal (862 values)
locs in [131072, 196608) are all equal (916 values)
locs in [196608, 262144) are all equal (858 values)
locs in [262144, 327680) are all equal (926 values)
locs in [327680, 393216) are all equal (917 values)
locs in [393216, 458752) are all equal (874 values)
locs in [458752, 524288) are all equal (864 values)
locs in [524288, 589824) are all equal (924 values)
locs in [589824, 655360) are all equal (919 values)
locs in [655360, 720896) are all equal (874 values)
locs in [720896, 786432) are all equal (875 values)
locs in [786432, 851968) are all equal (887 values)
locs in [851968, 917504) are all equal (874 values)
locs in [917504, 983040) are all equal (924 values)
locs in [983040, 1048576) are all equal (917 values)
ok

----------------------------------------------------------------------
Ran 1 test in 0.133s

OK

titodalcanton avatar Nov 11 '24 14:11 titodalcanton

Back from reading up on what this is actually doing. Yeah, indeed we can't have two directives. I'm not sure what the difference in practice is between:

            #pragma omp ordered
            {
                t+=c;
            }

            memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
            memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);

and

            #pragma omp ordered
            {
                t+=c;
                memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
                memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);
            }

is (and clearly it is somehow dependent on something deep). The latter would seem to be "compute these useful inputs in parallel and then do all these memmoves in serial". The former does the t += c in serial, but what happens next is unclear: does it start running the final memmoves in parallel even while other threads are still updating t? Maybe that's somewhat ambiguous in this case, and then Intel Mac is treating this differently than others ...
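
To make that distinction concrete, here is a standalone sketch (toy per-block counts, not the PyCBC code). In the first loop the running total is read outside the ordered region, so later iterations may already have added their own counts by the time this iteration uses it; in the second loop the read happens inside the region, where the value is guaranteed to still be this iteration's running total.

    // Standalone sketch (toy data, not the PyCBC code) of reading the running
    // total outside vs. inside an OpenMP ordered region.
    // Compile with e.g. clang++ -fopenmp ordered_sketch.cpp
    #include <cstdio>

    int main()
    {
        const int nblocks = 8;
        int counts[nblocks] = {3, 1, 4, 1, 5, 9, 2, 6};  // stand-in for per-block c
        int off_unsafe[nblocks], off_safe[nblocks];
        int t_unsafe = 0, t_safe = 0;

        #pragma omp parallel for ordered
        for (int p = 0; p < nblocks; p++) {
            int c = counts[p];
            #pragma omp ordered
            {
                t_unsafe += c;
            }
            // Racy: later iterations may already have entered their own ordered
            // region and incremented t_unsafe before this read happens.
            off_unsafe[p] = t_unsafe - c;
        }

        #pragma omp parallel for ordered
        for (int p = 0; p < nblocks; p++) {
            int c = counts[p];
            #pragma omp ordered
            {
                // Safe: inside the ordered region t_safe is exactly the running
                // total up to and including this iteration, in loop order.
                t_safe += c;
                off_safe[p] = t_safe - c;
            }
        }

        for (int p = 0; p < nblocks; p++)
            printf("block %d: unsafe offset %d, safe offset %d\n",
                   p, off_unsafe[p], off_safe[p]);
        return 0;
    }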

spxiwh avatar Nov 11 '24 15:11 spxiwh

So after doing some reading of my own, I do think it makes sense that this works:

            #pragma omp ordered
            {
                t+=c;
                memmove(outl+t-c, outl+start, sizeof(unsigned int)*c);
                memmove(outv+t-c, outv+start, sizeof(std::complex<float>)*c);
            }

From the documentation I could find, ordered will ensure that everything in this block is executed in the same order as in a sequential pass through the parallel section, in this case the for loop. But I still find it confusingly written: since the loop iteration variable i does not appear anywhere in the ordered block, it is not intuitive in what order this will be carried out.

It does, however, matter that the memmoves are done in the correct order, since they overwrite earlier parts of the array, so it is correct to move them inside the ordered block. I think earlier I had been concerned that it was just a barrier, and so we would need to allow all of the parallel sections to add their individual totals into t before it was safe to let any of the memmoves happen. But since it is not just a barrier but enforces sequence order, @titodalcanton's fix should be correct.

My guess is that what happens when you put the memmove statements outside the ordered block is compiler-dependent, as we have seen. Indeed, I don't think it's any safer for the outv array than for the outl array, though on the Intel Macs only the vals (outv) array seems to show the problem.
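
As a toy illustration (made-up counts, plain ints in place of the real outl/outv arrays) of why the moves themselves must run in iteration order, independent of the race on t: the compaction is in place, so a later block's destination can overlap an earlier block's still-unmoved source, and running the later move first would clobber data that has not been copied yet.

    // Toy in-place compaction with made-up data (plain ints instead of the real
    // outl/outv arrays). Block 2's destination [6, 10) overlaps block 1's source
    // [4, 8), so the moves only give the right answer if they run in block order.
    #include <cstdio>
    #include <cstring>

    int main()
    {
        const int window = 4, nblocks = 3;
        int counts[nblocks] = {2, 4, 4};  // hypothetical points-above-threshold per block
        // Each block writes its survivors at the start of its own window.
        int out[window * nblocks] = {10, 11, 0, 0,  20, 21, 22, 23,  30, 31, 32, 33};

        int t = 0;
        for (int p = 0; p < nblocks; p++) {  // must execute in this order
            int c = counts[p];
            t += c;
            // Move this block's survivors down to the compacted offset t - c.
            memmove(out + t - c, out + p * window, sizeof(int) * c);
        }

        for (int i = 0; i < t; i++)
            printf("%d ", out[i]);  // prints: 10 11 20 21 22 23 30 31 32 33
        printf("\n");
        return 0;
    }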

josh-willis avatar Nov 11 '24 19:11 josh-willis