
MAINT: Use fused multiply-add for complex number calculations

Open yairchu opened this issue 1 year ago • 10 comments

Background:

  • Depending on the compiler, plain C code may or may not use fused multiply-add. For example, numpy-1.26.4 on macOS did not use fma.
  • However, many calculations in NumPy are not plain C code but use SIMD intrinsics, and those deliberately use fma without relying on the compiler to choose it.
  • When fma is not used in the plain C loops, NumPy produces inconsistent results. Sometimes even the exact same deterministic code executed twice in the same program produces different results each time (#26940).

This PR makes the basic calculations for complex numbers always use fused multiply-add, via the fma functions from <math.h>. It also fixes a minor off-by-one bug that contributed to the inconsistency in #26940.
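To illustrate what "use fma in complex multiplication" means numerically, here is a hedged Python sketch (not the PR's actual C code). `fma(x, y, z)` computes `x*y + z` with a single rounding; `Fraction` gives exact arithmetic, and the final `float()` performs that one rounding. The helper names are my own, for illustration only:

```python
# Sketch: emulate C's fma() and apply it to complex multiplication.
# Hypothetical helpers, not NumPy's implementation.
from fractions import Fraction

def fma(x: float, y: float, z: float) -> float:
    """Emulate C's fma(): compute x*y + z exactly, then round once."""
    return float(Fraction(x) * Fraction(y) + Fraction(z))

def complex_mul_fused(a: complex, b: complex) -> complex:
    """(ar + i*ai) * (br + i*bi), using one fma per component."""
    re = fma(a.real, b.real, -(a.imag * b.imag))
    im = fma(a.real, b.imag, a.imag * b.real)
    return complex(re, im)
```

Depending on whether the compiler emits fused instructions for the plain `a.real*b.real - a.imag*b.imag` form, results can differ in the last bit, which is exactly the inconsistency described above.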

yairchu avatar Jul 16 '24 11:07 yairchu

Looking at the CI, it appears that on some environments fma doesn't actually perform a fused operation and falls back to a non-fused one (https://github.com/numpy/numpy/actions/runs/9955770700/job/27504241438), so my new tests fail there.

Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

yairchu avatar Jul 16 '24 11:07 yairchu

> Looking at the CI, it appears that on some environments fma doesn't actually perform a fused operation and falls back to a non-fused one (https://github.com/numpy/numpy/actions/runs/9955770700/job/27504241438), so my new tests fail there.
>
> Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

I think that'd be the appropriate action to take. The baseline build of NumPy targets the minimum supported processor, so it may not have FMA instructions. We detect whether we can run a more optimal version and dispatch to it dynamically.

If no hardware instruction is available, you might be able to use np.__config__.__cpu_baseline__ or np.__config__.__cpu_features__ to gate the test case on x86?
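A hedged sketch of such gating, following the suggestion above. The module and attribute locations are assumptions: recent NumPy builds expose the detected CPU features as a dict (e.g. `{"FMA3": True, ...}`) in `_multiarray_umath`, but the exact path varies between NumPy releases, so this probes a couple of candidates and falls back to `False`:

```python
# Hypothetical helper to gate an FMA-sensitive test on CPU support.
# Attribute locations are assumptions and may differ across NumPy versions.
import importlib

def cpu_supports_fma3() -> bool:
    """Best-effort check for hardware FMA3 support (x86); False if unknown."""
    for name in ("numpy._core._multiarray_umath",
                 "numpy.core._multiarray_umath"):
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue
        features = getattr(mod, "__cpu_features__", None)
        if features:
            return bool(features.get("FMA3", False))
    return False

# A test could then be gated with, e.g.:
# @pytest.mark.skipif(not cpu_supports_fma3(), reason="no hardware FMA")
```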

Mousius avatar Jul 17 '24 09:07 Mousius

> Would it be appropriate to skip the new test on such platforms? If so, what is the best way of doing that?

I don't know. This seems to be distinct from hardware capabilities (where the test fails, the SIMD code is used and it does use FMA), so it is probably a compiler decision not to fuse, perhaps for speed.

We should be able to ensure FMA is used wherever SIMD uses it; speed is probably mostly irrelevant, since the slow loop is used relatively rarely, but I don't know. A best effort of just writing fma and letting the compiler do whatever it likes could be done, but that gives no promises to you as a user.

seberg avatar Jul 17 '24 10:07 seberg

You're right that a best effort to use fma gains no complete cross-platform promise, but I suppose that having the tests explicitly list the platforms for which there is no promise at least makes that list explicit, and guards against regressions on the other platforms?

As for the option of using SIMD code for scalar cases - wouldn't that make things slower?

yairchu avatar Jul 17 '24 20:07 yairchu

> Looking at the CI, it appears that on some environments fma doesn't actually perform a fused operation and falls back to a non-fused one?

My understanding was that fma and fmaf should compute an accurate result irrespective of the hardware they run on. From https://en.cppreference.com/w/c/numeric/math/fma: "Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type." Shouldn't that mean the test should pass everywhere? Curious to know why that CI run failed.
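The single-rounding guarantee quoted above can be demonstrated in a self-contained way. This sketch emulates a conforming fma with exact `Fraction` arithmetic (one final rounding) and uses a classic input where fused and unfused evaluation provably differ in the last bit:

```python
# Demonstration of fma's single-rounding semantics, emulated via exact
# arithmetic. float(...) rounds exactly once, mimicking a conforming fma.
from fractions import Fraction

def fma(x, y, z):
    return float(Fraction(x) * Fraction(y) + Fraction(z))

a = 1.0 + 2.0**-27           # exactly representable as a double
unfused = a * a - 1.0        # rounded multiply, then subtract: two roundings
fused = fma(a, a, -1.0)      # a single rounding

# a*a is exactly 1 + 2**-26 + 2**-54; the standalone multiply rounds away
# the 2**-54 term, so the unfused result loses it while the fused keeps it.
assert unfused == 2.0**-26
assert fused == 2.0**-26 + 2.0**-54
```

So a true fma is not a speed hack; it changes (and improves) the rounded result, which is why fused and non-fused builds can disagree.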

> As for the option of using SIMD code for scalar cases - wouldn't that make things slower?

It might be slower, but computing the value for a scalar would hardly show up in a profile, let alone be a bottleneck for a program (unless someone is looping over an array element by element, in which case I would argue they are using array processing incorrectly).

r-devulap avatar Jul 18 '24 20:07 r-devulap

@r-devulap

> My understanding was that fma and fmaf should compute an accurate result irrespective of the hardware they run on. From https://en.cppreference.com/w/c/numeric/math/fma: "Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type." Shouldn't that mean the test should pass everywhere? Curious to know why that CI run failed.

If that's the case, then perhaps those CI tests failed because the scalar version is more accurate than the SIMD one!

From @seberg's comment I infer that whether fma does what cppreference says is platform-dependent:

> A best effort to just write fma and let the compiler do whatever it likes to do

Looking at the test output to figure out what fma's current behaviour is doesn't help, because it just says that 2.018506e-13-2.649923e-13j isn't equal to itself.

Would it generally be a good idea for assert_equal to print at higher precision when actual and desired format to the same string? For now I added another commit to do that; it may belong in a separate PR, but it's included here to help make sense of the CI results.
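The confusing "x isn't equal to itself" message comes from formatting at six significant digits: two distinct doubles can render as the same string. A small stdlib-only illustration (the value is taken from the error message above; the neighbour is chosen arbitrarily):

```python
# Two different doubles that format identically at the default %e precision.
import math

a = 2.018506e-13
b = math.nextafter(a, 1.0)   # the adjacent, slightly larger double (3.9+)

assert a != b                # genuinely different values...
assert "%e" % a == "%e" % b  # ...but identical at 6 significant digits
assert repr(a) != repr(b)    # repr() round-trips and does distinguish them
```

Printing with full round-trip precision (as `repr` does) would make such failure messages self-explanatory.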

yairchu avatar Jul 19 '24 09:07 yairchu

@r-devulap

> My understanding was that fma and fmaf should compute an accurate result irrespective of the hardware they run on. From https://en.cppreference.com/w/c/numeric/math/fma: "Computes (x * y) + z as if to infinite precision and rounded only once to fit the result type." Shouldn't that mean the test should pass everywhere? Curious to know why that CI run failed.

Thanks, @r-devulap. My assumption was that it would fall back to a naive implementation, but this indicates otherwise.

> If that's the case, then perhaps those CI tests failed because the scalar version is more accurate than the SIMD one! From @seberg's comment I infer that whether fma does what cppreference says is platform-dependent:
>
> > A best effort to just write fma and let the compiler do whatever it likes to do
>
> Looking at the test output to figure out what fma's current behaviour is doesn't help, because it just says that 2.018506e-13-2.649923e-13j isn't equal to itself.

> Would it generally be a good idea for assert_equal to print at higher precision when actual and desired format to the same string? For now I added another commit to do that; it may belong in a separate PR, but it's included here to help make sense of the CI results.

Given what @r-devulap has said, I think they're both correct. I'm guessing that assert_array_max_ulp with a maxulp of 0.5 or 1 would work, since the two implementations round slightly differently but both results are still correct.
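A sketch of the suggested ulp-based comparison. `numpy.testing.assert_array_max_ulp` checks that results differ by at most `maxulp` units in the last place, which tolerates the one-rounding difference between fused and unfused code paths; the `nulp` variant mentioned later in this thread behaves similarly. The inputs here are arbitrary, chosen to be exactly one ulp apart:

```python
# Ulp-based assertions that tolerate a last-bit difference between
# fused and unfused results.
import numpy as np
from numpy.testing import assert_array_max_ulp, assert_array_almost_equal_nulp

x = np.float64(2.0**-26)
y = np.nextafter(x, 1.0)     # exactly 1 ulp away from x

assert_array_max_ulp(x, y, maxulp=1)          # passes: within 1 ulp
assert_array_almost_equal_nulp(x, y, nulp=1)  # passes: within 1 unit of spacing
```

With `maxulp=0` (or exact equality) either of these would fail, so the tolerance encodes precisely "both roundings are acceptable".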

Mousius avatar Jul 20 '24 13:07 Mousius

@yairchu could you try changing some of the remaining failures on 32-bit from assert_array_almost_equal to assert_array_almost_equal_nulp?

Mousius avatar Aug 01 '24 11:08 Mousius

> @yairchu could you try changing some of the remaining failures on 32-bit from assert_array_almost_equal to assert_array_almost_equal_nulp?

@Mousius done!

The currently failing test appears to be an unrelated CI failure in test_randomstate.py. I'm assuming it's unrelated to the complex-number change, since I also see the azure-pipelines numpy.numpy actions failing for the most recent commit builds too.

yairchu avatar Aug 05 '24 20:08 yairchu

@yairchu the 32-bit test still fails.

r-devulap avatar Oct 16 '24 17:10 r-devulap