
Precision issues on POWER related to SWITCH_RATIO

Open Flamefire opened this issue 5 years ago • 16 comments

I'm debugging a precision issue in PyTorch which I traced back to the parallelization done in OpenBLAS: forcing OpenBLAS to use the serial algorithm (either by hacking the source or by setting OMP_NUM_THREADS=1) makes it output the "correct" results.
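
For anyone who wants to check this outside of PyTorch, something along the following lines should show whether threading alone changes the result. This is a minimal standalone sketch, not the PyTorch workload; the matrix shape and thread counts are made up, chosen only to give the reduction enough terms:

```c
/* Minimal sketch: compare serial vs. threaded SGEMM results.
 * Link with -lopenblas. The 2 x 128 x 50000 shape is hypothetical. */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

/* Declared by OpenBLAS; redeclared here in case the header doesn't expose it. */
extern void openblas_set_num_threads(int num_threads);

int main(void) {
    const int m = 2, n = 128, k = 50000;
    float *a  = malloc(sizeof(float) * m * k);
    float *b  = malloc(sizeof(float) * k * n);
    float *c1 = calloc(m * n, sizeof(float));
    float *c2 = calloc(m * n, sizeof(float));

    srand(42);
    for (int i = 0; i < m * k; i++) a[i] = (float)rand() / RAND_MAX - 0.5f;
    for (int i = 0; i < k * n; i++) b[i] = (float)rand() / RAND_MAX - 0.5f;

    openblas_set_num_threads(1);   /* serial reference run */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, a, k, b, n, 0.0f, c1, n);

    openblas_set_num_threads(176); /* threaded run (clamped to available cores) */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, a, k, b, n, 0.0f, c2, n);

    /* Any nonzero difference here comes purely from the threading split. */
    float maxdiff = 0.0f;
    for (int i = 0; i < m * n; i++) {
        float d = c1[i] > c2[i] ? c1[i] - c2[i] : c2[i] - c1[i];
        if (d > maxdiff) maxdiff = d;
    }
    printf("max |serial - threaded| = %g\n", maxdiff);
    free(a); free(b); free(c1); free(c2);
    return 0;
}
```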

I only see the issue on POWER when using a certain number of threads, which could be as low as 12. On x86 I was unable to reproduce it.

Digging deeper, I've seen sgemm_thread_tn being invoked, which splits the matmul based on a constant SWITCH_RATIO. This is set to 32 on the x86 test node but left unset on POWER, making it default to 2.
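
For context, SWITCH_RATIO is a per-architecture compile-time constant with a generic fallback, roughly like this (paraphrased from my reading of param.h and the common headers; the exact files and guards vary between OpenBLAS versions):

```c
/* Per-architecture value in param.h, e.g. for the x86 targets mentioned here: */
#ifdef HASWELL
#define SWITCH_RATIO 32
#endif

/* Generic fallback in the common headers, used when the architecture's
 * param.h section does not define a value - this is what POWER got: */
#ifndef SWITCH_RATIO
#define SWITCH_RATIO 2
#endif
```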

Forcing that constant to the x86 value of 32, I'm seeing the exact same results on POWER and x86. So I think this is something that should be addressed here.

For reference, the issue I'm working on is https://github.com/pytorch/pytorch/issues/41469. The observed differences are: POWER: -1342.2568359375 vs. -1342.3068847656; x86: -1342.3491210938 vs. -1342.3471679688.

Flamefire avatar Oct 19 '20 16:10 Flamefire

By increasing SWITCH_RATIO you are just moving the point at which the algorithm switches from single- to multithreaded operation (the number of rows or columns must be greater than SWITCH_RATIO times the number of available threads, as OpenBLAS currently does no gradual increase in thread number). Fundamentally this is probably some kind of FMA rounding difference that varies with the ordering of operations. (Aside from that, it could be worthwhile to tune this value for performance reasons; however, the value of 32 is/was used only on Haswell and SkylakeX, and in recent versions got reduced to 4 or 8 depending on the data type.)
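
To illustrate the ordering point with a self-contained toy (this is not OpenBLAS code, just plain chunked summation): combining per-chunk partial sums rounds differently than one sequential running sum, which is the same kind of difference a change in thread partitioning produces:

```c
/* Toy demonstration that chunked summation rounds differently than
 * sequential summation in single precision. Not OpenBLAS code. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    enum { N = 100000, CHUNKS = 44 };
    float *x = malloc(sizeof(float) * N);
    srand(1);
    for (int i = 0; i < N; i++) x[i] = (float)rand() / RAND_MAX - 0.5f;

    /* Sequential: one running sum, like a single-threaded kernel. */
    float seq = 0.0f;
    for (int i = 0; i < N; i++) seq += x[i];

    /* Chunked: partial sums combined at the end, like per-thread partials. */
    float chunked = 0.0f;
    int per = N / CHUNKS;
    for (int c = 0; c < CHUNKS; c++) {
        float part = 0.0f;
        int end = (c == CHUNKS - 1) ? N : (c + 1) * per;
        for (int i = c * per; i < end; i++) part += x[i];
        chunked += part;
    }
    printf("sequential = %.9g, chunked = %.9g, diff = %g\n",
           seq, chunked, seq - chunked);
    free(x);
    return 0;
}
```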

martin-frbg avatar Oct 19 '20 20:10 martin-frbg

This doesn't seem to be the full story. It seems to also affect the number of actively used threads. In particular, on a 256-thread x86 node the 2x128 (n*m) matmul gets divided among 1x8 threads (using a SWITCH_RATIO of 16), while on a 176-thread POWER node the same matmul gets divided among 1x44 threads (SWITCH_RATIO of 2). I can verify this by printing the RangeM/N arrays, which for POWER contain only 3s and 2s.
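
For illustration, here is a toy model of that partitioning arithmetic, simplified from my reading of driver/level3/level3_thread.c. The real code additionally rounds to divisors and distributes remainders, so it lands on 1x44 rather than the 1x64 this sketch computes for the POWER case; the x86 case does come out as 1x8:

```c
/* Toy model of how SWITCH_RATIO caps the number of partitions per
 * dimension. Simplified; not the actual OpenBLAS code. */
#include <stdio.h>

static void partition(int m, int n, int nthreads, int switch_ratio) {
    /* Cap threads along M so each gets at least switch_ratio rows. */
    int nthreads_m = nthreads;
    if (nthreads_m * switch_ratio > m) {
        nthreads_m = m / switch_ratio;
        if (nthreads_m < 1) nthreads_m = 1;
    }
    /* The remaining parallelism goes to N, with the same cap. */
    int nthreads_n = nthreads / nthreads_m;
    if (nthreads_n * switch_ratio > n) {
        nthreads_n = n / switch_ratio;
        if (nthreads_n < 1) nthreads_n = 1;
    }
    printf("m=%d n=%d threads=%d ratio=%d -> %dx%d partitions\n",
           m, n, nthreads, switch_ratio, nthreads_m, nthreads_n);
}

int main(void) {
    partition(2, 128, 256, 16); /* x86-like case from this thread: 1x8  */
    partition(2, 128, 176, 2);  /* POWER-like case: 1x64 here, 1x44 in the real code */
    return 0;
}
```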

IMO this is relevant from a performance view too: IIUC, the RangeM/N values mean that each thread will process a 2x3 subchunk in this case, which might be bad for performance.

Oh, and it seems to be even worse: with 176 threads it gets divided into 11 chunks of ~12, while with 128 threads there will be 16 chunks of 8, both using a fixed SWITCH_RATIO of 8 (I'm on 0.3.7, so I had to patch that new stuff in). So using more threads leads to fewer chunks. And strangely, the run with fewer chunks fails the test.

And I did a final experiment: I hardcoded the number of chunks in n x m to 1x11 and ran the same on POWER and x86, still getting slightly different results. After changing the args -> nthreads = nthreads_m * nthreads_n; line to args -> nthreads = 1; I'm now getting the same results on POWER too. IMO this is very odd, as the work should now be the same. No idea what else to check...

Flamefire avatar Oct 20 '20 16:10 Flamefire

@martin-frbg Is there a commit which fixed that? I'd like to reverify this when I run into this issue again.

Flamefire avatar Aug 15 '22 08:08 Flamefire

No particular commit; this is an issue raised against a now-outdated version, and for something I cannot debug on the hardware accessible to me (and something that did not attract the attention of the IBM professionals who are semi-regular contributors). My impression was that you were digging into numerical accuracy differences that are entirely explainable by differences in the FMA operation sequence (and well within the expected accuracy for the data type), and the related PyTorch ticket had already been closed with what looked like a conclusion of "don't do this then".

martin-frbg avatar Aug 15 '22 13:08 martin-frbg

@Flamefire Do you have an OpenBLAS test to recreate this? @rafaelcfsousa tried to recreate it with the PyTorch tests, but they are passing on POWER10. Can you also share the compiler version and system details?

RajalakshmiSR avatar Aug 17 '22 13:08 RajalakshmiSR

@Flamefire Do you have an OpenBLAS test to recreate this? @rafaelcfsousa tried to recreate it with the PyTorch tests, but they are passing on POWER10. Can you also share the compiler version and system details?

No, I don't have one. Some system information from the time when I encountered this:

  • PyTorch version: 1.6.0-rc3
  • OS: Red Hat Enterprise Linux Server release 7.6 (Maipo)
  • GCC 8.3.0
  • Python 3.7
  • OpenBLAS 0.3.7
  • FFTW 3.3.8

It is the test_conv_large_cpu test which fails. It has since been disabled, so it doesn't get run on newer PyTorch versions.

Flamefire avatar Aug 23 '22 09:08 Flamefire

OpenBLAS 0.3.7 is old, and test_conv_large_cpu seems to work now with newer versions:

test_conv_double_backward_no_bias_cpu (__main__.TestNNDeviceTypeCPU) ... ok
test_conv_double_backward_stride_cpu (__main__.TestNNDeviceTypeCPU) ... ok
test_conv_double_backward_strided_with_3D_input_and_weight_cpu (__main__.TestNNDeviceTypeCPU) ... ok

test_Conv2d_groups (__main__.TestNN) ... ok 

test_grad_conv2d_input (__main__.TestNN) ... ok

test_conv_large_cpu (__main__.TestNNDeviceTypeCPU) ... ok

RajalakshmiSR avatar Aug 23 '22 14:08 RajalakshmiSR

I did a git bisect and found that https://github.com/xianyi/OpenBLAS/pull/3062 fixed the issue by defining SWITCH_RATIO for PPC. So this bug is fixed in 0.3.14+.
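
For reference, the fix amounts to giving the POWER targets an explicit SWITCH_RATIO in param.h, along these lines (illustrative only; the actual guarded targets and values are the ones in #3062):

```c
/* Illustrative sketch - see param.h as changed by pull request #3062
 * for the real targets and values. */
#if defined(POWER9) || defined(POWER10)
#define SWITCH_RATIO 16
#endif
```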

Flamefire avatar Sep 06 '22 13:09 Flamefire

Oops. This does raise the question of how to handle POWER8, as #3062 addressed only the newer systems. And if SWITCH_RATIO turns out to be important for moderately large multicore hardware in general, leaving it undefined (i.e. defaulting to 2) on all the other non-x86 hardware added since Goto's time could be questionable too. :(

martin-frbg avatar Sep 06 '22 14:09 martin-frbg

That's why I brought this issue up. Maybe you can derive a test from the PyTorch code and the information here, so that the OpenBLAS test suite will at least detect this issue.

Flamefire avatar Sep 06 '22 14:09 Flamefire

Trying to reproduce this on POWER8: is getting lots of "UserWarning: Using a target size (torch.Size([2, 10])) that is different to the input size (torch.Size([2, 1]))" from test_nn.py normal, expected behaviour when building from current git?

martin-frbg avatar Sep 08 '22 14:09 martin-frbg

Possibly. I haven't tested the latest PyTorch git, but the tests are sometimes faulty themselves. Try to run only the one failing test with python test_nn.py TestNNDeviceTypeCPU.test_conv_large_cpu, but it likely doesn't get run on recent PyTorch, see above:

It is the test_conv_large_cpu test which fails. It has since been disabled, so it doesn't get run on newer PyTorch versions.

Flamefire avatar Sep 09 '22 07:09 Flamefire

Thanks. Indeed that one reports "skipped=1", but removing the @onlyCUDA tags in places that look relevant does not help. This is only an 8-vcpu, 10GB instance I am testing with, so it is most likely underpowered for this purpose anyway.

martin-frbg avatar Sep 09 '22 09:09 martin-frbg

Remove both decorators or use 1.6: https://github.com/pytorch/pytorch/blob/d802fcfcd81f86333e25135a20802aa39b275da1/test/test_nn.py#L15590-L15591

Flamefire avatar Sep 09 '22 09:09 Flamefire

Thanks. I still get skipped=1 unless I also remove the tags from test_conv_large in addition to test_conv_large_nosplit, at which point my remote shell gets closed (presumably the task runs out of memory).

martin-frbg avatar Sep 09 '22 11:09 martin-frbg

Oops, yes, I meant the decorators of the test used here; the correct ones are the 3 lines at https://github.com/pytorch/pytorch/blob/d802fcfcd81f86333e25135a20802aa39b275da1/test/test_nn.py#L15817-L15819

I suspect you can recreate this on an x86 machine when building OpenBLAS with a modified SWITCH_RATIO, but maybe it needs many threads/cores for this to happen.

Flamefire avatar Sep 09 '22 13:09 Flamefire