
Two test failures on ppc64le

Open musicinmybrain opened this issue 3 years ago • 12 comments

Describe the bug

While packaging version 1.0.3 for Fedora Linux, we found that the tests test_csd_morlet and test_time_frequency fail on ppc64le.

Steps to reproduce

Run the unit tests on a ppc64le system. There is no reason to believe this error is specific to the Fedora Linux RPM build environment.

Expected results

All tests pass.

Actual results

=================================== FAILURES ===================================
_______________________________ test_csd_morlet ________________________________
mne/time_frequency/tests/test_csd.py:529: in test_csd_morlet
    assert_allclose(csd._data[[0, 3, 5]] * sfreq, power)
E   AssertionError: 
E   Not equal to tolerance rtol=1e-07, atol=0
E   
E   Mismatched elements: 9 / 9 (100%)
E   Max absolute difference: 13.4892946
E   Max relative difference: 0.99930094
E    x: array([[2.698620e+01+3.629330e-17j, 9.270776e-04+1.165640e-21j,
E           2.699776e+01-1.773874e-17j],
E          [2.698733e+01+3.677728e-17j, 1.591260e-03-2.498003e-21j,...
E    y: array([[1.350002e+01, 2.383404e-03, 1.351705e+01],
E          [1.349949e+01, 2.270551e-03, 1.603048e-04],
E          [1.349873e+01, 6.293613e-03, 1.350833e+01]])
----------------------------- Captured stdout call -----------------------------
Not setting metadata
1 matching events found
Applying baseline correction (mode: mean)
0 projection items activated
0 bad epochs dropped
Computing cross-spectral density from epochs...
[done]
Computing cross-spectral density from epochs...
[done]
Computing cross-spectral density from epochs...
[done]
Computing cross-spectral density from epochs...
[done]
Computing cross-spectral density from epochs...
[done]
----------------------------- Captured stderr call -----------------------------

  0%|          | CSD epoch blocks : 0/1 [00:00<?,       ?it/s]
100%|██████████| CSD epoch blocks : 1/1 [00:00<00:00,  850.08it/s]

  0%|          | CSD epoch blocks : 0/1 [00:00<?,       ?it/s]
100%|██████████| CSD epoch blocks : 1/1 [00:00<00:00,  911.01it/s]

  0%|          | CSD epoch blocks : 0/1 [00:00<?,       ?it/s]
100%|██████████| CSD epoch blocks : 1/1 [00:00<00:00,  745.79it/s]

  0%|          | CSD epoch blocks : 0/1 [00:00<?,       ?it/s]
100%|██████████| CSD epoch blocks : 1/1 [00:00<00:00,  900.26it/s]

  0%|          | CSD epoch blocks : 0/1 [00:00<?,       ?it/s]
100%|██████████| CSD epoch blocks : 1/1 [00:00<00:00,  887.12it/s]
_____________________________ test_time_frequency ______________________________
mne/time_frequency/tests/test_tfr.py:132: in test_time_frequency
    assert_allclose(epochs_amplitude_2.data**2, epochs_power_picks.data)
E   AssertionError: 
E   Not equal to tolerance rtol=1e-07, atol=0
E   
E   Mismatched elements: 17637 / 17640 (100%)
E   Max absolute difference: 8.4915913e-21
E   Max relative difference: 63.63766583
E    x: array([[[[6.565396e-24, 3.838394e-24, 1.699187e-24, ..., 1.025243e-22,
E             8.846176e-23, 7.457034e-23],
E            [1.024062e-22, 9.587875e-23, 8.570987e-23, ..., 6.220701e-23,...
E    y: array([[[[9.130315e-23, 9.576540e-23, 9.999026e-23, ..., 2.239139e-22,
E             2.292956e-22, 2.341922e-22],
E            [1.239185e-22, 1.324456e-22, 1.410500e-22, ..., 1.732860e-22,...
----------------------------- Captured stdout call -----------------------------
Opening raw data file /builddir/build/BUILD/mne-python-1.0.3/mne/time_frequency/tests/../../io/tests/data/test_raw.fif...
    Read a total of 3 projection items:
        PCA-v1 (1 x 102)  idle
        PCA-v2 (1 x 102)  idle
        PCA-v3 (1 x 102)  idle
    Range : 25800 ... 40199 =     42.956 ...    66.930 secs
Ready.
Not setting metadata
7 matching events found
Setting baseline interval to [-0.19979521315838786, 0.0] sec
Applying baseline correction (mode: mean)
3 projection items activated
Loading data for 7 events and 420 original time points ...
0 bad epochs dropped
Not setting metadata
7 matching events found
Setting baseline interval to [-0.19979521315838786, 0.0] sec
Applying baseline correction (mode: mean)
Created an SSP operator (subspace dimension = 3)
3 projection items activated
Loading data for 1 events and 420 original time points ...
Removing projector <Projection | PCA-v1, active : True, n_channels : 102>
Removing projector <Projection | PCA-v2, active : True, n_channels : 102>
Removing projector <Projection | PCA-v3, active : True, n_channels : 102>
Loading data for 7 events and 420 original time points ...
Loading data for 7 events and 420 original time points ...
Loading data for 7 events and 420 original time points ...
Loading data for 7 events and 420 original time points ...
0 bad epochs dropped
Loading data for 7 events and 420 original time points ...
Not setting metadata
Loading data for 7 events and 420 original time points ...
Loading data for 7 events and 420 original time points ...
Loading data for 7 events and 420 original time points ...
Not setting metadata
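
For readers unfamiliar with the check that fails in test_csd_morlet, here is a minimal sketch of the comparison, using only the first row of each array as printed in the output above (real parts only). It assumes nothing about the cause of the mismatch; it only shows how far outside rtol=1e-07 the values are.

```python
import numpy as np
from numpy.testing import assert_allclose

# First row of x (actual) and y (desired), copied from the failure output.
# Only the real parts of x are used; its imaginary parts are ~1e-17 or smaller.
actual = np.array([2.698620e+01, 9.270776e-04, 2.699776e+01])
desired = np.array([1.350002e+01, 2.383404e-03, 1.351705e+01])

# assert_allclose passes only if |actual - desired| <= atol + rtol * |desired|.
rel_diff = np.abs(actual - desired) / np.abs(desired)
ratio = actual / desired

try:
    assert_allclose(actual, desired, rtol=1e-7, atol=0)
    passed = True
except AssertionError:
    passed = False

print("relative differences:", rel_diff)
print("actual / desired:    ", ratio)
print("passed:", passed)
```

In this row the two large entries of x are close to twice the corresponding entries of y, so the mismatch is far too large to be a rounding-level difference.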

Additional information

Platform:         Linux-5.18.10-200.fc36.ppc64le-ppc64le-with-glibc2.35.9000
Python:           3.11.0b5 (main, Jul 26 2022, 00:00:00) [GCC 12.1.1 20220628 (Red Hat 12.1.1-3)]
Executable:       /usr/bin/python3
CPU:              ppc64le: 8 cores
Memory:           Unavailable (requires "psutil" package)
mne:              1.0.3
numpy:            1.22.0 {blas=flexiblas, lapack=flexiblas}
scipy:            1.8.1
matplotlib:       3.5.2 {backend=agg}
sklearn:          1.0.2
numba:            Not found
nibabel:          3.2.2
nilearn:          0.9.1
dipy:             1.5.0
cupy:             Not found
pandas:           1.3.5
pyvista:          Not found
pyvistaqt:        Not found
ipyvtklink:       Not found
vtk:              9.1.0
PyQt5:            5.15.6
ipympl:           Not found
pooch:            v1.5.2
mne_bids:         Not found
mne_nirs:         Not found
mne_features:     Not found
mne_qt_browser:   Not found
mne_connectivity: Not found

I am happy to run any additional tests that may be helpful, or try any candidate fixes, on real or emulated ppc64le hardware.

musicinmybrain avatar Aug 01 '22 16:08 musicinmybrain

@sanjayankur31

musicinmybrain avatar Aug 01 '22 16:08 musicinmybrain

Out of curiosity, what backend is flexiblas using? I think the default is OpenBLAS, could you run the tests with the Netlib backend?

cbrnr avatar Aug 01 '22 17:08 cbrnr

> Out of curiosity, what backend is flexiblas using? I think the default is OpenBLAS, could you run the tests with the Netlib backend?

I will try it.

musicinmybrain avatar Aug 01 '22 17:08 musicinmybrain

With the Netlib backend (export FLEXIBLAS=NETLIB), the two failures reported here on ppc64le and the one failure reported in https://github.com/mne-tools/mne-python/issues/10984 on aarch64 disappear; there is, however, a new failure on x86_64:

________________________ test_make_eeg_average_ref_proj ________________________
mne/tests/test_proj.py:341: in test_make_eeg_average_ref_proj
    assert_array_almost_equal(reref._data[eeg].mean(axis=0), 0, decimal=19)
E   AssertionError: 
E   Arrays are not almost equal to 19 decimals
E   
E   Mismatched elements: 11 / 14400 (0.0764%)
E   Max absolute difference: 1.8973538e-19
E   Max relative difference: inf
E    x: array([4.4017478825648476e-20, 5.3052496929694344e-20,
E          4.4440995299275626e-20, ..., 4.5344497109680210e-20,
E          4.2408116225865306e-20, 5.2346636140315759e-20])
E    y: array(0)
----------------------------- Captured stdout call -----------------------------
Opening raw data file /builddir/build/BUILD/mne-python-1.0.3/mne/tests/../io/tests/data/test_raw.fif...
    Read a total of 3 projection items:
        PCA-v1 (1 x 102)  idle
        PCA-v2 (1 x 102)  idle
        PCA-v3 (1 x 102)  idle
    Range : 25800 ... 40199 =     42.956 ...    66.930 secs
Ready.
Reading 0 ... 14399  =      0.000 ...    23.974 secs...
Adding average EEG reference projection.
1 projection items deactivated
Created an SSP operator (subspace dimension = 4)
4 projection items activated
SSP projectors applied...
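
The backend switch used in this comment could be scripted roughly as follows. This is only a sketch: the helper name `pytest_invocation` is hypothetical, the test IDs are taken from the tracebacks in this issue, and the only backend name used is the `NETLIB` value quoted above.

```python
import os
import sys

# Test IDs taken from the tracebacks in this issue.
FAILING_TESTS = [
    "mne/time_frequency/tests/test_csd.py::test_csd_morlet",
    "mne/time_frequency/tests/test_tfr.py::test_time_frequency",
]


def pytest_invocation(backend, test_ids=FAILING_TESTS):
    """Return (command, environment) for running tests under one FlexiBLAS backend.

    Hypothetical helper: it only prepares the call and does not run anything.
    """
    env = dict(os.environ)
    env["FLEXIBLAS"] = backend  # same effect as `export FLEXIBLAS=NETLIB`
    cmd = [sys.executable, "-m", "pytest", *test_ids]
    return cmd, env


cmd, env = pytest_invocation("NETLIB")
print(" ".join(cmd))
print("FLEXIBLAS =", env["FLEXIBLAS"])
```

To actually run the tests, the pair can be passed to `subprocess.run(cmd, env=env)` from the root of an mne-python source tree.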

musicinmybrain avatar Aug 02 '22 03:08 musicinmybrain

> With the Netlib backend (export FLEXIBLAS=NETLIB), the two failures reported here on ppc64le and the one failure reported in #10984 on aarch64 disappear

Cool, so those two architectures now pass all tests, right?

> there is, however, a new failure on x86_64

This test looks like it is too strict (19 decimals), the actual difference is still very small. We might consider relaxing the test a bit. Can you see what the smallest number of decimals is until this test passes?

cbrnr avatar Aug 02 '22 07:08 cbrnr

Thanks for looking at this.

> Cool, so those two architectures now pass all tests, right?

Correct, on Netlib BLAS. Does this project target only Netlib BLAS, or should these tests pass on other BLASes? Or do you think these are actual bugs in OpenBLAS?

Obviously, we can run the tests with Netlib BLAS in the Fedora RPM build if we like, but we can’t enforce a particular flexiblas backend at runtime.

> This test looks like it is too strict (19 decimals), the actual difference is still very small. We might consider relaxing the test a bit. Can you see what the smallest number of decimals is until this test passes?

Based on Max absolute difference: 1.8973538e-19, you would expect 18, and my testing confirms that is sufficient in practice. It is still a very strict bound.
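
The expectation of 18 follows from the criterion documented for `numpy.testing.assert_array_almost_equal`, namely `abs(desired - actual) < 1.5 * 10**(-decimal)`. A small pure-Python sketch of that arithmetic (the helper name `largest_passing_decimal` is made up for illustration):

```python
def largest_passing_decimal(max_abs_diff, limit=30):
    """Largest `decimal` for which max_abs_diff < 1.5 * 10**(-decimal) holds."""
    best = None
    for decimal in range(limit + 1):
        if max_abs_diff < 1.5 * 10.0 ** (-decimal):
            best = decimal
    return best


reported = 1.8973538e-19  # "Max absolute difference" from the traceback

threshold_19 = 1.5 * 10.0 ** (-19)  # bound implied by decimal=19
threshold_18 = 1.5 * 10.0 ** (-18)  # bound implied by decimal=18

result = largest_passing_decimal(reported)
print(result)
```

The reported difference lies above the decimal=19 bound and below the decimal=18 bound, which is why 18 is the largest value that passes for this particular run.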

musicinmybrain avatar Aug 02 '22 13:08 musicinmybrain

> Correct, on Netlib BLAS. Does this project target only Netlib BLAS, or should these tests pass on other BLASes? Or do you think these are actual bugs in OpenBLAS?

I'm not sure, I really am no expert, but what I know is that OpenBLAS is very fast but sometimes not that mature. We do not target any particular BLAS implementation, ideally it should not matter at all. AFAIK we don't test different BLAS implementations but just take whatever NumPy defaults to on our test platforms (which do not include ARM64).
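
To see which BLAS a given test run actually picked up, something like the following sketch may help. `numpy.show_config()` reports the build-time configuration; `threadpoolctl.threadpool_info()` reports the libraries loaded at runtime, if that package is installed. The helper name `blas_report` is hypothetical, and whether the FlexiBLAS backend is listed depends on the threadpoolctl version, which is an assumption to verify.

```python
import contextlib
import io

import numpy as np


def blas_report():
    """Collect NumPy's build config and, if possible, the loaded BLAS libraries."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()  # prints the BLAS/LAPACK NumPy was built against
    report = {"numpy_build_config": buf.getvalue()}

    try:
        from threadpoolctl import threadpool_info
    except ImportError:
        # threadpoolctl is optional; without it only build info is available.
        report["loaded_blas"] = None
    else:
        report["loaded_blas"] = [
            lib for lib in threadpool_info() if lib.get("user_api") == "blas"
        ]
    return report


report = blas_report()
print(report["numpy_build_config"])
print(report["loaded_blas"])
```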

> Obviously, we can run the tests with Netlib BLAS in the Fedora RPM build if we like, but we can’t enforce a particular flexiblas backend at runtime.

We should take a closer look at the failing tests, maybe we can work around this issue somehow. But the more I think about it, the more I believe that this might be an upstream issue. Can you find out which NumPy function is causing the problematic output?

> Based on Max absolute difference: 1.8973538e-19, you would expect 18, and my testing confirms that is sufficient in practice. It is still a very strict bound.

Good, I'm OK with lowering the tolerance to 18 digits, but I want to confirm with others. @larsoner WDYT?

cbrnr avatar Aug 02 '22 13:08 cbrnr

Yes 18 digits is fine for that test, feel free to open a PR

Indeed we should dig into whether it's an OpenBLAS bug or if we need to adjust some tolerance at our end internally (e.g., a 1e-6 instead of 1e-7 in a pinv rtol or something that eventually causes a division by zero) to make it work

@musicinmybrain do you have any suggestions for where I could read about how to emulate aarch64 to replicate this failure? Then I could dig into OpenBLAS etc.

larsoner avatar Aug 02 '22 13:08 larsoner

> Yes 18 digits is fine for that test, feel free to open a PR

OK!

You could set up a self-hosted runner, they support Linux ARM64. Ideally, you could install Fedora 36.

cbrnr avatar Aug 02 '22 14:08 cbrnr

> @musicinmybrain do you have any suggestions for where I could read about how to emulate aarch64 to replicate this failure? Then I could dig into OpenBLAS etc.

Personally, I’m using Fedora 36 on x86_64 (but a Fedora 36 VM could work too). I’ve done dnf install fedora-packager and set up mock to do RPM builds in a chroot, and I’ve done dnf install qemu-user-static to allow software emulation of other architectures.

Then it’s something like:

fedpkg co python-mne
cd python-mne
# Edit the spec file as needed (switch BLAS, uncomment test skips, etc.)
fedpkg mockbuild --root fedora-rawhide-aarch64 --no-cleanup-after

The emulation is not fast, but it works.

Anyone can do the above. As a Fedora packager, I can also run (non-interactive but arbitrary) experiments on real hardware.

Another thing that might work for you is Oracle Cloud’s free tier, which now includes an aarch64 instance. I don’t know of anything comparable for ppc64le, though.

musicinmybrain avatar Aug 02 '22 15:08 musicinmybrain

> The emulation is not fast, but it works.

Okay I'll give this a shot. I'm on Ubuntu 22.04 but I'm assuming the commands are easy to translate. I'll post here with what I used if I can replicate, then hopefully I'll be able to isolate if it's an OpenBLAS bug or something at our end.

FYI @musicinmybrain we are going to release 1.1 in the next couple of days. I doubt this tolerance issue will be fixed by then, but we could fix it in the next few weeks and push out a 1.1.1 if it would help with packaging. But at least for now it seems like https://github.com/mne-tools/mne-python/pull/10993 will land in 1.1, so you could also package 1.1 as long as you can tell it to run the tests using the Netlib BLAS on that arch for now.

larsoner avatar Aug 02 '22 15:08 larsoner

> I'm on Ubuntu 22.04 but I'm assuming the commands are easy to translate. I'll post here with what I used if I can replicate, then hopefully I'll be able to isolate if it's an OpenBLAS bug or something at our end.

It’s not convenient for me to test this, but allegedly, on Ubuntu, you can install qemu-efi and then use virt-manager to set up an emulated VM using the appropriate ISO.

musicinmybrain avatar Aug 02 '22 17:08 musicinmybrain

> you can install qemu-efi and then use virt-manager to set up an emulated VM using the appropriate ISO.

I set up a qemu image using ppc64le arch and a pseries CPU (default for that arch), then installed the Fedora 36 ppc64le image, logged in, and did:

$ sudo dnf install -y python3.10 openblas git python3-pip python3-numpy python3-scipy python3-matplotlib python3-threadpoolctl
$ python3.10 -c "import numpy"
Illegal instruction (core dumped)
$ sudo dnf -y install gdb
$ gdb --args python3.10 -c "import numpy; print(numpy)"
... enable debuginfo
(gdb) run
... a lot of downloads

Then:

[Screenshot from 2022-08-15 13-40-05; image not reproduced here]

Maybe this is related to https://bugs.launchpad.net/ubuntu-power-systems/+bug/1920784 -- not sure.

So I'm a bit stuck @musicinmybrain -- is there a different CPU you'd recommend that I perhaps try?

larsoner avatar Aug 15 '22 17:08 larsoner