
We can still see FFTW seg-faults :-(

Open · spxiwh opened this issue · 23 comments

Susannah Green has identified a case where we can still see FFTW segfaults. To reproduce this (only tested at CIT, Linux Rocky 8):

# Create conda env
conda create -n susannah_test python==3.9
conda activate susannah_test
conda install -c pytorch pytorch
conda install pycbc
# Create segfault python script
from pycbc.filter import match 
from pycbc.psd.analytical import aLIGO140MpcT1800545
from pycbc.waveform import get_fd_waveform


LOW_FREQ = 12 
SAMPLE_RATE = 4096 
TLEN = 128
DELTA_F = 1.0 / TLEN
PSD = aLIGO140MpcT1800545(1+TLEN*SAMPLE_RATE//2, delta_f=DELTA_F, low_freq_cutoff=LOW_FREQ) 

template, _ = get_fd_waveform(approximant='IMRPhenomXAS', mass1=30, mass2=30, delta_f=DELTA_F, f_lower=LOW_FREQ)
template.resize(len(PSD))
print('Before match')
snr, Index = match(template, template, psd=PSD, low_frequency_cutoff=18)
print('After match')

The important thing in reproducing this seems to be an MKL-linked BLAS; basically you want to see:

(susannah_test) [ian.harry@ldas-grid 07:58 am tmp_susannah]$ conda list | grep "libblas"
libblas                   3.9.0                     6_mkl    conda-forge

or something similar showing that BLAS is MKL-linked, or else the segfault doesn't happen. Doing conda install pytorch (not using the pytorch channel) seems to pull in something not compiled against MKL, and that works.
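A complementary runtime check (a diagnostic sketch, not part of the reproducer; parsing /proc/self/maps is Linux-specific) is to list the shared objects actually mapped into the process and look for libmkl*/libfftw3* entries:

```python
# Diagnostic sketch (Linux-only): list the shared libraries mapped into
# the current process, to see whether MKL or FFTW is already loaded.
def loaded_libraries():
    names = set()
    with open('/proc/self/maps') as f:
        for line in f:
            if '.so' in line:
                # keep just the file name of the mapped object
                names.add(line.rsplit('/', 1)[-1].strip())
    return sorted(names)

libs = loaded_libraries()
print('mkl loaded: ', any('mkl' in name for name in libs))
print('fftw loaded:', any('fftw' in name for name in libs))
```

Running this after `import numpy` in the broken environment should show MKL already mapped before PyCBC has touched FFTW.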

I note that there is some discussion of this, and of similar issues, on IGWN tickets, which I think are not visible to interested observers outside of the LVK:

https://git.ligo.org/computing/helpdesk/-/issues/3893

https://git.ligo.org/lscsoft/lalsuite/-/issues/300

spxiwh avatar May 15 '23 14:05 spxiwh

I've tried various combinations of os.RTLD_WHATEVER for the libraries loaded by PyCBC using ctypes (mkl, fftw (and threaded fftw) and gomp); nothing seems to matter.

I tried putting the import commands up front so these libraries would be loaded immediately; it didn't make any difference.

I forced the code to use the MKL backend; no segfault, as expected.

The segfault happens when the FFTW IFFT function is called (the plan is generated, though it may not be valid).

spxiwh avatar May 15 '23 15:05 spxiwh

@spxiwh Is it possible this is actually a change in pytorch? Maybe we can pin down what has occurred there if so?

ahnitz avatar May 15 '23 15:05 ahnitz

@spxiwh I was able to reproduce this with a simpler test, but more complex environment. I'm not sure that tells us anything about how different things are conspiring to cause this issue to resurface, but I wanted to note it here.

mamba create -n test_pycbc_with_pytorch torchvision torchaudio pytorch-cuda=11.7 pycbc pandas scikit-learn -c pytorch -c nvidia -c conda-forge

And then the test, which only relies on pycbc:

$ cat test_fft.py 
#!/usr/bin/env python3

import numpy as np
from pycbc.types import complex64, zeros
from pycbc import fft
from pycbc.fft import fftw

fftw._init_threads('unthreaded')

inarr = zeros(64, dtype=complex64)
outarr = zeros(64, dtype=complex64)
fft.fft(inarr, outarr)
print(outarr)

josh-willis avatar May 15 '23 15:05 josh-willis

@josh-willis Does this occur if one avoids importing MKL at all? (I think mkl gets imported, if it is available, as part of the package loading, so I'm wondering whether, if this is stopped, you avoid the segfault here.)
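As a sketch of how one might test that path without editing the package: check whether the mkl Python module is even importable in the environment, without actually importing it (the actual check in pycbc's __init__.py may differ):

```python
import importlib.util

# Sketch: detect whether the optional `mkl` module could be imported,
# without importing it (and so without loading any MKL libraries).
have_mkl = importlib.util.find_spec('mkl') is not None
print('mkl module available:', have_mkl)
```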

ahnitz avatar May 15 '23 15:05 ahnitz

So some things I've investigated, using the test above:

  1. As one might expect, LD_PRELOAD=/path/to/libfftw3f.so ./test_fft.py works, so the issue is definitely related to MKL being loaded first (somehow).
  2. I've made the issue go away by creating my own conda package of fftw, where I compile it with -Wl,-Bsymbolic. But linking with -Bsymbolic has issues of its own, and would require either an upstream change or that we maintain our own conda packaging of fftw (which I'm not enthusiastic about).
  3. I also tried compiling with -fno-gnu-unique, based on a possibly similar Stack Overflow answer I saw, but to no effect.
  4. These facts are puzzling, since from all of my reading, linking with -Bsymbolic should have the same effect as opening with RTLD_DEEPBIND.
  5. I have not seen any change that looks like it could affect this in either ctypes or the Linux dynamic linker. I'm no expert on either of those, but I did look through the former with git blame and at least git grepped the latter.

So at the moment I'm stumped. I suppose it could be an issue with libffi, which ctypes uses after it opens the library. In the man page of dlopen I also see this:

If the same shared object is loaded again with dlopen(), the same object handle is returned.

which makes me wonder: if somehow FFTW were dlopened first, without RTLD_DEEPBIND, by something else, might that negate how we open it? Just guessing at this point.
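That man-page behaviour is easy to confirm from ctypes (a sketch; it assumes a glibc Linux system where libm.so.6 exists): dlopen-ing the same object twice hands back the same refcounted handle, so the flags on a second open cannot take effect.

```python
import ctypes
import os

# Sketch (Linux/glibc assumed): dlopen the same shared object twice with
# different flags; the dynamic loader refcounts the already-loaded copy
# and returns the same handle, so the second open's flags change nothing.
first = ctypes.CDLL('libm.so.6')
second = ctypes.CDLL('libm.so.6', mode=os.RTLD_GLOBAL)

# Both CDLL wrappers hold the same underlying dlopen handle.
print(first._handle == second._handle)
```

By the same logic, if something else dlopens (or link-loads) FFTW before PyCBC does, PyCBC's later RTLD_DEEPBIND open would be a no-op.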

josh-willis avatar May 15 '23 15:05 josh-willis

@josh-willis Does this occur if one avoids importing MKL at all? (I think mkl gets imported, if it is available, as part of the package loading, so I'm wondering whether, if this is stopped, you avoid the segfault here.)

@ahnitz I reproduced it just now by simply commenting out the MKL check in __init__.py at top level; I have not tried anything more sophisticated than that.

josh-willis avatar May 15 '23 15:05 josh-willis

@josh-willis Does this occur if one avoids importing MKL at all? (I think mkl gets imported, if it is available, as part of the package loading, so I'm wondering whether, if this is stopped, you avoid the segfault here.)

@ahnitz I reproduced it just now by simply commenting out the MKL check in __init__.py at top level; I have not tried anything more sophisticated than that.

And I reproduced it again by also removing mkl from fft/backend_cpu.py.

My current guess is that in this environment numpy is linked against MKL, and that is what pulls it in, since numpy will always get loaded first, as pycbc.types has to have it.

But I still don't understand why RTLD_DEEPBIND doesn't override that, since that seems like its express purpose...

josh-willis avatar May 15 '23 16:05 josh-willis

@josh-willis It does seem to be something else loading FFTW, as you suspected. I changed your test to do:

import ctypes,os
  
ctypes.CDLL('/home/ian.harry/.conda/envs/susannah_test/lib/libfftw3.so.3.6.10', mode=os.RTLD_DEEPBIND)
ctypes.CDLL('/home/ian.harry/.conda/envs/susannah_test/lib/libfftw3f.so.3.6.10', mode=os.RTLD_DEEPBIND)

import numpy as np
from pycbc.types import complex64, zeros
from pycbc import fft
from pycbc.fft import fftw

fftw._init_threads('unthreaded')

inarr = zeros(64, dtype=complex64)
outarr = zeros(64, dtype=complex64)
fft.fft(inarr, outarr)
print(outarr)

(So forcing the CDLL loads before anything is imported works. Without these top 6 lines, the code segfaults in the environment constructed as I described above.)

spxiwh avatar May 16 '23 09:05 spxiwh

I think this is LAL, because commenting out some of the import lal commands in types/*.py can make Josh's test run. However, I've not been able to convert this into a sensible reproducible example. It seems that if I import lal in array.py the code segfaults, whereas if I instead first import lal in the script then things seem to work (and sometimes I seem to do the same thing, and what failed before then runs). I'm very confused by this behaviour. I wonder if there's some other code in play?

spxiwh avatar May 16 '23 09:05 spxiwh

So I think I now understand the order and the problem (based on what was said above). The following needs to happen to raise the segfault:

  • We need to be using Intel-compiled BLAS (as already discussed).
  • MKL must be imported before FFTW (importing numpy will achieve this).
  • FFTW must then be imported without using RTLD_DEEPBIND (importing lal will achieve this).
  • PyCBC must not have tried to import any of these things itself using ctypes before numpy and lal are imported.

So the following code fails:

import ctypes,os
  
import numpy as np
import lal as _lal

ctypes.CDLL('/home/ian.harry/.conda/envs/susannah_test/lib/libfftw3.so.3.6.10', mode=os.RTLD_DEEPBIND)
ctypes.CDLL('/home/ian.harry/.conda/envs/susannah_test/lib/libfftw3f.so.3.6.10', mode=os.RTLD_DEEPBIND)

import numpy as np
from pycbc.types import complex64, zeros
from pycbc import fft
from pycbc.fft import fftw

fftw._init_threads('unthreaded')

inarr = zeros(64, dtype=complex64)
outarr = zeros(64, dtype=complex64)
fft.fft(inarr, outarr)
print(outarr)

If you comment out import lal, this works. @josh-willis Any idea how we can get lal to do this import correctly?

spxiwh avatar May 16 '23 09:05 spxiwh

@spxiwh I'm not sure how to enforce this, particularly as it could all be undone if any given user happens to import certain things first for their own reasons. I won't have time to look until later this week, but I am curious whether this solution could be made to work for us. We might have to create a namespace to share among the several related libraries that FFTW needs.

Ian, does this happen for you when lal is linked against mkl, against FFTW, or either?

josh-willis avatar May 16 '23 13:05 josh-willis

@josh-willis Thanks for the pointer. I tried the solution you pointed to, but it didn't solve the problem. (Not 100% sure I did it right, but I think I followed the instructions.)

I've only tested against lal provided via conda, which is linked against FFTW. I would be very surprised if we see a problem if lal is not linked against FFTW.

There have been a number of issues reported because of incompatibilities between lal and MKL-compiled numpy, which I think are exactly this problem (https://git.ligo.org/lscsoft/lalsuite/-/issues/300). So while I agree that one could still produce this error if a user somehow loads FFTW through something else, I think that if there was a way to at least make import lal safe against this error, it would reduce the failure rate significantly. I just have no idea how one would do the equivalent of os.RTLD_DEEPBIND in lal, and get it into things like the conda builds.

spxiwh avatar May 16 '23 14:05 spxiwh

I've only tested against lal provided via conda, which is linked against FFTW. I would be very surprised if we see a problem if lal is not linked against FFTW.

Well, there are two versions of lal provided by conda. If you look at what Duncan mentions further down in the link you provided, it should be possible to install lal linked against MKL by specifying liblal=*=*mkl* when you create the environment. So I'm curious whether that would solve the problem?

josh-willis avatar May 16 '23 19:05 josh-willis

@josh-willis I just tried doing conda install liblal=*=mkl_*. The environment is now working, but I think some things happened that shouldn't have: that command seems to have removed my MKL-compiled libblas and replaced it with:

libblas                   3.9.0           16_linux64_openblas    conda-forge

so this now seems like a somewhat different test to what we had before, but we should now have MKL being imported (now by lal), and nothing importing FFTW until pycbc does.

The full output of the command shows a bunch of things being uninstalled, and the pytorch-channel build of pytorch being removed in favour of conda-forge's. All of this suggests that I may now have an unoptimized pytorch, which I don't think is an expected result of installing lal (!?!)

conda install liblal=*=mkl_*
Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done


==> WARNING: A newer version of conda exists. <==
  current version: 23.1.0
  latest version: 23.3.1

Please update conda by running

    $ conda update -n base -c conda-forge conda

Or to minimize the number of packages updated during conda update use

     conda install conda=23.3.1



## Package Plan ##

  environment location: /home/ian.harry/.conda/envs/susannah_test

  added / updated specs:
    - liblal[build=mkl_*]


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    liblal-7.3.1               |   mkl_h8240297_0         2.1 MB  conda-forge
    python-lal-7.3.1           |mkl_py39hc1cc9e4_0         924 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         3.0 MB

The following NEW packages will be INSTALLED:

  libhwloc           conda-forge/linux-64::libhwloc-2.9.1-hd6dc26d_0 
  libiconv           conda-forge/linux-64::libiconv-1.17-h166bdaf_0 
  libopenblas        conda-forge/linux-64::libopenblas-0.3.21-pthreads_h78a6416_3 
  libprotobuf        conda-forge/linux-64::libprotobuf-3.21.12-h3eb15da_0 
  libxml2            conda-forge/linux-64::libxml2-2.10.4-hfdac1af_0 
  sleef              conda-forge/linux-64::sleef-3.5.1-h9b69904_2 
  tbb                conda-forge/linux-64::tbb-2021.9.0-hf52228f_0 

The following packages will be REMOVED:

  blas-2.106-mkl
  liblapacke-3.9.0-6_mkl

The following packages will be UPDATED:

  libblas                                       3.9.0-6_mkl --> 3.9.0-16_linux64_openblas 
  libcblas                                      3.9.0-6_mkl --> 3.9.0-16_linux64_openblas 
  liblapack                                     3.9.0-6_mkl --> 3.9.0-16_linux64_openblas 
  mkl                                   2020.4-h726a3e6_304 --> 2022.2.1-h84fe81f_16997 

The following packages will be SUPERSEDED by a higher-priority channel:

  pytorch                pytorch::pytorch-2.0.1-py3.9_cpu_0 --> conda-forge::pytorch-2.0.0-cpu_py39he4d1dc0_0 

The following packages will be DOWNGRADED:

  liblal                            7.3.1-fftw_h88093d1_100 --> 7.3.1-mkl_h8240297_0 
  python-lal                    7.3.1-fftw_py39h0bd8420_100 --> 7.3.1-mkl_py39hc1cc9e4_0 


Proceed ([y]/n)? y


Downloading and Extracting Packages
                                                                                
Preparing transaction: done                                                     
Verifying transaction: done
Executing transaction: done

spxiwh avatar May 16 '23 21:05 spxiwh

Susannah also reported that doing conda install liblal=*=mkl_* left her with an environment which was broken due to:

`ImportError: libgsl.so.25: cannot open shared object file: No such file or directory`

though I did not reproduce this.

spxiwh avatar May 16 '23 21:05 spxiwh

One more thing to consider (and apologies if I'm just off track with this).

If I wrap the lal import with:

import sys, os

flags = sys.getdlopenflags()
sys.setdlopenflags(flags | os.RTLD_DEEPBIND)
import lal as _lal
sys.setdlopenflags(flags)

then it seems that this loads FFTW using DEEPBIND as we want, and the test code works. I don't know if there are any other unintended, or undesired, consequences to loading lal in this way, but one could have these lines in the SWIG-produced lal.py file, which itself just imports _lal.so.
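For what it's worth, the same wrap can be packaged so the flag change cannot leak if the import raises (a sketch; os.RTLD_DEEPBIND is glibc-specific, and `lal` is replaced here by a stand-in stdlib import):

```python
import contextlib
import os
import sys

@contextlib.contextmanager
def deepbind():
    """Temporarily add RTLD_DEEPBIND to the dlopen flags (glibc-only)."""
    flags = sys.getdlopenflags()
    sys.setdlopenflags(flags | os.RTLD_DEEPBIND)
    try:
        yield
    finally:
        # restore the original flags even if the import fails
        sys.setdlopenflags(flags)

with deepbind():
    import json  # stand-in for `import lal as _lal`
```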

spxiwh avatar May 16 '23 21:05 spxiwh

@spxiwh So I managed to create a conda environment where there is no segfault for either my test above or for what I understand Susannah's test to be in internal LVK ticket 3893. Assuming that Susannah's test in that ticket is also what failed with a missing libgsl error, I do not see that either.

Since I was trying to imitate Susannah's environment, what I did was:

mamba create -n test_lal_mkl_with_pycbc torchvision torchaudio pytorch-cuda=11.7 pycbc pandas scikit-learn liblal=*=*mkl*  -c pytorch -c nvidia -c conda-forge

Thus it's much bigger than a minimal reproducing case, but I wanted to make sure that things also worked when pytorch was installed. Two differences from what you mention above:

  1. My syntax for pinning lal is liblal=*=*mkl*, which is not the same as what you reported above.
  2. I was also creating an environment from scratch, rather than starting from a broken environment and trying to reinstall the MKL version of liblal.

Maybe @duncanmmacleod can comment on which of those differences was likely to matter, and if we should expect to see the problems you did when you updated liblal to the MKL build in an existing environment, rather than specifying it when you created a new environment?

josh-willis avatar May 17 '23 21:05 josh-willis

So my comment on the lalsuite issue is as far down the rabbit hole as I have gone (and am likely to go). The issue is to do with the loading of symbols from MKL into the global namespace that have the same names as the ones LAL is expecting from FFTW, but that operate in a different way, hence the segfault.

As I commented on that thread, I believe that this error should not occur when using the build of liblal that links directly against MKL (and not FFTW).

How the environment is constructed (new vs. updated) should not have any bearing; the only important thing is the list of packages you end up with. If someone has a reproducible failing example, please make sure to include the full output of conda list --show-channel-urls on the relevant environment.

duncanmmacleod avatar May 18 '23 09:05 duncanmmacleod

@duncanmmacleod This issue is a little more subtle than that; we had seen the issue some time ago when only MKL was installed, and had a fix in place for that. The new problem requires not just MKL in the environment, but also something else (loaded before PyCBC loads FFTW) that is itself linked against FFTW. I believe I have a fix that will solve this problem in many cases, though there is still an issue in a conda environment that uses the _llvm build of _openmp_mutex.

@spxiwh When you have time, can you post in this ticket the conda command that you used to create your environment (the one in which you then tried to update the liblal package to specifically use MKL, and saw problems)? I would like to see if I can reproduce this.

josh-willis avatar May 19 '23 16:05 josh-willis

So as an update to some testing I have done, the minimal reproducing conda environment that I have created that shows these problems can be made with:

mamba create -n pycbc_mkl_blas -c conda-forge pycbc mkl blas=*=*mkl blas-devel=*=*_mkl libblas=*=*_mkl libcblas=*=*_mkl liblapack=*=*_mkl

If instead I create an environment with just pycbc and mkl, or just pycbc and _openmp_mutex=*=*_llvm, then I do not see the problem. I haven't tried installing a mix of MKL and OpenBLAS versions of the various blas, libblas, libcblas, and liblapack packages; that seems like a recipe for inconsistency.

But I could believe that what is happening is that when the linear algebra libraries are linked against MKL, then so is numpy. Since numpy gets loaded first in pycbc, it loads the (MKL versions of the) fftw symbols ahead of anything else. But I still don't understand why opening the fftw library itself with DEEPBIND has stopped working.

josh-willis avatar Jun 05 '23 21:06 josh-willis

So a further update. I can make this problem go away in a conda environment with the MKL build of blas if I also specify the MKL build of liblal. So to summarize, I think the necessary conditions for at least this bug to appear (all tested with conda, not any other package manager) are:

  1. In an environment with PyCBC, also install MKL
  2. Also in that environment, install the MKL build variants of libblas/libcblas (which appears to also force the MKL build of liblapack and liblapacke).
  3. Finally, in that environment, install the FFTW build of liblal.

In particular, at least with our use of RTLD_DEEPBIND, it's not enough to have just MKL installed, nor even the MKL variant of BLAS; we must also not have the MKL variant of LAL. I'm not sure, however, how best to capture this: does it belong in PyCBC's dependencies? LAL's? I'll ask @duncanmmacleod and @ahnitz for feedback.

josh-willis avatar Jun 05 '23 22:06 josh-willis

@josh-willis I'm betting this can be reproduced without really using PyCBC, but just by doing the basic "get fftw from ctypes and run it". If that's the case, it's worth understanding what lalsuite is doing and whether it is something unusual or not. Given the behaviour, though, it seems like a fix is potentially needed on the lalsuite side.

ahnitz avatar Jun 05 '23 22:06 ahnitz

@ahnitz So my guess is the following. RTLD_DEEPBIND is supposed to ensure that when we load a library, it tries to resolve any missing symbols within that library first, rather than the global namespace of the executable. That should have fixed the issue where MKL defines some of the same symbols as FFTW.

However, I think (I don't have the reference handy) that the dynamic loader only loads any given library once, and then uses a refcounting system: anything that tries to open it again is pointed back to what's already opened. So I think that by having the FFTW variant of LAL, we're getting the FFTW library loaded first, and because LAL does not use the DEEPBIND method (I don't think it really can, since it doesn't call dlopen directly) it does get the namespace pollution from MKL.

Which would imply that if anyone installs some other package depending on FFTW in an environment in which they also install PyCBC, and BLAS comes from MKL, we might see the same issue. And I'm not convinced that this is the cause of the CI failures that also came up recently, since those seem to require multi-threaded FFTs.

josh-willis avatar Jun 05 '23 23:06 josh-willis