cctbx_project
boost_adaptbx: method `import_ext` causing segmentation fault
TL;DR: sys.setdlopenflags(0x100|0x2) in import_ext is causing a segfault due to mysterious boost + mpi4py + Eigen interactions.
This is quite involved, and I have a patch for it, but I would like to get to the bottom of what's going on.
Assume one has their own boost extension module, tester_ext.cpp:
#include <Eigen/Dense>
#include <Eigen/StdVector>
#include <boost/python.hpp>
#include <mpi4py/mpi4py.h>

typedef Eigen::Matrix<double,3,1> vec3;
typedef std::vector<vec3,Eigen::aligned_allocator<vec3> > eigVec3_vec;

int test(bool fix_segfault){
    vec3 vec(1,1,1);
    eigVec3_vec vecs;
    if (fix_segfault)
        vecs.reserve(1);  // reserving before push_back avoids the crash
    vecs.push_back(vec);
    printf("OK\n");
    return 1;
}

BOOST_PYTHON_MODULE(tester_ext)
{
    if (import_mpi4py() < 0) return;
    boost::python::def("run_test", test);
}
Let's also assume that one wishes to use an existing extension module from cctbx (another_ext), whose source code is given by another.cpp:
#include <Eigen/Dense>
#include <Eigen/StdVector>

class another{
    another();
    ~another(){};
};

another::another(){
    std::vector<Eigen::Vector3d,Eigen::aligned_allocator<Eigen::Vector3d> > vecs;
    Eigen::Vector3d vec(0,0,0);
    vecs.push_back(vec);
}
and whose extension wrapper is another_ext.cpp:
#include <cstdio>
#include <boost/python.hpp>

BOOST_PYTHON_MODULE(another_ext)
{
    printf("import another\n");
}
After everything is built, running the following Python script with the flag --makeSegfault triggers the segfault:
import sys
from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument("--makeSegfault", action="store_true")
parser.add_argument("--fixSegfault", action="store_true")
args = parser.parse_args()

import boost_adaptbx.boost.python as bp

if args.makeSegfault:
    bp.import_ext("another_ext")
    import tester_ext
else:  # switching the import order avoids the segfault; don't know why
    import tester_ext
    bp.import_ext("another_ext")

tester_ext.run_test(args.fixSegfault)
This issue appears to be platform dependent. I've tested it at NERSC: it segfaults on Cori GPU, but not on Perlmutter. Note that replacing bp.import_ext("another_ext") with import another_ext does not trigger the segfault, regardless of the --makeSegfault flag. Also, commenting out the line if (import_mpi4py() < 0) return; prevents the segfault, regardless of the --makeSegfault flag. Alternatively, if one comments out the line vecs.push_back(vec) in another.cpp, the segfault is avoided. Lastly (see the build script below), if one leaves out another.o during the linking step that writes another_ext.so, the segfault won't be triggered.
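Since the behaviour depends on the platform and on the flag combination, a small driver can make the comparison repeatable. The following is only a sketch: it assumes the script above was saved as repro.py (a hypothetical file name), and it uses sys.executable where a cctbx build would more likely use libtbx.python. On POSIX, a child killed by a signal reports a negative return code, so a segfault shows up as SIGSEGV. The -X faulthandler option asks the child interpreter to dump a Python traceback when it crashes.

```python
import os
import signal
import subprocess
import sys

SCRIPT = "repro.py"  # hypothetical file name for the reproduction script


def describe(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        return "killed by " + signal.Signals(-returncode).name
    return "exit code %d" % returncode


def run_case(flags):
    """Run the reproduction script once with the given command-line flags."""
    cmd = [sys.executable, "-X", "faulthandler", SCRIPT] + list(flags)
    return subprocess.run(cmd).returncode


if __name__ == "__main__" and os.path.exists(SCRIPT):
    for flags in ([], ["--makeSegfault"], ["--makeSegfault", "--fixSegfault"]):
        print(flags, "->", describe(run_case(flags)))
```

Running the same driver on each machine gives one line per flag combination, which is easier to compare than reading raw shell output.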
Example build script for python3.8:
#!/bin/bash
CPRE=/path/to/cctbx/conda_base
CCTBX_MOD=/path/to/cctbx/modules
EIG_INC=-I${CCTBX_MOD}/eigen
CONDA_INC=-I${CPRE}/include
PY_INC=-I${CPRE}/include/python3.8
MPI4PY_INC=$(libtbx.python -c "import mpi4py;print('-I'+mpi4py.get_include())")
CONDA_LIB=-L${CPRE}/lib
g++ -c another.cpp $EIG_INC $CONDA_INC -lboost_python38 -lboost_system -lboost_numpy38 -lstdc++ -fPIC -O3 -o another.o
g++ -c another_ext.cpp $EIG_INC $CONDA_INC -lboost_python38 -lboost_system -lboost_numpy38 -lstdc++ -fPIC -O3 -o another_ext.o
g++ -shared another_ext.o another.o $CONDA_LIB -lboost_numpy38 -lboost_python38 -o another_ext.so
mpic++ -c tester_ext.cpp $EIG_INC $CONDA_INC $PY_INC $MPI4PY_INC -lboost_python38 -lboost_system -lboost_numpy38 -lstdc++ -fPIC -O3 -o tester_ext.o
mpic++ -shared tester_ext.o $CONDA_LIB -lboost_numpy38 -lboost_python38 -o tester_ext.so
Note: the relevant line in boost_adaptbx is sys.setdlopenflags(0x100|0x2); commenting out that line prevents the segfault.
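For readers unfamiliar with the magic numbers: on Linux with glibc, 0x100 is RTLD_GLOBAL and 0x2 is RTLD_NOW, and Python exposes both as constants in the os module. The sketch below shows one way to apply such flags to a single import and then restore the previous value, so that later imports are not affected. The dlopen_flags helper is hypothetical and is not part of boost_adaptbx; whether scoping the flags this way avoids the crash described here is untested.

```python
import os
import sys
from contextlib import contextmanager

# Symbolic spelling of the flags; equals 0x100|0x2 on Linux/glibc.
# The numeric values differ on other platforms (e.g. macOS).
GLOBAL_NOW = os.RTLD_GLOBAL | os.RTLD_NOW


@contextmanager
def dlopen_flags(flags):
    """Temporarily change the flags Python passes to dlopen() on import."""
    previous = sys.getdlopenflags()
    sys.setdlopenflags(flags)
    try:
        yield
    finally:
        # Restore the old flags even if the import raises.
        sys.setdlopenflags(previous)
```

A caller would wrap only the import that needs global symbol visibility, for example `with dlopen_flags(GLOBAL_NOW): import some_ext`, where some_ext stands in for any extension module.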
What happens if bp.import_ext is used to import tester_ext?
I had a recent SEGV issue in BOOST_PYTHON_MODULE that I tracked down to a compiler version bug, though this looks different.
FWIW I've never liked sys.setdlopenflags(0x100|0x2). I think it loads things with RTLD_GLOBAL to compensate for the fact that most of the build scripts don't link with e.g. -lboost_python38 (which you are doing here, so it is redundant); it shouldn't be necessary. I vaguely recall reading an early boost.python thread where RWGK realised this, but it was too late to change in cctbx.
That said, I'm not 100% sure it's the same issue, but I've also seen several bugs arise from linking against -lpython3.8. I don't think you are supposed to link to libpython (see e.g. the manylinux instructions), because any Python interpreter that loads your shared library will already have it loaded. It can also cause problems if, as in conda for instance, the Python you are using doesn't have a libpython (conda IIRC builds it statically), so linking to it can cause the system libpython to be picked up instead. Obviously, linking to multiple Python symbol sets (even of roughly the same version) is a recipe for problems, and we've definitely run into this by accident a couple of times.
In fact, if it is this problem, the RTLD_GLOBAL flag possibly makes sense, because it could be clobbering the already-global symbols that the running interpreter is using?
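One way to check the duplicate-library hypothesis on Linux is to list which copies of the relevant shared objects are actually mapped into the running interpreter after the imports. The sketch below parses /proc/self/maps for that purpose. The function name and the default list of library name fragments are illustrative assumptions, not part of cctbx, and the approach only works where /proc is available.

```python
import os


def loaded_libraries(maps_text, needles=("libpython", "libboost_python", "libmpi")):
    """Return sorted paths of mapped files whose basename contains a needle.

    maps_text is the content of /proc/<pid>/maps; each line has the form
    'address perms offset dev inode pathname', with pathname optional.
    """
    paths = set()
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if any(n in os.path.basename(path) for n in needles):
            paths.add(path)
    return sorted(paths)


if __name__ == "__main__" and os.path.exists("/proc/self/maps"):
    # Import the extension modules under investigation before this point.
    with open("/proc/self/maps") as f:
        for path in loaded_libraries(f.read()):
            print(path)
```

If two different libpython paths show up in the output (for example one from conda and one from the system), that would support the explanation above; a single path, or none for a statically linked interpreter, would argue against it.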
What happens if bp.import_ext is used to import tester_ext?
@bkpoon, I still get the segfault.
That said, I'm not 100% sure it's the same issue, but I've also seen several bugs arise from linking against -lpython3.8
@ndevenish Thanks for the tip, I was unaware of this! But if I remove the -lpython3.8 flags and rebuild, I can still generate the segfault. I have removed the flags from the example build script.