boost_adaptbx: method `import_ext` causing segmentation fault

Open dermen opened this issue 1 year ago • 4 comments

TL;DR: sys.setdlopenflags(0x100|0x2) in import_ext causes a segfault due to mysterious boost + mpi4py + Eigen interactions.

This is quite involved, and I have a patch for it, but I would like to get to the bottom of what's going on.

Assume one has their own boost extension module tester_ext.cpp:

#include <Eigen/Dense>
#include <Eigen/StdVector>
#include <boost/python.hpp>
#include <mpi4py/mpi4py.h>
#include <cstdio>

typedef Eigen::Matrix<double,3,1> vec3;
typedef std::vector<vec3,Eigen::aligned_allocator<vec3> > eigVec3_vec;

int test(bool fix_segfault){

    vec3 vec(1,1,1);
    eigVec3_vec vecs;
    // pre-allocate storage so push_back does not have to reallocate
    if (fix_segfault)
        vecs.reserve(1);
    vecs.push_back(vec);
    printf("OK\n");
    return 1;
}


BOOST_PYTHON_MODULE(tester_ext)
{
  // initialize the mpi4py C API before exposing anything
  if (import_mpi4py() < 0) return;
  boost::python::def("run_test", test);
}

Let's also assume that one wishes to use an existing extension module from cctbx (another_ext), whose source code is given by another.cpp:

#include <Eigen/Dense>
#include <Eigen/StdVector>

class another {
  another();
  ~another() {}
};

another::another() {
    std::vector<Eigen::Vector3d, Eigen::aligned_allocator<Eigen::Vector3d> > vecs;
    Eigen::Vector3d vec(0,0,0);
    vecs.push_back(vec);
}

and whose extension wrapper is another_ext.cpp:

#include <boost/python.hpp>
#include <cstdio>

BOOST_PYTHON_MODULE(another_ext)
{
  printf("import another\n");
}

After everything is built, running the following Python script with the --makeSegfault flag triggers the segfault:

import sys
from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument("--makeSegfault", action="store_true")
parser.add_argument("--fixSegfault", action="store_true")
args = parser.parse_args()

import boost_adaptbx.boost.python as bp
if args.makeSegfault:
    bp.import_ext("another_ext")
    import tester_ext
else:  # switching the import order avoids the segfault; I don't know why
    import tester_ext
    bp.import_ext("another_ext")

tester_ext.run_test(args.fixSegfault)

This issue appears to be platform dependent: I've tested it at NERSC, and it segfaults on Cori GPU but not on Perlmutter. Replacing bp.import_ext("another_ext") with import another_ext does not trigger the segfault, regardless of the --makeSegfault flag. Commenting out the line if (import_mpi4py() < 0) return; in tester_ext.cpp also prevents the segfault, regardless of the --makeSegfault flag, as does commenting out the line vecs.push_back(vec) in another.cpp. Lastly (see the build script below), if one leaves out another.o during the linking step that writes another_ext.so, the segfault is not triggered.
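Since the outcome depends on several switches, it can help to run each variant in its own child process and check how it died. The following is only a sketch, not part of the original report: run_and_classify and the filename repro.py (standing in for the Python script above) are hypothetical names. It relies on the POSIX convention that subprocess reports a child killed by signal N as return code -N, and it passes -X faulthandler so that CPython prints a traceback when the crash happens.

```python
import signal
import subprocess
import sys


def run_and_classify(argv):
    """Run a command and report whether it exited normally or died from SIGSEGV."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # On POSIX, a negative return code -N means the child was killed by signal N.
    if proc.returncode == -signal.SIGSEGV:
        return "segfault"
    return "ok" if proc.returncode == 0 else "exit %d" % proc.returncode


if __name__ == "__main__":
    # repro.py is a hypothetical filename for the reproduction script.
    for flags in ([], ["--makeSegfault"], ["--makeSegfault", "--fixSegfault"]):
        cmd = [sys.executable, "-X", "faulthandler", "repro.py"] + flags
        print(flags, run_and_classify(cmd))
```

Running each case in a fresh interpreter matters here because the import order and the dlopen flags only take effect the first time a shared object is loaded into a process.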

Example build script for python3.8:

#!/bin/bash

CPRE=/path/to/cctbx/conda_base
CCTBX_MOD=/path/to/cctbx/modules

EIG_INC=-I${CCTBX_MOD}/eigen
CONDA_INC=-I${CPRE}/include
PY_INC=-I${CPRE}/include/python3.8
MPI4PY_INC=$(libtbx.python -c "import mpi4py;print('-I'+mpi4py.get_include())")

CONDA_LIB=-L${CPRE}/lib

g++ -c another.cpp  $EIG_INC $CONDA_INC -lboost_python38 -lboost_system  -lboost_numpy38  -lstdc++ -fPIC -O3   -o another.o  

g++ -c another_ext.cpp  $EIG_INC $CONDA_INC  -lboost_python38 -lboost_system  -lboost_numpy38  -lstdc++ -fPIC -O3   -o another_ext.o  

g++ -shared another_ext.o another.o $CONDA_LIB  -lboost_numpy38  -lboost_python38 -o another_ext.so

mpic++ -c tester_ext.cpp  $EIG_INC $CONDA_INC $PY_INC $MPI4PY_INC  -lboost_python38 -lboost_system  -lboost_numpy38  -lstdc++ -fPIC -O3   -o tester_ext.o  

mpic++ -shared tester_ext.o $CONDA_LIB -lboost_numpy38  -lboost_python38 -o tester_ext.so

dermen avatar Oct 04 '22 03:10 dermen

Note: the relevant line in boost_adaptbx is sys.setdlopenflags(0x100|0x2); commenting out that line prevents the segfault.
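For readers unfamiliar with the raw numbers, the flag value can be decoded against the RTLD_* constants that Python's os module exposes on Unix. This is a small illustrative sketch; describe_dlopen_flags is a hypothetical helper, not something from boost_adaptbx. On glibc-based Linux, where RTLD_GLOBAL is 0x100 and RTLD_NOW is 0x2, the value 0x100|0x2 corresponds to RTLD_GLOBAL | RTLD_NOW; the numeric values differ on other platforms.

```python
import os


def describe_dlopen_flags(flags):
    """Return the names of the os.RTLD_* constants whose bits are all set in flags."""
    names = []
    for name in dir(os):
        if not name.startswith("RTLD_"):
            continue
        value = getattr(os, name)
        # Skip zero-valued constants (e.g. RTLD_LOCAL on Linux): they set no bits.
        if value and flags & value == value:
            names.append(name)
    return sorted(names)


if __name__ == "__main__":
    # The literal used in the issue; its meaning is platform specific.
    print(describe_dlopen_flags(0x100 | 0x2))
```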

dermen avatar Oct 04 '22 04:10 dermen

What happens if bp.import_ext is used to import tester_ext?

bkpoon avatar Oct 04 '22 07:10 bkpoon

I had a recent SEGV issue in BOOST_PYTHON_MODULE that I tracked down to a compiler version bug, though this looks different.

FWIW I've never liked sys.setdlopenflags(0x100|0x2). I think it's loading things into RTLD_GLOBAL to compensate for the fact that the build scripts mostly don't link with e.g. -lboost_python38 (which you are doing here, so it is redundant); it shouldn't be necessary. I vaguely recall reading an early boost.python thread where RWGK realised this, but it was too late to change in cctbx.
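One way to test whether the global symbol scope is what matters would be to import the extension under explicitly chosen flags and restore the previous flags afterwards. The context manager below is a hypothetical sketch (dlopen_flags is not a boost_adaptbx function, and this is not the patch mentioned in the report); it only uses sys.getdlopenflags and sys.setdlopenflags, which are available on Unix.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def dlopen_flags(flags):
    """Temporarily set the flags CPython passes to dlopen() for extension imports."""
    previous = sys.getdlopenflags()
    sys.setdlopenflags(flags)
    try:
        yield
    finally:
        # Always restore the previous flags, even if the import raises.
        sys.setdlopenflags(previous)


if __name__ == "__main__":
    # Hypothetical experiment: load the extension with local symbol scope
    # instead of RTLD_GLOBAL.
    with dlopen_flags(os.RTLD_LOCAL | os.RTLD_NOW):
        try:
            import another_ext  # module from the report; may not be installed
        except ImportError:
            print("another_ext is not available in this environment")
```

The flags only influence shared objects loaded while the context is active; modules that are already imported are unaffected.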

That said, I'm not 100% sure it's the same issue but I've also seen several bugs arise from linking both -lpython3.8. I don't think you are supposed to link to libpython (e.g. the manylinux instructions), because any python interpreter that is loading your dyld will already have it loaded, and it can cause problems if - as in conda, for instance - the python you are using doesn't have a libpython (conda IIRC builds it statically) and so linking to it can cause the system libpython to be picked up instead. Obviously linking to multiple python symbol sets (even ~ the same version) is a recipe for problems, and we've definitely accidentally run into this a couple of times.

In fact, if it is this problem, the RTLD_GLOBAL flag possibly makes sense, because it could be clobbering the already-global symbols that the running interpreter is using?

ndevenish avatar Oct 04 '22 09:10 ndevenish

"What happens if bp.import_ext is used to import tester_ext?"

@bkpoon, I still get the segfault.

"That said, I'm not 100% sure it's the same issue but I've also seen several bugs arise from linking both -lpython3.8"

@ndevenish Thanks for the tip, I was unaware of this! But if I remove the -lpython3.8 flags and rebuild, I can still generate the segfault. I have removed the flags from the example build script.

dermen avatar Oct 04 '22 14:10 dermen