Undefined references with CudaRDCUtils and vecgeom

Open drbenmorgan opened this issue 1 year ago • 1 comments

Whilst we've known for a while that linking VecGeom requires --no-as-needed to be passed explicitly to the linker on platforms that enable as-needed by default (e.g. Debian, Ubuntu), I think there's a more general issue/bug in the link/object structure created by CudaRDCUtils. AFAICT, the problem comes from as-needed making the order of libraries on the link line significant.

If we build Celeritas with LDFLAGS=-Wl,--as-needed on Alma9, then we get errors like the following (stripped down to highlight the causes):

[795/970] Linking CXX executable test/celeritas/celeritas_user_Diagnostic
FAILED: test/celeritas/celeritas_user_Diagnostic 
: && /usr/bin/c++ -Wall -Wextra -pedantic -fdiagnostics-color=always -O3 -DNDEBUG -Wl,--as-needed   
... 
/.../.spack-env/view/lib64/libvecgeomcuda.so  
/.../.spack-env/view/lib64/libvecgeom.a 
...
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeom.a(UnplacedCone.cpp.o): undefined reference to symbol '_ZNK7vecgeom3cxx9DevicePtrINS_4cuda13SUnplacedConeINS2_9ConeTypes13UniversalConeEEEE9ConstructIJdddddddEEEvDpT_'
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeomcuda.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

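The problem is that libvecgeom.a (or libvecgeom.so, were that used instead) needs symbols defined in libvecgeomcuda.so, but because libvecgeomcuda.so appears before libvecgeom on the link line, nothing has referenced its symbols yet when the linker reaches it, so with as-needed it gets dropped. That this is the case can be shown by modifying the link line to: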

[795/970] Linking CXX executable test/celeritas/celeritas_user_Diagnostic
FAILED: test/celeritas/celeritas_user_Diagnostic 
: && /usr/bin/c++ -Wall -Wextra -pedantic -fdiagnostics-color=always -O3 -DNDEBUG -Wl,--as-needed   
... 
/.../.spack-env/view/lib64/libvecgeomcuda.so  
/.../.spack-env/view/lib64/libvecgeom.a 
**/.../.spack-env/view/lib64/libvecgeomcuda.so**

...
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeom.a(UnplacedCone.cpp.o): undefined reference to symbol '_ZNK7vecgeom3cxx9DevicePtrINS_4cuda13SUnplacedConeINS2_9ConeTypes13UniversalConeEEEE9ConstructIJdddddddEEEvDpT_'
/usr/bin/ld: /.../.spack-env/view/lib64/libvecgeomcuda.so: error adding symbols: DSO missing from command line
collect2: error: ld returned 1 exit status

which will resolve the vecgeom error but create a whole new set:

/usr/bin/ld: lib64/libceleritas.so: undefined reference to `__cudaRegisterLinkedBinary_5cf4974c_31_AlongStepGeneralLinearAction_cu_f9b8e781'

In this case it's because the link line lists libceleritas_final.so before libceleritas.so, and the latter needs symbols from the former. Adding libceleritas_final.so after libceleritas.so fixes the link error, illustrating that it's a general problem with the RDC link structure:

  • The "middle" library needs symbols from the "final" library but does not depend on it
  • The "final" library needs symbols from the "middle" library and depends on it
  • There's therefore a circular dependence, with the way it's currently resolved resulting in a link order that isn't compatible with linkers using as-needed.
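To make the ordering argument concrete, here is a toy left-to-right model of symbol resolution, a rough sketch rather than what ld actually does. It ignores DT_NEEDED traversal, weak symbols, per-member archive extraction and --start-group, and every symbol name in it (main, vg_cone, cuda_construct, celeritas_api, cuda_register) is a made-up stand-in for the real mangled names. The assumption is only that a shared library seen under as-needed is dropped unless it defines something already undefined at that point in the line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkInput:
    name: str
    kind: str  # "object", "archive", or "shared"
    defines: frozenset = frozenset()
    needs: frozenset = frozenset()


def unresolved_symbols(link_line, as_needed):
    """Return symbols still undefined after one left-to-right pass (toy model)."""
    defined = set()
    undefined = set()
    for item in link_line:
        wanted = item.defines & undefined
        # An archive is only pulled in if it satisfies a pending reference.
        if item.kind == "archive" and not wanted:
            continue
        # Under as-needed, a shared library that satisfies nothing yet is dropped.
        if item.kind == "shared" and as_needed and not wanted:
            continue
        defined |= item.defines
        undefined -= item.defines
        undefined |= item.needs - defined
    return undefined


# Hypothetical stand-ins for the inputs discussed in this issue.
main_vg = LinkInput("main.o", "object", frozenset({"main"}), frozenset({"vg_cone"}))
vecgeomcuda = LinkInput("libvecgeomcuda.so", "shared", frozenset({"cuda_construct"}))
vecgeom = LinkInput("libvecgeom.a", "archive",
                    frozenset({"vg_cone"}), frozenset({"cuda_construct"}))

main_cel = LinkInput("main.o", "object", frozenset({"main"}), frozenset({"celeritas_api"}))
middle = LinkInput("libceleritas.so", "shared",
                   frozenset({"celeritas_api"}), frozenset({"cuda_register"}))
final = LinkInput("libceleritas_final.so", "shared",
                  frozenset({"cuda_register"}), frozenset({"celeritas_api"}))

print(unresolved_symbols([main_vg, vecgeomcuda, vecgeom], as_needed=True))
print(unresolved_symbols([main_vg, vecgeomcuda, vecgeom, vecgeomcuda], as_needed=True))
print(unresolved_symbols([main_cel, final, middle], as_needed=True))
print(unresolved_symbols([main_cel, middle, final], as_needed=True))
```

In this model the first and third link lines leave cuda_construct and cuda_register unresolved respectively, while repeating libvecgeomcuda.so after libvecgeom.a, or listing "final" after "middle", leaves nothing unresolved. With as_needed=False the original orders resolve too, which matches the --no-as-needed workaround.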

We should fix this in CudaRDCUtils, though I'm not exactly sure how right now, so comments and discussion are welcome.

drbenmorgan • Mar 18 '24 17:03

@drbenmorgan I assume this was fixed some time ago by https://gitlab.cern.ch/VecGeom/VecGeom/-/merge_requests/1296 ?

sethrj • Sep 26 '25 17:09