
cuda::cuda_driver not linked to aluminum

Open azrael417 opened this issue 5 years ago • 4 comments

When I build lbann via Spack, the app is not linked with -lcuda, only with the CUDA runtime. Checking the SetupCUDAToolkit file shows that it does not add cuda::cuda_driver as a link target, just the toolkit library. However, both should be added to avoid some undefined references in libAl.so.

azrael417 avatar Jul 09 '20 22:07 azrael417

Thanks for pointing this out; it's curious we haven't hit this internally. I'll look into it and patch.

benson31 avatar Jul 09 '20 22:07 benson31

I was thinking that cuda::toolkit should add cuda::cuda_driver, but maybe it does not anymore since the driver lib is not always required. But aluminum needs it for cuStreamWriteValue32 and 3 other functions.
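One way to check which driver API symbols a library leaves unresolved is to list its undefined dynamic symbols and keep only the `cu*` names. This is a minimal sketch, assuming GNU binutils `nm` is available; `filter_driver_symbols` is a hypothetical helper name, and the library path in the usage comment is a placeholder.

```shell
# Hypothetical helper: reduce `nm -D --undefined-only` output to CUDA driver
# API symbols, i.e. undefined names starting with "cu" plus an uppercase
# letter. Runtime API names such as cudaMalloc are deliberately not matched.
filter_driver_symbols() {
    LC_ALL=C awk '$1 == "U" && $2 ~ /^cu[A-Z]/ { print $2 }'
}

# Usage (placeholder path):
#   nm -D --undefined-only /path/to/libAl.so | filter_driver_symbols
```

If this prints anything, whatever links against that library also needs libcuda on its link line.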

azrael417 avatar Jul 10 '20 15:07 azrael417

Actually the matter is more complicated: when I build lbann interactively inside a running docker container, it works fine. When I use a Dockerfile that issues the exact same command, the build fails at link time with undefined references to libcuda symbols:

  >> 699    /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuMemHostGetDevicePointer_v2'
  >> 700    /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuStreamWriteValue32'
  >> 701    /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuGetErrorString'
  >> 702    /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuStreamWaitValue32'
  >> 703    collect2: error: ld returned 1 exit status
     704    ninja: build stopped: subcommand failed.

See build log for details:

I am mapping the Nvidia driver into the container, so the driver library should be available in both cases. Can you grab a regular ubuntu 18.04 image and try building lbann with:

spack install --verbose [email protected] +gpu +nccl ^[email protected]

It should crash with the above error.

azrael417 avatar Jul 10 '20 16:07 azrael417

Ok, I found it: I need to link libcuda.so from the stubs directory into some directory where the linker can find it. I was expecting the build to barf on a missing -lcuda, but I think cmake is smarter than that: if it does not find libcuda during findcuda, it won't even add it to the link line, which causes the undefined references. I think this bug can be closed then.
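A minimal sketch of that workaround, assuming the toolkit ships its stub at `<cuda root>/lib64/stubs/libcuda.so`. The helper name `link_cuda_stub` and both paths in the usage comment are assumptions about the image, not something taken from the lbann or Spack sources.

```shell
# Hypothetical helper: expose the libcuda.so stub to the linker via a symlink.
#   $1 = CUDA toolkit root (assumed layout: <root>/lib64/stubs/libcuda.so)
#   $2 = a directory the linker already searches
link_cuda_stub() {
    cuda_root=$1
    link_dir=$2
    stub="$cuda_root/lib64/stubs/libcuda.so"
    if [ ! -e "$stub" ]; then
        echo "no libcuda.so stub under $cuda_root" >&2
        return 1
    fi
    # Create the target directory if needed, then (re)create the symlink.
    mkdir -p "$link_dir" && ln -sf "$stub" "$link_dir/libcuda.so"
}

# Usage in a Dockerfile RUN step before the spack install (assumed paths):
#   link_cuda_stub /usr/local/cuda /usr/local/lib
```

The stub only satisfies the linker at build time; at run time the real libcuda.so from the mapped driver is still the one that gets loaded.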

azrael417 avatar Jul 10 '20 16:07 azrael417