cuda::cuda_driver not linked to aluminum
When I build lbann via Spack, the app is linked only against the CUDA runtime, not with -lcuda. The SetupCUDAToolkit file does not add cuda::cuda_driver as a link target, only the toolkit library. Both should be added to avoid undefined references in libAl.so.
Thanks for pointing this out; it's curious we haven't hit this internally. I'll look into it and patch.
I was thinking that cuda::toolkit should add cuda::cuda_driver, but maybe it no longer does since the driver lib is not always required. Aluminum needs it, though, for cuStreamWriteValue32 and three other driver API functions.
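To see which driver API symbols a library expects, the dynamic symbol table can be filtered for undefined `cu*` names. A minimal sketch, assuming GNU `nm` output of the form `U symbol`; the function name `driver_symbols` is made up for illustration:

```shell
# Hypothetical filter: reads `nm -D --undefined-only` output on stdin and
# prints only CUDA driver API symbols (cuXxx), skipping runtime ones (cudaXxx).
driver_symbols() {
    awk '$1 == "U" && $2 ~ /^cu[A-Z]/ { print $2 }'
}
```

Usage would be something like `nm -D --undefined-only libAl.so | driver_symbols`. Any names it prints have to be resolved by libcuda at link time.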
Actually, the matter is more complicated: when I build lbann interactively inside a Docker container, it works fine. When I use a Dockerfile that issues the exact same command, the build fails with undefined references to libcuda symbols:
>> 699 /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuMemHostGetDevicePointer_v2'
>> 700 /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuStreamWriteValue32'
>> 701 /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuGetErrorString'
>> 702 /opt/spack/opt/spack/linux-ubuntu18.04-broadwell/gcc-7.5.0/aluminum-0.3.3-ewvf5pvu6cj67rmwa5hkd4jj4wlrdj4k/lib/libAl.so: undefined reference to `cuStreamWaitValue32'
>> 703 collect2: error: ld returned 1 exit status
704 ninja: build stopped: subcommand failed.
See build log for details:
I am mapping the NVIDIA driver into the container, so the driver library should be available in both cases. Can you grab a regular Ubuntu 18.04 image and try building lbann with:
spack install --verbose [email protected] +gpu +nccl ^[email protected]
It should crash with the above error.
OK, I found it: I need to link libcuda.so from the stubs directory into some directory where the linker can find it. I was expecting the build to fail on a missing -lcuda, but CMake is smarter than that: if it does not find libcuda when it searches for CUDA, it does not add it to the link line at all, which causes the undefined references. I think this bug can be closed then.
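For anyone hitting the same thing in a Dockerfile build, here is a rough sketch of the workaround. It assumes the toolkit ships its stub at `<cuda root>/lib64/stubs/libcuda.so`, which is the usual layout but may differ on your install. The function name `expose_cuda_stub` and the default paths are my own choices, not anything from Spack or lbann:

```shell
#!/bin/sh
# Hypothetical helper: make the libcuda.so stub visible at build time.
# Assumes the stub lives in <cuda root>/lib64/stubs; adjust for your install.
expose_cuda_stub() {
    cuda_root="${1:-${CUDA_HOME:-/usr/local/cuda}}"
    link_dir="${2:-/usr/local/lib}"
    stub="$cuda_root/lib64/stubs/libcuda.so"

    if [ ! -f "$stub" ]; then
        echo "no libcuda.so stub under $cuda_root" >&2
        return 1
    fi

    mkdir -p "$link_dir" || return 1
    # Symlink rather than copy so the link follows the toolkit install.
    ln -sf "$stub" "$link_dir/libcuda.so"
}
```

Calling `expose_cuda_stub /usr/local/cuda /usr/local/lib` before the `spack install` step should be enough for the link to succeed. The stub only satisfies the linker; at run time the real libcuda from the driver still has to be present.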