GPU support under Ubuntu

Open kushnirm opened this issue 6 years ago • 8 comments

OS is Ubuntu 16.04. Nvidia drivers installed and working fine. Nvidia drivers and CUDA work fine in nvidia-docker. Using driver 384.111 and CUDA 9.0 for testing. Slurm+shifter working fine.

But, under shifter, I can't get GPU integration to work quite right. When running an image with nvidia-docker, drivers and utilities like nvidia-smi are available and work. When running the same container via shifter they are not.

If I copy /usr/lib/nvidia-384 to my siteFs and set PATH and LD_LIBRARY_PATH, nvidia-smi returns the expected output. However, the CUDA demo apps (e.g. deviceQuery) say:

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
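
One way to narrow down an error like this is to check, from inside the container, whether the driver library (libcuda) can be loaded at all and which CUDA version it reports compared with the runtime. The following is a minimal diagnostic sketch, not part of shifter. The library names tried (libcuda.so.1, libcudart.so, libcudart.so.9.0) and the helper names are assumptions for illustration; deviceQuery links the runtime statically, so a shared libcudart may not be present in every image.

```python
import ctypes


def decode_cuda_version(v):
    """Split CUDA's integer encoding (1000*major + 10*minor) into (major, minor)."""
    return v // 1000, (v % 1000) // 10


def query_version(lib_names, symbol):
    """Try each library name in turn; return the integer version, or None if none works."""
    for name in lib_names:
        try:
            lib = ctypes.CDLL(name)
        except OSError:
            continue  # library not found on the current loader path
        try:
            func = getattr(lib, symbol)
        except AttributeError:
            continue  # library loaded but does not export the symbol
        value = ctypes.c_int(0)
        if func(ctypes.byref(value)) == 0:  # 0 means success in both APIs
            return value.value
    return None


def diagnose(driver, runtime):
    """Turn the two version integers (or None) into a short human-readable verdict."""
    if driver is None:
        return "libcuda could not be loaded: check LD_LIBRARY_PATH and bind mounts"
    if runtime is None:
        return "driver supports CUDA %d.%d; no shared runtime found to compare" % decode_cuda_version(driver)
    if driver < runtime:
        return "driver (CUDA %d.%d) is older than runtime (CUDA %d.%d)" % (
            decode_cuda_version(driver) + decode_cuda_version(runtime))
    return "ok: driver supports CUDA %d.%d, runtime is CUDA %d.%d" % (
        decode_cuda_version(driver) + decode_cuda_version(runtime))


if __name__ == "__main__":
    # Library names below are assumptions; adjust them to the image being tested.
    drv = query_version(["libcuda.so.1", "libcuda.so"], "cuDriverGetVersion")
    rt = query_version(["libcudart.so", "libcudart.so.9.0"], "cudaRuntimeGetVersion")
    print(diagnose(drv, rt))
```

If the first message is printed inside the shifter container but not under nvidia-docker, the driver library is probably not visible to the loader in the container, which points at missing mounts or paths rather than an outdated driver.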

Thanks, Michael

kushnirm avatar May 14 '18 17:05 kushnirm

On further review, it looks like some of the GPU-related bind mounts are not being created automatically. I found the contrib/gpu/activate_gpu_support.sh script, but I am not sure when, where, or how it is invoked.

Please advise.

Thanks, Michael

kushnirm avatar May 23 '18 16:05 kushnirm

Michael,

We don't have a GPU system to test with at NERSC. Let me ping some of the CSCS folks and see if they can comment.

scanon avatar Jun 28 '18 00:06 scanon

I have the same question: how is contrib/gpu/activate_gpu_support.sh intended to be used?

uvNikita avatar Aug 20 '18 14:08 uvNikita

Found the commit that removed the code which used this script: https://github.com/NERSC/shifter/commit/c5e66cc07192138ab3e9d4b3a43c1815153ca274, but I don't see any replacement for this functionality.

uvNikita avatar Aug 20 '18 14:08 uvNikita

We received no guidance on this and never got it to work. Sorry. Singularity worked as an alternative for us.

Cheers, Michael

kushnirm avatar Aug 20 '18 15:08 kushnirm

NERSC will hopefully be able to help more directly on this in the near future.

scanon avatar Aug 20 '18 16:08 scanon

We are facing the same issue. With /usr/lib/nvidia-384 loaded into the container, nvidia-smi shows the GPUs present on the node. But when we run the deviceQuery and nbody benchmarks, they fail with the same error: CUDA driver version is insufficient for CUDA runtime version Result = FAIL

Is there another way to test GPUs with the shifter and Slurm integration?

sk2991 avatar Aug 23 '18 04:08 sk2991

After digging through the sources and git history, it seems that the plan is to replace the old GPU support with the new module system; see doc/modules.rst and doc/config/udiRoot.conf.rst.

So, we added these lines to our config:

module_nvidia_siteEnvAppend=LD_LIBRARY_PATH=/opt/udiImage/modules/nvidia PATH=/nvidia-bin PATH=/cuda/bin
module_nvidia_siteFs=/usr/bin:/nvidia-bin;/usr/local/cuda:/cuda
module_nvidia_copyPath=/usr/lib64/nvidia

After this, users can start jobs which require nvidia libraries by specifying shifter --module nvidia.
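
To make the three settings above easier to read, here is a small sketch that parses them and lists what the module would set up. It is an illustration only, not shifter code. It assumes that siteFs is a semicolon-separated list of source:destination bind mounts, that siteEnvAppend is a space-separated list of VAR=value entries, and that the contents of copyPath become visible at /opt/udiImage/modules/<module name> (which matches the LD_LIBRARY_PATH value in the config). The function names are hypothetical.

```python
CONFIG = """\
module_nvidia_siteEnvAppend=LD_LIBRARY_PATH=/opt/udiImage/modules/nvidia PATH=/nvidia-bin PATH=/cuda/bin
module_nvidia_siteFs=/usr/bin:/nvidia-bin;/usr/local/cuda:/cuda
module_nvidia_copyPath=/usr/lib64/nvidia
"""


def parse_module(config_text, module):
    """Collect the module_<module>_* settings from udiRoot.conf-style text."""
    prefix = "module_%s_" % module
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Split on the first '=' only; values may contain further '=' signs.
        key, _, value = line[len(prefix):].partition("=")
        settings[key.strip()] = value.strip()
    return settings


def describe(settings, module):
    """Break the raw settings into mounts, environment additions, and the copy step."""
    mounts = [tuple(pair.split(":", 1))
              for pair in settings.get("siteFs", "").split(";") if pair]
    env = [tuple(item.split("=", 1))
           for item in settings.get("siteEnvAppend", "").split()]
    # Destination path is an assumption based on the LD_LIBRARY_PATH entry above.
    copy = (settings.get("copyPath"), "/opt/udiImage/modules/" + module)
    return {"mounts": mounts, "env_append": env, "copy": copy}


if __name__ == "__main__":
    info = describe(parse_module(CONFIG, "nvidia"), "nvidia")
    for src, dst in info["mounts"]:
        print("bind mount  %s -> %s" % (src, dst))
    for var, value in info["env_append"]:
        print("append      %s += %s" % (var, value))
    print("copy        %s -> %s" % info["copy"])
```

Read this way, the config exposes the host's /usr/bin and /usr/local/cuda inside the container under new names, copies the driver libraries from /usr/lib64/nvidia, and appends the matching entries to PATH and LD_LIBRARY_PATH.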

uvNikita avatar Aug 23 '18 08:08 uvNikita