PyTorch could not detect the NVIDIA driver on Bottlerocket
Sorry, I don't have all the details, but I'd like to report that I had issues using PyTorch on the Bottlerocket image for EKS.
When I switched to the AL2 GPU AMI, it worked without issue.
- EKS 1.29
- node group with the default launch template (i.e., the latest AMI)
- instance type: g4dn.xlarge
- The EKS cluster doesn't use the NVIDIA device plugin / GPU Operator
AMI
- BOTTLEROCKET_x86_64_NVIDIA: ami-0d31d8d1285f91827 - bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.4-4f0a078e
- AL2_x86_64_GPU: ami-093bb52bc444e09ba - amazon-eks-gpu-node-1.29-v20240415
In both AMIs the NVIDIA kernel module seems to be loaded, but with different parameters.
cat /proc/driver/nvidia/version
BOTTLEROCKET_x86_64_NVIDIA:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.161.07 Sat Feb 17 22:55:48 UTC 2024
GCC version: gcc version 11.3.0 (Buildroot 2022.11.1)
AL2_x86_64_GPU:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.161.08 Tue Mar 5 22:42:15 UTC 2024
GCC version: gcc version 10.5.0 20230707 (Red Hat 10.5.0-1) (GCC)
cat /proc/driver/nvidia/params
BOTTLEROCKET_x86_64_NVIDIA:
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
AL2_x86_64_GPU:
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 0
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
However, PyTorch failed to detect the driver on Bottlerocket.
Only in BOTTLEROCKET_x86_64_NVIDIA:
python -c "import torch; torch.cuda.current_device()"
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
python -m torch.utils.collect_env:
BOTTLEROCKET_x86_64_NVIDIA:
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.82-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
AL2_x86_64_GPU:
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.213-201.855.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Python packages used:
[pip3] numpy==1.26.3
[pip3] pytorch-lightning==2.1.3
[pip3] pytorch-metric-learning==2.4.1
[pip3] torch==2.0.1+cu117
[pip3] torch-audiomentations==0.11.0
[pip3] torch-pitch-shift==1.2.4
[pip3] torchaudio==2.0.2
[pip3] torchmetrics==1.3.0.post0
Could it be related to https://github.com/awslabs/amazon-eks-ami/issues/1523?
If it needs to be reported to https://github.com/awslabs/amazon-eks-ami/issues then please let me know.
Hello @chulkilee, thanks for cutting this issue! I don't believe this would be related to GSP on g4dn.xlarge instances but you could follow https://github.com/bottlerocket-os/bottlerocket/issues/3817#issuecomment-1997151422 just to confirm that isn't the problem.
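One quick check for GSP (a sketch, assuming nvidia-smi is reachable from a GPU pod or the host) is to look at the firmware state the driver reports:
nvidia-smi -q | grep -i 'GSP Firmware'   # prints a firmware version when GSP offload is active, "N/A" when disabled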
The difference in the output between Bottlerocket and Amazon Linux for the module config is:
Bottlerocket: ModifyDeviceFiles: 1
Amazon Linux: ModifyDeviceFiles: 0
Bottlerocket: EnableGpuFirmware: 18
Amazon Linux: EnableGpuFirmware: 0
EnableGpuFirmware is the GSP change, and ModifyDeviceFiles, when set to 0, disables dynamic device file management.
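A quick way to compare just those two parameters on a node (a sketch; this reads the standard procfs path the driver exposes):
grep -E 'ModifyDeviceFiles|EnableGpuFirmware' /proc/driver/nvidia/params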
What is strange is that PyTorch reports CUDA as unavailable when it really should be available, since everything else you called out is present.
Can you also confirm what your podspec looks like just to make sure all the right settings are being passed from that perspective?
Hello @chulkilee, I just tried using an image from NVIDIA to confirm that PyTorch can see the devices on a g4dn.xlarge node with the latest Bottlerocket, and I don't see the same issue:
# python -c "import torch; print(torch.cuda.get_device_name(0))"
Tesla T4
Can you confirm which base container you are using and which CUDA version is included? I'm not able to reproduce this with the image I used.
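In the meantime, a few checks from inside the failing container could narrow down whether the container toolkit injected the devices and driver libraries at all (a minimal sketch, assuming you can exec into the pod):
ls -l /dev/nvidia*            # device nodes the runtime should have injected
ldconfig -p | grep libcuda    # driver userspace library visible to the loader
nvidia-smi                    # talks to the kernel driver directly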
@chulkilee, do your container images contain the following environment variables?
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
If not, I would suggest adding them.
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
Those were set. I'm using the nvidia/cuda:11.8.0-base-ubuntu22.04 image, but it's still failing.
Update
declare -x CUDA_VERSION="11.8.0"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516"
declare -x NV_CUDA_COMPAT_PACKAGE="cuda-compat-11-8"
declare -x NV_CUDA_CUDART_VERSION="11.8.89-1"
Even if I unset NVIDIA_REQUIRE_CUDA, it still fails with the same error.
I also tested the same image with the 1.19.4-4f0a078e and 1.19.5-64049ba8 AMI releases; both failed.
@chulkilee, are you requesting GPUs in your pod specs? Or do you need to oversubscribe your GPUs, and thus use NVIDIA_VISIBLE_DEVICES=all to get access to all the GPUs in the instance from your pod?
I tested the g5g.xlarge instance with the BOTTLEROCKET_ARM_64_NVIDIA image using the nvcr.io/nvidia/pytorch:24.03-py3 container. I observed that CUDA is only detected when the GPU resource is explicitly specified.
According to the Kubernetes documentation, GPUs can be utilized by requesting the custom GPU resource. However, the documentation does not clarify the expected behavior when the GPU resource is not specified, even if GPUs are available on the node.
You can consume these GPUs from your containers by requesting the custom GPU resource, the same way you request CPU or memory. However, there are some limitations in how you specify the resource requirements for custom devices.
It's important to note that this behavior differs from what is observed when using the Amazon Linux image.
Pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    node.kubernetes.io/instance-type: g5g.xlarge
  containers:
    - name: shell
      image: nvcr.io/nvidia/pytorch:24.03-py3
      command: [sleep, "3600"]
    - name: shell2
      image: nvcr.io/nvidia/pytorch:24.03-py3
      command: [sleep, "3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
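To verify (a sketch, assuming kubectl access to the cluster), exec into each container; with the allocation behavior described above, only the container that requests nvidia.com/gpu should see CUDA:
kubectl exec gpu-test -c shell -- python -c 'import torch; print(torch.cuda.is_available())'    # expected: False
kubectl exec gpu-test -c shell2 -- python -c 'import torch; print(torch.cuda.is_available())'   # expected: True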
I observed that CUDA is only detected when the GPU resource is explicitly specified.
I believe this is a result of my PR https://github.com/bottlerocket-os/bottlerocket/pull/3718, which enables correct allocation of GPUs.
The idea is that merely setting ENV NVIDIA_VISIBLE_DEVICES all in the image should not let a container steal all the GPUs on the node.
It's important to note that this behavior differs from what is observed when using the Amazon Linux image.
This is correct; ideally the AL2 images should also enable the following config/env on the container toolkit to enforce correct allocation and isolation:
ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED=false
ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS=true
Someone really needs to push for this, though it might face some backlash because it changes behavior in a backwards-incompatible way.
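For reference, these correspond to top-level keys in the NVIDIA Container Toolkit's /etc/nvidia-container-runtime/config.toml (a sketch using the upstream key names; defaults may vary across distributions):
accept-nvidia-visible-devices-envvar-when-unprivileged = false
accept-nvidia-visible-devices-as-volume-mounts = true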
Thanks for the confirmation! I agree that allocating GPUs only when requested is the better behavior, but changing AL is not easy. I hope it happens in a major AL version bump.
I'm closing this, as it is not an issue on the Bottlerocket side.
FWIW @chiragjn, there are users who relied on the NVIDIA_VISIBLE_DEVICES=all behavior that you helped us disable (thanks again). @chulkilee, we will be exposing an API to allow changing the default configuration for the settings described above in an upcoming release (see #4182), and we are planning to add support for time slicing soon, so that a GPU can be oversubscribed but with more control over which pods get access to it.
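For context, the upstream NVIDIA k8s-device-plugin expresses time slicing with a config along these lines (illustrative only; the replica count is an example, and Bottlerocket's settings API may expose this differently):
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4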