bottlerocket
PyTorch could not detect the NVIDIA driver on Bottlerocket
Sorry, I don't have all the details, but I'd like to report that I had issues using PyTorch on the Bottlerocket image for EKS.
When I switched to the AL2 GPU AMI, it worked without an issue.
- EKS 1.29
- node group with the default launch template (so the latest AMI image)
- instance type: g4dn.xlarge
- The EKS cluster doesn't use the NVIDIA device driver / GPU operator
AMI
- BOTTLEROCKET_X86_64_NVIDIA: ami-0d31d8d1285f91827 - bottlerocket-aws-k8s-1.29-nvidia-x86_64-v1.19.4-4f0a078e
- AL2_x86_64_GPU: ami-093bb52bc444e09ba - amazon-eks-gpu-node-1.29-v20240415
In both AMIs the NVIDIA kernel module seems to be loaded, but with different params.
cat /proc/driver/nvidia/version
BOTTLEROCKET_x86_64_NVIDIA:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.161.07 Sat Feb 17 22:55:48 UTC 2024
GCC version: gcc version 11.3.0 (Buildroot 2022.11.1)
AL2_x86_64_GPU:
NVRM version: NVIDIA UNIX x86_64 Kernel Module 535.161.08 Tue Mar 5 22:42:15 UTC 2024
GCC version: gcc version 10.5.0 20230707 (Red Hat 10.5.0-1) (GCC)
cat /proc/driver/nvidia/params
BOTTLEROCKET_x86_64_NVIDIA:
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
AL2_x86_64_GPU:
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 0
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 0
DmaRemapPeerMmio: 1
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
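To make the difference between the two dumps easier to see, here is a minimal sketch that parses `key: value` text in the format of /proc/driver/nvidia/params and prints only the keys whose values differ. The helper names `parse_params` and `diff_params` are made up for illustration, and the sample strings are abbreviated copies of the output above. Run against the full dumps, the only two differing keys are ModifyDeviceFiles and EnableGpuFirmware; whether either one is related to the PyTorch failure is not established here.

```python
def parse_params(text: str) -> dict:
    """Parse 'Key: value' lines (the /proc/driver/nvidia/params format) into a dict."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no 'key: value' separator
            params[key.strip()] = value.strip()
    return params


def diff_params(left: dict, right: dict) -> dict:
    """Return {key: (left_value, right_value)} for every key whose value differs."""
    keys = sorted(left.keys() | right.keys())
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Abbreviated samples copied from the two dumps above (hypothetical variable names).
bottlerocket = parse_params("""ModifyDeviceFiles: 1
DeviceFileMode: 438
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2""")

al2 = parse_params("""ModifyDeviceFiles: 0
DeviceFileMode: 438
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2""")

for name, (br_value, al2_value) in diff_params(bottlerocket, al2).items():
    print(f"{name}: bottlerocket={br_value} al2={al2_value}")
```

On a live node, the same functions can be fed the contents of /proc/driver/nvidia/params read from each host instead of the inline samples.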
However, PyTorch failed to detect the driver on Bottlerocket.
Only in BOTTLEROCKET_x86_64_NVIDIA:
python -c "import torch; torch.cuda.current_device()"
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
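As far as I understand, this error usually means the CUDA runtime inside the container could not reach a driver at all, which is a separate question from whether the kernel module is loaded on the host. A minimal sketch for narrowing that down from inside the pod, using only the Python standard library: it lists the /dev/nvidia* device nodes and tries to load the user-space driver library `libcuda.so.1` and call `cuInit(0)`. The function name `probe_nvidia` and the report keys are made up for illustration, and this is a diagnostic aid under those assumptions, not a confirmed root-cause check.

```python
import ctypes
import glob


def probe_nvidia() -> dict:
    """Collect basic facts about NVIDIA driver visibility in the current container."""
    report = {
        # Device nodes such as /dev/nvidia0 and /dev/nvidiactl, if they are exposed.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        "libcuda_loaded": False,
        "cuInit_result": None,
        "error": None,
    }
    try:
        # The user-space driver library; raises OSError if the loader cannot find it.
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError as exc:
        report["error"] = str(exc)
        return report
    report["libcuda_loaded"] = True
    # cuInit returns a CUresult; 0 means CUDA_SUCCESS.
    report["cuInit_result"] = libcuda.cuInit(0)
    return report


print(probe_nvidia())
```

If the device nodes are missing or `libcuda.so.1` cannot be loaded in the Bottlerocket pod but both are present in the AL2 pod, that would point at how the GPU is exposed to the container and not at PyTorch itself.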
python -m torch.utils.collect_env
BOTTLEROCKET_x86_64_NVIDIA:
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.1.82-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
AL2_x86_64_GPU:
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.28.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.10.213-201.855.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Python packages used:
[pip3] numpy==1.26.3
[pip3] pytorch-lightning==2.1.3
[pip3] pytorch-metric-learning==2.4.1
[pip3] torch==2.0.1+cu117
[pip3] torch-audiomentations==0.11.0
[pip3] torch-pitch-shift==1.2.4
[pip3] torchaudio==2.0.2
[pip3] torchmetrics==1.3.0.post0
Could it be related to https://github.com/awslabs/amazon-eks-ami/issues/1523 ?