Cannot find devices with 565+ on GH200 NVL2
NVIDIA Open GPU Kernel Modules Version
nvidia-driver-565-open (565.57.01), nvidia-driver-570-open (570.86.15)
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Ubuntu 24.04.1 LTS
Kernel Release
Linux 6.11.0-1002-nvidia-64k #2-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 23 19:17:25 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
GPU 0: NVIDIA GH200 144G HBM3e | GPU 1: NVIDIA GH200 144G HBM3e
Describe the bug
Installing Ubuntu 24.04 on a new GH200 NVL2 system, using the NVIDIA open kernel driver from apt. Neither GPU is found when using the nvidia-driver-565-open or nvidia-driver-570-open apt package, but both GPUs are found when using nvidia-driver-560-open.
To Reproduce
- `sudo apt install nvidia-driver-565-open`, then `nvidia-smi`: "No devices were found"
- `sudo apt install nvidia-driver-570-open`, then `nvidia-smi`: "No devices were found"
- `sudo apt install nvidia-driver-560-open`, then `nvidia-smi`: both GPUs found
Both GPUs are at VBIOS 96.00.A0.00.01, which should be newer than the 96.00.68.00.xx mentioned in the known issues: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-565-57-01/index.html#known-issues
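As a sanity check on that claim: the VBIOS fields appear to be hexadecimal, so A0 (160 decimal) is indeed newer than 68 (104 decimal). A quick field-by-field comparison sketch (`vbios_cmp` is my own helper, not an NVIDIA tool):

```shell
# Sketch only: compare two dotted VBIOS version strings field by field,
# treating each field as hexadecimal (so A0 = 160 is newer than 68 = 104).
# Prints 1 if $1 is newer than $2, -1 if older, 0 if equal.
vbios_cmp() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        fa=${a%%.*} fb=${b%%.*}
        da=$(printf '%d' "0x${fa:-0}")
        db=$(printf '%d' "0x${fb:-0}")
        if [ "$da" -gt "$db" ]; then echo 1; return; fi
        if [ "$da" -lt "$db" ]; then echo -1; return; fi
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    echo 0
}

vbios_cmp 96.00.A0.00.01 96.00.68.00.00   # prints 1 (A0 is newer)
```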
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response
Could you also generate an nvidia-bug-report.log.gz for 560? It may help to compare logs between working and failing configurations.
nvidia-bug-report.log.gz This is from 560.35.05, but the collection hung; let me know if it's good enough for diagnosis.
If I'm reading the log correctly, 560.35.05 has the same symptom as 570.86.15:
[ 6.488488] kernel: NVRM: numa memblock size of zero found during device start
[ 6.488492] kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to allocate NvKmsKapiDevice
[ 6.504217] kernel: [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to register device
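If it helps for triage, that failure signature can be pulled out of a saved kernel log with a simple filter (the pattern below is just my guess at the relevant lines, not an official diagnostic):

```shell
# Sketch: extract the suspect NVRM / nvidia-drm lines from a saved log file.
extract_failure_signature() {
    grep -E 'NVRM: numa memblock|\*ERROR\*.*nvidia-drm' "$1"
}

# Example usage against a captured dmesg:
#   dmesg > /tmp/dmesg.txt
#   extract_failure_signature /tmp/dmesg.txt
```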
Once the GPU gets into this state, I think the problem will persist until a reboot. Could I trouble you to do the following for each of 560.35.05 and 570.86.15?
- install driver
- reboot
- run nvidia-smi
- run nvidia-bug-report.sh
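For what it's worth, the sequence above can be captured as a small dry-run script; it only prints the commands to run, since the install and reboot steps need root and the real hardware:

```shell
# Dry-run sketch: print the requested test sequence for one driver package
# instead of executing it (the real steps require root and a reboot).
print_test_sequence() {
    pkg=$1
    echo "sudo apt install -y $pkg"
    echo "sudo reboot"
    echo "# after the reboot:"
    echo "nvidia-smi"
    echo "sudo nvidia-bug-report.sh"
}

print_test_sequence nvidia-driver-570-open
```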
I did reboot between installs; otherwise nvidia-smi shows a driver/library version mismatch. Currently I'm running 560.35.05 (new bug report attached) and the GPUs show up in nvidia-smi and are serving a Llama model correctly.
I will try to perform the sequence again for both versions tonight during downtime and get back to you. Thanks!
- Installed 570.86.15
- reboot
- run nvidia-smi -> No devices were found
- run nvidia-bug-report.sh
- Installed 565.57.01
- reboot
- run nvidia-smi -> No devices were found
- run nvidia-bug-report.sh
- Installed 560
- reboot
- run nvidia-smi -> both GPUs were found
- run nvidia-bug-report.sh
Any update?
We are still facing this issue and are staying on the 560 open driver...
@aritger hey nvidia, are you going to fix this bug?
Sorry, I've been busy and not had time to look at this. Looking back at the logs in the earlier comment https://github.com/NVIDIA/open-gpu-kernel-modules/issues/774#issuecomment-2632818464, it looks like we get the message
[ 6.488482] NVRM: numa memblock size of zero found during device start
[ 6.488486] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00190100] Failed to allocate NvKmsKapiDevice
[ 6.502476] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00190100] Failed to register device
even in 560, where both GPUs are correctly detected. So I guess that error message is not a foothold for debugging. But it helps to know the regression occurred between 560 and 565, so thank you for that. We'll have to get an internal reproduction to debug further.
Thanks! Our system is https://www.supermicro.com/en/products/system/gpu/2u/ars-221gl-nhir if it helps.