open-gpu-kernel-modules Can not find device after 565+ on GH200 NVL2

NVIDIA Open GPU Kernel Modules Version

nvidia-driver-565-open(565.57.01), nvidia-driver-570-open (570.86.15)

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

[ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 24.04.1 LTS

Kernel Release

Linux 6.11.0-1002-nvidia-64k #2-Ubuntu SMP PREEMPT_DYNAMIC Wed Oct 23 19:17:25 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

[x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GH200 144G HBM3e | GPU 1: NVIDIA GH200 144G HBM3e

Describe the bug

Installed Ubuntu 24.04 on a new GH200 NVL2 system, using the NVIDIA open kernel driver from apt. Neither GPU is found with the nvidia-driver-565-open or nvidia-driver-570-open apt package, but both GPUs are found with nvidia-driver-560-open.

To Reproduce

sudo apt install nvidia-driver-565-open
nvidia-smi -> No devices were found

sudo apt install nvidia-driver-570-open
nvidia-smi -> No devices were found

sudo apt install nvidia-driver-560-open
nvidia-smi -> both GPUs found

Both GPUs are at VBIOS 96.00.A0.00.01, which should be newer than the 96.00.68.00.xx listed in the 565.57.01 known issues: https://docs.nvidia.com/datacenter/tesla/tesla-release-notes-565-57-01/index.html#known-issues
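The VBIOS comparison above can be checked mechanically. This is a minimal sketch, assuming each dot-separated field is a hexadecimal byte and that the `xx` in the threshold is a wildcard treated as the lowest value; neither assumption is confirmed by the release notes. `parse_vbios` and `is_newer` are hypothetical helper names.

```python
def parse_vbios(version: str) -> tuple[int, ...]:
    """Parse a dotted VBIOS string into integers, assuming hex fields.

    A wildcard field such as "xx" is treated as 0 (assumption).
    """
    fields = []
    for part in version.strip().split("."):
        fields.append(0 if part.lower() == "xx" else int(part, 16))
    return tuple(fields)


def is_newer(installed: str, threshold: str) -> bool:
    """Return True if `installed` sorts strictly after `threshold`."""
    # Tuples compare field by field, left to right.
    return parse_vbios(installed) > parse_vbios(threshold)


if __name__ == "__main__":
    # Third field decides it: 0xA0 (160) is greater than 0x68 (104).
    print(is_newer("96.00.A0.00.01", "96.00.68.00.xx"))
```

Under these assumptions the installed VBIOS does sort after the threshold, which is consistent with the reporter's reading.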

nvidia-bug-report.log.gz

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

No response

Feb 02 '25 07:02 zilingzhang

Could you also generate an nvidia-bug-report.log.gz for 560? It may help to compare logs between working and failing configurations.

Feb 02 '25 23:02 aritger

nvidia-bug-report.log.gz This is from 560.35.05, but the collection hung; let me know if it's good enough for diagnosis.

Feb 03 '25 07:02 zilingzhang

If I'm reading the log correctly, 560.35.05 has the same symptom as 570.86.15:

[    6.488488] kernel: NVRM: numa memblock size of zero found during device start
[    6.488492] kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to allocate NvKmsKapiDevice
[    6.504217] kernel: [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00090100] Failed to register device

Once the GPU gets into this state, I think the problem will persist until a reboot. Could I trouble you to do the following for each of 560.35.05 and 570.86.15?

install driver
reboot
run nvidia-smi
run nvidia-bug-report.sh
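The last two steps of the sequence above (the ones that run after the reboot) could be scripted so each driver version produces consistently named files. This is only a sketch: the function name, the file naming scheme, and the use of `sudo` are assumptions, and `--output-file` is used on the understanding that nvidia-bug-report.sh accepts it to choose the log path.

```python
import subprocess
from pathlib import Path


def collect_after_reboot(label: str, out_dir: Path, run=subprocess.run) -> str:
    """Run nvidia-smi, save its output, then run nvidia-bug-report.sh.

    Call once per driver version, after installing the driver and rebooting.
    `label` (for example "570.86.15") is only used to name the output files.
    `run` is injectable so the function can be exercised without a GPU.
    """
    out_dir.mkdir(parents=True, exist_ok=True)

    # Step 3: run nvidia-smi and keep whatever it prints, even on failure.
    smi = run(["nvidia-smi"], capture_output=True, text=True)
    smi_text = (smi.stdout or "") + (smi.stderr or "")
    (out_dir / f"{label}.nvidia-smi.txt").write_text(smi_text)

    # Step 4: collect the bug report. The script normally needs root and
    # appends ".gz" to the file name when gzip is available.
    report = out_dir / f"{label}.nvidia-bug-report.log"
    run(["sudo", "nvidia-bug-report.sh", "--output-file", str(report)])
    return smi_text
```

For example, `collect_after_reboot("570.86.15", Path("reports"))` would be called after booting into 570.86.15, and again with `"560.35.05"` after booting into that version.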

Feb 03 '25 16:02 aritger

I did reboot between installs; otherwise nvidia-smi reports a driver mismatch. Currently I'm running 560.35.05 (New bug report), and the GPUs show up in nvidia-smi and are serving a llama model correctly.

I will try to perform the sequence again for both versions tonight during downtime and get back to you. Thanks!

Feb 03 '25 17:02 zilingzhang

Installed 570.86.15
reboot
run nvidia-smi -> No devices were found
run nvidia-bug-report.sh

570.nvidia-bug-report.log.gz

Installed 565.57.01
reboot
run nvidia-smi -> No devices were found
run nvidia-bug-report.sh

565.nvidia-bug-report.log.gz

Installed 560
reboot
run nvidia-smi -> both GPUs were found
run nvidia-bug-report.sh

560.nvidia-bug-report.log.gz

Feb 04 '25 04:02 zilingzhang

Any update?

Apr 16 '25 17:04 liho00

We still face this issue and are staying on the 560 open driver...

Apr 16 '25 18:04 zilingzhang

@aritger hey nvidia, are you going to fix this bug?

Apr 19 '25 05:04 liho00

Sorry, I've been busy and not had time to look at this. Looking back at the logs in the earlier comment https://github.com/NVIDIA/open-gpu-kernel-modules/issues/774#issuecomment-2632818464, it looks like we get the message

[    6.488482] NVRM: numa memblock size of zero found during device start
[    6.488486] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00190100] Failed to allocate NvKmsKapiDevice
[    6.502476] [drm:nv_drm_register_drm_device [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00190100] Failed to register device

even in 560, where both GPUs are correctly detected. So I guess that error message is not a foothold for debugging. But it helps to know the regression occurred between 560 and 565, so thank you for that. We'll have to get an internal reproduction to debug further.
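One way to separate messages that appear in both working and failing logs from those unique to the failing one is to normalize the kernel lines and take the difference. The sketch below assumes the log format seen in the excerpts in this thread (a bracketed timestamp, an optional `kernel:` prefix, and a hex GPU ID); the function names are hypothetical and the filter on "NVRM"/"nvidia" is a simplification.

```python
import re

# Leading "[   6.488488] " with an optional "kernel: " prefix.
_TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*(kernel:\s*)?")
# GPU IDs differ between runs and devices, so mask them.
_GPU_ID = re.compile(r"GPU ID 0x[0-9a-fA-F]+")


def normalize(line: str) -> str:
    """Strip the timestamp/prefix and mask the GPU ID of one log line."""
    line = _TIMESTAMP.sub("", line.strip())
    return _GPU_ID.sub("GPU ID <id>", line)


def driver_lines(log_text: str) -> list[str]:
    """Return normalized, de-duplicated driver lines in first-seen order."""
    seen = {}
    for raw in log_text.splitlines():
        if "NVRM" in raw or "nvidia" in raw:
            seen.setdefault(normalize(raw), None)
    return list(seen)


def only_in_failing(working_log: str, failing_log: str) -> list[str]:
    """Return driver lines present in the failing log but not the working one."""
    working = set(driver_lines(working_log))
    return [line for line in driver_lines(failing_log) if line not in working]
```

Given the two excerpts quoted in this thread, which differ only in timestamps, the `kernel:` prefix, and the GPU ID, `only_in_failing` returns an empty list, which matches the observation that this message does not distinguish the versions.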

Apr 28 '25 07:04 aritger

Thanks! Our system is https://www.supermicro.com/en/products/system/gpu/2u/ars-221gl-nhir if it helps.

May 02 '25 23:05 zilingzhang