open-gpu-kernel-modules
Rocky Linux 8.8 crash
NVIDIA Open GPU Kernel Modules Version
525.47.04
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [X] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Rocky Linux 8.8
Kernel Release
linux-4.18.0-477.10.1.el8_8
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [X] I am running on a stable kernel release.
Hardware: GPU
GPU 0:Tesla T4
Describe the bug
Pinning and unpinning pages leads to a host crash. The NVIDIA driver locks user pages on behalf of CUDA APIs such as memory registration. If pinning with pin_user_pages fails for some reason, the driver falls back to get_page, which increments the page reference count by 1. However, when unlocking user pages the driver calls unpin_user_page regardless of how the page was acquired, so in this case the reference count is decreased by 1024 instead of 1. The host crashes because the page reference count drops below zero. It would be reasonable to record the pages for which pinning failed and the count was incremented by 1, and then release those pages with put_page rather than unpin_user_pages.
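The imbalance can be shown with a small userspace model. This is only a sketch of the accounting described above, not driver or kernel code: `struct model_page`, the `pinned` flag, and every `model_*` / `release_*` function are hypothetical names made up for illustration. The only value taken from the kernel is the pin bias of 1024 (`GUP_PIN_COUNTING_BIAS`), and the starting reference count of 1 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's GUP_PIN_COUNTING_BIAS: a pin adds this much to
 * the refcount of a normal page, and an unpin subtracts it again. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct model_page {
    long refcount;
    bool pinned; /* hypothetical bookkeeping: true if taken via the pin path */
};

/* Models a successful pin_user_pages() on one page. */
void model_pin(struct model_page *p)
{
    p->refcount += GUP_PIN_COUNTING_BIAS;
    p->pinned = true;
}

/* Models the get_page() fallback used when pinning fails. */
void model_get(struct model_page *p)
{
    p->refcount += 1;
    p->pinned = false;
}

/* Behavior described in this report: always release as if pinned. */
void release_buggy(struct model_page *p)
{
    p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/* Proposed behavior: release the same way the page was acquired. */
void release_fixed(struct model_page *p)
{
    if (p->pinned)
        p->refcount -= GUP_PIN_COUNTING_BIAS; /* unpin_user_page() path */
    else
        p->refcount -= 1;                     /* put_page() path */
}

/* Acquire through the fallback path, release, and return the final refcount.
 * The page starts with one reference held elsewhere (an assumption). */
long fallback_then_release(bool use_fix)
{
    struct model_page page = { .refcount = 1, .pinned = false };

    model_get(&page); /* pinning failed, so the fallback was taken */
    if (use_fix)
        release_fixed(&page);
    else
        release_buggy(&page);
    return page.refcount;
}
```

In this model, `fallback_then_release(false)` ends with a negative count (1 + 1 - 1024), which corresponds to the crash, while `fallback_then_release(true)` returns the page to its starting count of 1.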
To Reproduce
- Map a page backed by a special file.
- Register this page through the CUDA API.
- Run TF benchmarks.
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
No response