open-gpu-kernel-modules

Rocky Linux 8.8 crash

Open neo-v opened this issue 1 year ago • 3 comments

NVIDIA Open GPU Kernel Modules Version

525.47.04

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [X] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Rocky Linux 8.8

Kernel Release

linux-4.18.0-477.10.1.el8_8

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [X] I am running on a stable kernel release.

Hardware: GPU

GPU 0:Tesla T4

Describe the bug

Pinning and unpinning pages leads to a host crash. The NVIDIA driver locks user pages on behalf of CUDA APIs such as memory registration. If pinning with pin_user_pages fails for some reason, the driver pins the page with get_page instead, which increments the page reference count by 1. However, when unlocking user pages the driver calls unpin_user_page regardless of how the page was pinned, so in this case the page reference count is decreased by 1024 rather than 1. The host crashes because the page reference count drops below zero. It would be reasonable to record which pages failed to pin and were only incremented by 1, and then call put_page rather than unpin_user_pages for those pages.
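To make the mismatch concrete, here is a minimal userspace sketch of the reference counting described above. It is a model, not driver code: every `model_*` name, the `locked_page` struct, and the `pinned` flag are hypothetical, and `PIN_BIAS` is assumed to correspond to the kernel's `GUP_PIN_COUNTING_BIAS` (the 1024 mentioned in the report).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: these names mimic, but are not, the kernel APIs. */
#define PIN_BIAS 1024 /* assumed to match the "decreased by 1024" in the report */

struct model_page {
    int refcount; /* stand-in for the kernel's page reference count */
};

struct locked_page {
    struct model_page *page;
    bool pinned; /* true: pin path (+1024); false: get_page fallback (+1) */
};

static void model_pin(struct model_page *p)   { p->refcount += PIN_BIAS; }
static void model_get(struct model_page *p)   { p->refcount += 1; }
static void model_unpin(struct model_page *p) { p->refcount -= PIN_BIAS; }
static void model_put(struct model_page *p)   { p->refcount -= 1; }

/* Lock a page; pin_fails simulates the pin path failing for this mapping. */
static struct locked_page model_lock(struct model_page *p, bool pin_fails)
{
    struct locked_page lp = { .page = p, .pinned = !pin_fails };

    assert(p != NULL);
    if (pin_fails)
        model_get(p);  /* fallback: take a plain reference */
    else
        model_pin(p);  /* normal path: take a pin */
    return lp;
}

/* Reported behavior: always unpin, however the reference was taken. */
static void model_unlock_buggy(struct locked_page *lp)
{
    model_unpin(lp->page);
}

/* Suggested behavior: release with the counterpart of the acquire call. */
static void model_unlock_tracked(struct locked_page *lp)
{
    if (lp->pinned)
        model_unpin(lp->page);
    else
        model_put(lp->page);
}

/* Lock and unlock one page that starts at refcount 1; return the final count. */
int model_run(bool pin_fails, bool tracked)
{
    struct model_page page = { .refcount = 1 };
    struct locked_page lp = model_lock(&page, pin_fails);

    if (tracked)
        model_unlock_tracked(&lp);
    else
        model_unlock_buggy(&lp);
    return page.refcount;
}
```

In this model, `model_run(true, false)` takes the fallback path and then unpins, leaving the count at -1022, while `model_run(true, true)` returns it to 1. Whether the real driver has a convenient place to store such a per-page flag is not something the report establishes.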

To Reproduce

  1. Map a page backed by a special file.
  2. Register this page through the CUDA API.
  3. Run TF benchmarks.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

neo-v Oct 11 '23 12:10