open-gpu-kernel-modules

Potential bug regarding RMAPI_GPU_LOCK_INTERNAL usage in _createOrReuseVidmemInfoPersistent

Open • legezywzh opened this issue 8 months ago • 1 comment

NVIDIA Open GPU Kernel Modules Version

565.57.01-p2p

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Description: Ubuntu 22.04.5 LTS

Kernel Release

Linux jmkernel 5.15.0-126-generic #136-Ubuntu SMP Wed Nov 6 10:38:22 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA A40 (UUID: GPU-f1654204-ae9d-31d9-da35-2e59c60cd8e4)

Describe the bug

Call chain:

=> rm_p2p_get_pages_persistent // calls rmapiLockAcquire() to acquire the API lock
==> RmP2PGetPagesPersistent
===> _createOrReuseVidmemInfoPersistent

At the beginning of _createOrReuseVidmemInfoPersistent(), there is this line:

RM_API *pRmApi = rmapiGetInterface(RMAPI_GPU_LOCK_INTERNAL);

and the definition of RMAPI_GPU_LOCK_INTERNAL carries this comment:

RMAPI_GPU_LOCK_INTERNAL, // For clients that have TLS, API lock, and GPU lock -- security is RM internal

IIUC, using RMAPI_GPU_LOCK_INTERNAL means the caller is assumed to already hold both the API lock and the GPU lock. From reading the code, however, it seems that only the API lock has been acquired before _createOrReuseVidmemInfoPersistent is called; the GPU lock has not.

So I wonder whether it should be changed to RM_API *pRmApi = rmapiGetInterface(RMAPI_API_LOCK_INTERNAL);
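The mismatch described above can be shown with a small toy model. This is only a sketch, not driver code: the `toy_` names are hypothetical, the meaning of the GPU-lock variant follows the comment quoted above, and the meaning of the API-lock variant (API lock only) is an assumption taken from the reporter's reading.

```c
#include <stdbool.h>

/* Toy model only. The enumerators mirror the identifiers quoted in this
 * issue, but nothing here is the driver's actual code. */
typedef enum {
    TOY_API_LOCK_INTERNAL, /* assumption: caller holds the API lock only */
    TOY_GPU_LOCK_INTERNAL  /* per the quoted comment: API lock and GPU lock */
} toy_rmapi_type;

/* Which locks the calling thread really holds at a given point. */
typedef struct {
    bool api_lock_held;
    bool gpu_lock_held;
} toy_lock_state;

/* Returns true when the locks actually held cover everything the
 * requested interface type assumes is already held. */
static bool toy_locks_match(toy_rmapi_type type, toy_lock_state held)
{
    switch (type) {
    case TOY_API_LOCK_INTERNAL:
        return held.api_lock_held;
    case TOY_GPU_LOCK_INTERNAL:
        return held.api_lock_held && held.gpu_lock_held;
    }
    return false; /* unknown interface type */
}
```

With the state described in the report (API lock held, GPU lock not held), `toy_locks_match(TOY_GPU_LOCK_INTERNAL, ...)` is false and `toy_locks_match(TOY_API_LOCK_INTERNAL, ...)` is true, which is the inconsistency being asked about.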

Thanks for your time.

To Reproduce

It may not be a bug; so far I have only found this by reading the code.

Bug Incidence

Once

nvidia-bug-report.log.gz

It may not be a bug; so far I have only found this by reading the code.

More Info

No response

legezywzh • Apr 02 '25 11:04

Hey there. Thanks, this is definitely looking like a bug, although I'm not sure the suggested fix is correct; more likely the API needs to take the GPU lock too.
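As a rough illustration of that alternative, the toy sketch below keeps the callee's assumption (both locks held) and instead has the caller take the GPU lock around the call. All names are hypothetical stand-ins and the locks are modeled as plain flags; the real driver's lock API and call path are not reproduced here.

```c
#include <stdbool.h>

/* Toy model only: locks are plain flags, not real synchronization. */
typedef struct {
    bool api_lock_held;
    bool gpu_lock_held;
} toy_lock_state;

/* Hypothetical stand-ins for acquiring and releasing the GPU lock. */
static void toy_gpu_lock_acquire(toy_lock_state *s) { s->gpu_lock_held = true; }
static void toy_gpu_lock_release(toy_lock_state *s) { s->gpu_lock_held = false; }

/* Stand-in for a callee that uses a GPU_LOCK_INTERNAL-style interface:
 * it is only correct when both locks are already held. */
static bool toy_callee_requires_both(const toy_lock_state *s)
{
    return s->api_lock_held && s->gpu_lock_held;
}

/* Caller variant that takes the GPU lock before the call and drops it
 * afterwards, so the callee's assumption holds for the whole call. */
static bool toy_caller_with_gpu_lock(toy_lock_state *s)
{
    toy_gpu_lock_acquire(s);
    bool ok = toy_callee_requires_both(s);
    toy_gpu_lock_release(s);
    return ok;
}
```

Starting from a state with only the API lock held, calling `toy_callee_requires_both` directly fails the check, while going through `toy_caller_with_gpu_lock` satisfies it and leaves the GPU lock released on return.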

We have some tooling to detect these but it was generating plenty of false positives so we never enabled it. Maybe it's time to resurrect it.

mtijanic • Apr 04 '25 11:04