[BUG] SYSTEM_MR_TEST fails on some systems with HMM enabled
Describe the bug
Nightlies failed here: https://github.com/rapidsai/rmm/actions/runs/15409284930
The tests that seem to be failing most consistently are:
The following tests FAILED:
54 - SYSTEM_MR_TEST (Failed)
55 - SYSTEM_MR_PTDS_TEST (Failed)
Currently the best guess at a root cause is that the CI runners' base OS was upgraded to Ubuntu 24.04, which appears to have changed some behavior. My current belief is that the failing tests may need to run serially (by requesting 100% of the GPU in the ctest configuration).
However, there are some other failure modes observed, too:
[ RUN ] ResourceTests/mr_ref_test.SetCurrentDeviceResourceRef/System
/tmp/conda-bld-output/bld/rattler-build_librmm/work/cpp/tests/mr/device/mr_ref_test.hpp:79: Failure
Value of: is_device_accessible_memory(ptr)
Actual: false
Expected: true
[ FAILED ] ResourceTests/mr_ref_test.SetCurrentDeviceResourceRef/System, where GetParam() = "System" (0 ms)
I will investigate further and update this issue.
I think there are two problems. One has to do with parallel test execution, and that appears to be fixed by the draft changes in #1944. The separate problem noted above can be described as follows: on CUDA 11, a system memory resource is used to make an allocation, and that allocation does not appear to be device-accessible, which seems to be a regression from past behavior.
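For context, the failing path boils down to roughly the following (a minimal sketch, not the actual test code in mr_ref_test.hpp; I'm assuming the rmm::mr::system_memory_resource class and header path here):

```cpp
// Sketch of the allocation path only: a system memory resource hands back
// ordinary pageable host memory, which the GPU is expected to be able to
// address directly on HMM/ATS systems.
#include <rmm/mr/device/system_memory_resource.hpp>

int main()
{
  rmm::mr::system_memory_resource mr{};
  void* ptr = mr.allocate(4096);  // pageable system allocation
  // ... the real test then asserts is_device_accessible_memory(ptr) ...
  mr.deallocate(ptr, 4096);
  return 0;
}
```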
We've had some changes in the CI infrastructure recently, so I'm not sure what to blame yet.
Here is a bit more debugging information:
[ RUN ] ResourceTests/mr_ref_test.SetCurrentDeviceResourceRef/System
cudaPointerGetAttributes succeeded for ptr=0x59cec410a000, devicePointer: 0
is_device_accessible_memory pointer null?: false
/tmp/conda-bld-output/bld/rattler-build_librmm/work/cpp/tests/mr/device/mr_ref_test.hpp:79: Failure
Value of: is_device_accessible_memory(ptr)
Actual: false
Expected: true
[ FAILED ] ResourceTests/mr_ref_test.SetCurrentDeviceResourceRef/System, where GetParam() = "System" (0 ms)
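Based on that output, the accessibility check is presumably doing something along these lines (a sketch reconstructed from the log above, not copied from the test; treating a non-null devicePointer as "device accessible" is my assumption):

```cpp
#include <cuda_runtime_api.h>

// Reconstructed sketch of an is_device_accessible_memory-style check.
// The failure mode in the log is: cudaPointerGetAttributes succeeds, but
// attrs.devicePointer is null, so the pointer is reported as inaccessible.
bool is_device_accessible_memory(void const* ptr)
{
  cudaPointerAttributes attrs{};
  if (cudaPointerGetAttributes(&attrs, ptr) != cudaSuccess) { return false; }
  return attrs.devicePointer != nullptr;
}
```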
CUDA 11 has been dropped in 25.08, so this problem won't be seen there. I still think it is worth studying a little further to make sure we understand the root cause, in case something needs to change for CUDA 12+.
It looks like this test is being skipped when I run the test suite on CUDA 11 locally:
[ RUN ] ResourceTests/mr_ref_test.SetCurrentDeviceResourceRef/System
/home/coder/rmm/cpp/tests/mr/device/mr_ref_test.hpp:471: Skipped
Skipping tests since the memory resource is not supported with this CUDA driver/runtime version
[ SKIPPED ] ResourceTests/mr_ref_test.SetCurrentDeviceResourceRef/System (302 ms)
Perhaps there is an issue in detecting the driver or runtime version.
The only significant system change is the Linux kernel version going from 5.15.0 to 6.8.0. I think RMM now concludes that system memory is supported, but the allocations do not actually work?
HMM support appears to be enabled now, while it was disabled previously. CI shows:
$ nvidia-smi -q | grep Addressing
Addressing Mode : HMM
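If the support check is driver/attribute based, a quick way to see what the CUDA runtime reports on these runners is something like the following (a sketch; I don't know exactly which attributes RMM's check consults, so cudaDevAttrPageableMemoryAccess is my guess at the relevant one for HMM/ATS):

```cpp
#include <cuda_runtime_api.h>

#include <cstdio>

int main()
{
  int driver_version  = 0;
  int runtime_version = 0;
  cudaDriverGetVersion(&driver_version);    // CUDA driver API version, e.g. 12020
  cudaRuntimeGetVersion(&runtime_version);  // e.g. 11080 for CUDA 11.8

  // 1 when the GPU can coherently access pageable host memory (HMM or ATS)
  int pageable_access = 0;
  cudaDeviceGetAttribute(&pageable_access, cudaDevAttrPageableMemoryAccess, 0);

  std::printf("driver=%d runtime=%d pageable_access=%d\n",
              driver_version, runtime_version, pageable_access);
  return 0;
}
```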
I will work to get a local reproducer of this and determine a path forward.
It seems to affect both CUDA 11 and some CUDA 12 jobs (example failure on CUDA 12.2, amd64, rockylinux8, L4, earliest driver).
I filed #1950 for now to unblock CI. I am going to continue doing some investigation locally and possibly on #1944, but getting CI unblocked is sufficient for now.
Next step: file a follow-up to #1950 and only skip the affected tests on HMM systems with affected drivers (possibly earlier than 565?).
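Roughly, that targeted skip could look something like this (a sketch only: the helper name, the attribute used for HMM detection, and the version cutoff are all placeholders, and cudaDriverGetVersion reports the CUDA driver API version rather than the 565-style display driver build, so the real check may need NVML or `nvidia-smi` parsing):

```cpp
#include <cuda_runtime_api.h>

#include <gtest/gtest.h>

// Hypothetical helper: detect HMM-style addressing. A real implementation
// might need NVML or parsing `nvidia-smi -q` instead of this attribute.
bool has_hmm_addressing()
{
  int pageable_access = 0;
  cudaDeviceGetAttribute(&pageable_access, cudaDevAttrPageableMemoryAccess, 0);
  return pageable_access == 1;
}

TEST(SystemMrTest, DeviceAccessible)
{
  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);  // CUDA driver API version, not the display driver build

  // Placeholder cutoff: skip only on HMM systems with suspected-affected drivers,
  // so coverage is kept everywhere else.
  constexpr int affected_driver_cutoff = 12060;  // stand-in value, not a confirmed threshold
  if (has_hmm_addressing() && driver_version < affected_driver_cutoff) {
    GTEST_SKIP() << "HMM system with a driver affected by the device-accessibility issue";
  }

  // ... existing SetCurrentDeviceResourceRef/System test body ...
}
```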
Closing this as complete with the workaround in #1944. See that PR for more information.