[BUG] Segfault after using cuvsRMMPoolMemoryResourceEnable/cuvsRMMMemoryResourceReset

Open ldematte opened this issue 2 months ago • 6 comments

Describe the bug

When using the C API cuvsRMMPoolMemoryResourceEnable/cuvsRMMMemoryResourceReset, cuvsBruteForceSearch fails with a SEGFAULT for a null pointer:

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000000

Here is the complete stack trace for the core dump. The JVM's signal handler sits in the way (it catches the SIGSEGV and aborts, hence the SIGABRT below), but it should be pretty clear nonetheless that this happens when cuvsBruteForceSearch tries to allocate some memory (rmm::device_buffer::allocate_async) after cuvsRMMMemoryResourceReset was called.

Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7fbdeabff6c0 (LWP 1338041))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007fbdec30cf4f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007fbdec2bdfb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007fbdec2a8472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007fbdeaea4179 in os::abort(bool, void const*, void const*) [clone .cold] () from /usr/lib/jvm/jdk-24.0.1/lib/server/libjvm.so
#5  0x00007fbdebbc3718 in VMError::report_and_die(int, char const*, char const*, __va_list_tag*, Thread*, unsigned char*, void const*, void const*, char const*, int, unsigned long) ()
   from /usr/lib/jvm/jdk-24.0.1/lib/server/libjvm.so
#6  0x00007fbdebbc3eab in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void const*, void const*, char const*, ...) () from /usr/lib/jvm/jdk-24.0.1/lib/server/libjvm.so
#7  0x00007fbdebbc3ece in VMError::report_and_die(Thread*, unsigned int, unsigned char*, void const*, void const*) () from /usr/lib/jvm/jdk-24.0.1/lib/server/libjvm.so
#8  0x00007fbdeba26f10 in JVM_handle_linux_signal () from /usr/lib/jvm/jdk-24.0.1/lib/server/libjvm.so
#9  <signal handler called>
#10 0x00007fbdc0794f14 in void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref) ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs_c.so
#11 0x00007fbdc16f8692 in rmm::device_buffer::allocate_async(unsigned long) () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/librmm.so
#12 0x00007fbdc16f86fa in rmm::device_buffer::device_buffer(unsigned long, rmm::cuda_stream_view, rmm::detail::cccl_async_resource_ref<cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible> >) () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/librmm.so
#13 0x00007fbd65d5d39f in void cuvs::neighbors::detail::tiled_brute_force_knn<float, long, float, raft::identity_op>(raft::resources const&, float const*, float const*, unsigned long, unsigned long, unsigned long, unsigned long, float*, long*, cuvsDistanceType, float, unsigned long, unsigned long, float const*, float const*, unsigned int const*, raft::identity_op, cuvs::neighbors::filtering::FilterType) [clone .constprop.0] ()                          
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs.so
#14 0x00007fbd65dbc9c6 in void cuvs::neighbors::detail::brute_force_search_filtered<float, long, unsigned int, float>(raft::resources const&, cuvs::neighbors::brute_force::index<float, float> const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, cuvs::neighbors::filtering::base_filter const*, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, std::optional<std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> > >) () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs.so                                                                  
#15 0x00007fbd65dbeb0f in void cuvs::neighbors::detail::search<float, long, float, std::experimental::layout_right>(raft::resources const&, cuvs::neighbors::brute_force::index<float, float> const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, cuvs::neighbors::filtering::base_filter const&) () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs.so                                                                                        
#16 0x00007fbdc07b2361 in cuvsBruteForceSearch::{lambda()#1}::operator()() const () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs_c.so
#17 0x00007fbdc07b339c in cuvsBruteForceSearch () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs_c.so

Steps/Code to reproduce bug

Check out https://github.com/rapidsai/cuvs/pull/1453 and run the Java tests (cd cuvs/java/cuvs-java && mvn clean verify). I'll try to reproduce this with plain C code (cuvsRMMPoolMemoryResourceEnable + cuvsRMMMemoryResourceReset + cuvsBruteForceSearch) when/if I have time, but I might not be able to.

Environment details (please complete the following information):

  • Environment location: Bare-metal
  • Method of RAFT install: from source (main)

Additional context

If GH allows it, I can attach the Java hs_err file and/or the core dump

ldematte avatar Oct 23 '25 09:10 ldematte

@benfred let me know if you need/want a core dump

ldematte avatar Oct 27 '25 11:10 ldematte

@benfred I was able to reproduce this just by using the C API. I have taken c/tests/neighbors/run_brute_force_c.c and modified it slightly, mainly to be self-contained.

I then added 2 lines at the beginning:

cuvsRMMPoolMemoryResourceEnable(10, 60, false);
cuvsRMMMemoryResourceReset();

Just turning on RMM pooling and then turning it off is enough to make cuvsBruteForceSearch segfault:

#0  0x00007f2fcc574154 in void* cuda::mr::__4::_Resource_vtable_builder::_Alloc_async<rmm::mr::device_memory_resource>(void*, unsigned long, unsigned long, cuda::__4::stream_ref) () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs_c.so
#1  0x00007f2fcc514692 in rmm::device_buffer::allocate_async(unsigned long) () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/librmm.so
#2  0x00007f2fcc5146fa in rmm::device_buffer::device_buffer(unsigned long, rmm::cuda_stream_view, rmm::detail::cccl_async_resource_ref<cuda::mr::__4::basic_resource_ref<(cuda::mr::__4::_AllocType)1, cuda::mr::__4::device_accessible> >) ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/librmm.so
#3  0x00007f2fcc515411 in rmm::device_buffer::resize(unsigned long, rmm::cuda_stream_view) ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/librmm.so
#4  0x00007f2fc18a26a6 in void cuvs::neighbors::detail::fusedL2Knn<long, float, false, float>(unsigned long, long*, float*, float const*, float const*, unsigned long, unsigned long, int, bool, bool, CUstream_st*, cuvs::distance::DistanceType, float const*, float const*) ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs.so
#5  0x00007f2fc1ba1a25 in void cuvs::neighbors::detail::brute_force_knn_impl<long, long, float, float>(raft::resources const&, std::vector<float*, std::allocator<float*> >&, std::vector<long, std::allocator<long> >&, long, float*, long, long*, float*, long, bool, bool, std::vector<long, std::allocator<long> >*, cuvs::distance::DistanceType, float, std::vector<float*, std::allocator<float*> >*, float const*) ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs.so
#6  0x00007f2fc1bd7fec in void cuvs::neighbors::detail::search<float, long, float, std::experimental::layout_right>(raft::resources const&, cuvs::neighbors::brute_force::index<float, float> const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, cuvs::neighbors::filtering::base_filter const&) ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs.so
#7  0x00007f2fcc591d2a in cuvsBruteForceSearch::{lambda()#1}::operator()() const ()
   from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs_c.so
#8  0x00007f2fcc592cdc in cuvsBruteForceSearch () from /home/ldematte/miniconda3/envs/cuvs-25-12/lib/libcuvs_c.so
#9  0x000055b87d902289 in main ()

bruteforce.c

ldematte avatar Nov 10 '25 14:11 ldematte

BTW, cuvsTieredIndexBuild suffers from the same problem: same stack trace, going through device_buffer and allocate_async.

ldematte avatar Nov 10 '25 14:11 ldematte

Hey @benfred, I think I found the issue. In cuvsRMMMemoryResourceReset, we call rmm::mr::set_current_device_resource(nullptr); but I think this is incorrect.

If you look at the code that is failing (https://github.com/rapidsai/rmm/blob/f965b7f98805e96ccd4fb3e7774f9b8e38ad2bdb/cpp/src/device_buffer.cpp#L90, via device_buffer::resize -> device_buffer ctor), it appears that it never expects the rmm device resource to be null. I looked at other usages of set_current_device_resource, and none is passing null.
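To make the failure mode concrete, here is a minimal self-contained sketch. It is a toy model, not RMM code: toy_resource, toy_set_current, toy_get_current and toy_buffer_alloc are hypothetical names, and the assumption that the slot stores the pointer verbatim is mine. The stack trace only shows that the allocation path ended up with a null resource. The consumer returns false where real code would dereference the pointer, so the effect can be observed without crashing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

namespace null_reset_demo {

// Toy stand-in for a device memory resource (hypothetical, NOT the RMM type).
struct toy_resource {
  void* allocate(std::size_t bytes) { return std::malloc(bytes); }
  void deallocate(void* p) { std::free(p); }
};

// Toy "current resource" slot holding a raw pointer.
inline toy_resource*& current_slot() {
  static toy_resource initial;              // default resource, lives for the whole program
  static toy_resource* current = &initial;  // starts out pointing at the default
  return current;
}

// Assumption of this sketch: the setter stores whatever it is given, including nullptr.
inline toy_resource* toy_set_current(toy_resource* mr) {
  toy_resource* old = current_slot();
  current_slot() = mr;
  return old;
}

inline toy_resource* toy_get_current() { return current_slot(); }

// Consumer in the style of a buffer allocation: it relies on the resource being valid.
// Returns false at the point where unguarded code would dereference a null pointer.
inline bool toy_buffer_alloc(std::size_t bytes) {
  toy_resource* mr = toy_get_current();
  if (mr == nullptr) { return false; }
  void* p = mr->allocate(bytes);
  assert(p != nullptr);
  mr->deallocate(p);
  return true;
}

}  // namespace null_reset_demo
```

In this model, allocation works until toy_set_current(nullptr) is called; every later toy_buffer_alloc call then hits the null branch, which mirrors the order of calls in the repro (enable, reset, search).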

One option is to save the original resource we get inside cuvsRMMPoolMemoryResourceEnable and pass it back to cuvsRMMMemoryResourceReset so it can restore it. That means returning some state, e.g. via a new cuvsRMMMemoryResource_t, which might also let us get rid of the thread local, if we find that convenient.
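A rough sketch of that save-and-restore idea, again as a toy model rather than cuVS or RMM code. toy_saved_state plays the role of the proposed state handle; all names here are hypothetical and the real C API would need an opaque handle type instead of a struct returned by value.

```cpp
#include <cassert>

namespace save_restore_demo {

// Toy stand-in for a memory resource (hypothetical).
struct toy_resource {
  int id;
};

// Toy "current resource" slot holding a raw pointer.
inline toy_resource*& current_slot() {
  static toy_resource initial{0};
  static toy_resource* current = &initial;
  return current;
}

// Hypothetical state returned by "enable": remembers what was current before.
struct toy_saved_state {
  toy_resource* previous;
};

// Install the pool and hand the previous resource back to the caller.
inline toy_saved_state toy_pool_enable(toy_resource* pool) {
  assert(pool != nullptr);
  toy_saved_state state{current_slot()};
  current_slot() = pool;
  return state;
}

// Restore exactly what was current before "enable"; never stores nullptr
// as long as the saved state came from toy_pool_enable.
inline void toy_pool_reset(const toy_saved_state& state) {
  assert(state.previous != nullptr);
  current_slot() = state.previous;
}

}  // namespace save_restore_demo
```

The upside of this shape is that a custom resource installed before the pool was enabled would survive the enable/reset round trip; the cost is an API change, since the caller has to carry the state between the two calls.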

Or, maybe simpler, use the new rmm::mr::reset_current_device_resource_ref(): https://github.com/rapidsai/rmm/blob/f965b7f98805e96ccd4fb3e7774f9b8e38ad2bdb/cpp/include/rmm/mr/per_device_resource.hpp#L466

I'm currently trying out the second one, I'll let you know how it goes.

ldematte avatar Nov 14 '25 10:11 ldematte

OK, apparently there is a new set of RMM functions that do not expose raw pointers but "ref" objects, and they are not interchangeable. So using rmm::mr::reset_current_device_resource_ref() will not work (we'd have to change all the *_device_resource functions to use the new *_device_resource_ref counterparts).

But changing to

rmm::mr::set_current_device_resource(rmm::mr::detail::initial_resource());

works as expected and should be equivalent. With this change, my C repro no longer crashes. I will double-check, run the Java tests too, and if everything works I'll raise a PR with the fix.
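For illustration, the shape of that fix in a self-contained toy model (hypothetical names throughout; toy_initial_resource only stands in for the role of the initial resource and says nothing about how RMM implements it). The point is that reset falls back to a resource that always exists instead of storing a null pointer.

```cpp
#include <cassert>

namespace initial_reset_demo {

// Toy stand-in for a memory resource (hypothetical).
struct toy_resource {
  const char* name;
};

// Toy counterpart of an "initial resource": one instance that outlives all users.
inline toy_resource* toy_initial_resource() {
  static toy_resource initial{"initial"};
  return &initial;
}

// Toy "current resource" slot, starting at the initial resource.
inline toy_resource*& current_slot() {
  static toy_resource* current = toy_initial_resource();
  return current;
}

// Switch to a pool resource.
inline void toy_enable_pool(toy_resource* pool) {
  assert(pool != nullptr);
  current_slot() = pool;
}

// Reset never stores nullptr: it falls back to the initial resource.
inline void toy_reset() { current_slot() = toy_initial_resource(); }

}  // namespace initial_reset_demo
```

Compared with saving and restoring the previous resource, this needs no extra state and is safe to call even if the pool was never enabled. The trade-off in this model is that a custom resource set before enabling the pool would be replaced by the initial one on reset.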

ldematte avatar Nov 14 '25 12:11 ldematte