[BUG] cudf::left_anti_join fails with a signal error (SIGABRT) instead of throwing an exception when there is an OOM condition
Describe the bug
When an out-of-memory (OOM) condition occurs, cudf::left_anti_join fails with a signal error (SIGABRT) instead of throwing an appropriate exception (std::bad_alloc).
Steps/Code to reproduce bug
TEST(CudfTest, LeftAntiJoinOOM) {
  // Wrap the current device resource in a deliberately tiny pool so the join runs out of memory.
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource();
  auto pool_mr = std::make_shared<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>>(mr, 256, 2560);
  rmm::mr::set_current_device_resource(pool_mr.get());

  // Build a single-column table holding the int32 sequence start, start + 1, ...
  auto make_table = [](int32_t size, int32_t start) -> std::unique_ptr<cudf::table> {
    auto sequence_column = cudf::sequence(size, cudf::numeric_scalar<int32_t>(start));
    std::vector<std::unique_ptr<cudf::column>> columns;
    columns.push_back(std::move(sequence_column));
    return std::make_unique<cudf::table>(std::move(columns));
  };

  try {
    auto left = make_table(64, 0);
    auto right = make_table(128, 50);
    std::cerr << "left size: " << left->num_rows() << ", right size: " << right->num_rows() << "\n";
    std::unique_ptr<rmm::device_uvector<cudf::size_type>> left_indices =
      cudf::left_anti_join(left->view(), right->view());
    std::cerr << "done left_anti_join " << "\n";
  } catch(const std::exception& e) {
    // Expected to land here on OOM; instead the process aborts before reaching this handler.
    std::cerr << "Caught exception: " << e.what() << "\n";
  }
}
Running this test produces a SIGABRT (Abort signal) instead of catching a std::bad_alloc exception:
left size: 64, right size: 128
terminate called after throwing an instance of 'rmm::out_of_memory'
what(): std::bad_alloc: out_of_memory: RMM failure at:/home/alexander/envs/theseus_dev/include/rmm/mr/device/pool_memory_resource.hpp:313: Maximum pool size exceeded
Aborted (core dumped)
Expected behavior
The function should throw a std::bad_alloc exception which can be caught and handled gracefully, instead of terminating the program with a signal error.
Environment details
Method of cuDF install: source code v24.06.00 branch release
Additional context
After debugging the internal functions utilized in cudf::left_anti_join, I determined that the cudf::detail::contains call is failing.
https://github.com/rapidsai/cudf/blob/c83e5b3fdd7f9fe8a08c4f6874fbf847bba70c53/cpp/src/join/semi_join.cu#L70
https://github.com/rapidsai/cudf/blob/c83e5b3fdd7f9fe8a08c4f6874fbf847bba70c53/cpp/include/cudf/detail/search.hpp#L95
I think this is a bug in cuco. The left anti join uses cudf::detail::contains, which uses a cuco::static_set whose implementation type is cuco::detail::open_addressing_impl. That object's constructor allocates storage but is marked noexcept, which is incorrect: an exception that escapes a noexcept function calls std::terminate, so the rmm::out_of_memory exception never reaches the caller's catch block.
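To illustrate the mechanism without cuDF or cuco, here is a minimal sketch in plain C++. All type and function names (limited_storage, throwing_container, noexcept_container, oom_is_catchable) are hypothetical stand-ins, not cuco's actual classes, and the 1024-byte limit is an arbitrary assumption that simulates the pool running out of memory.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for storage that cannot allocate more than 1024 bytes.
struct limited_storage {
  explicit limited_storage(std::size_t bytes) {
    if (bytes > 1024) { throw std::bad_alloc{}; }  // simulated OOM
  }
};

// Constructor is NOT noexcept: an allocation failure propagates to the caller.
struct throwing_container {
  explicit throwing_container(std::size_t bytes) : storage_(bytes) {}
  limited_storage storage_;
};

// Constructor IS noexcept even though its member's constructor can throw.
// If the member throws, std::terminate is called and no catch block runs.
struct noexcept_container {
  explicit noexcept_container(std::size_t bytes) noexcept : storage_(bytes) {}
  limited_storage storage_;
};

// The compiler accepts both definitions; the noexcept operator shows the difference.
static_assert(!noexcept(throwing_container(std::size_t{1})), "may throw");
static_assert(noexcept(noexcept_container(std::size_t{1})), "promises not to throw");

// Returns true if the simulated OOM reached the caller as std::bad_alloc.
bool oom_is_catchable(std::size_t bytes) {
  try {
    throwing_container c(bytes);
    (void)c;
    return false;
  } catch (const std::bad_alloc&) {
    return true;
  }
}
```

With this sketch, oom_is_catchable(4096) returns true because the exception propagates normally. Swapping throwing_container for noexcept_container inside the try block would terminate the process instead, which matches the "terminate called after throwing an instance of ..." output in the report.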
cc @GregoryKimball
Unfortunately, this isn't something that can effectively be recovered from. I understand the underlying issue is in cuCollections, but once there's a fix, would this be worth a 24.06.01 hotfix release with a patched cuCollections?
A solution to this issue is available in cuCollections here https://github.com/NVIDIA/cuCollections/commit/1f09fa9b8c5c846511589b76cae0d585c8cf965a.
The issue cannot be effectively recovered from, and it is worth considering a hotfix release with a patched cuCollections, as the fix is already available.
The cuco fix has landed on ToT (top of tree). Unfortunately, that branch includes some breaking changes for RAPIDS (which have already been addressed for the 24.08 release).
I can provide a bugfix branch based on the cuco branch used in 24.06 if you consider it worth a RAPIDS hotfix release.
That would be good @sleeepyjack if you could! It would be helpful to get a hotfix release with this patched.
I have a fix up in #16077.