[BUG] cudf::left_anti_join fails with a signal error (SIGABRT) instead of throwing an exception when there is an OOM condition

Open aocsa opened this issue 1 year ago • 6 comments

Describe the bug

When an out-of-memory (OOM) condition occurs, cudf::left_anti_join fails with a signal error (SIGABRT) instead of throwing an appropriate exception (std::bad_alloc).

Steps/Code to reproduce bug

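#include <cudf/column/column.hpp>
#include <cudf/filling.hpp>
#include <cudf/join.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/table/table.hpp>

#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/per_device_resource.hpp>
#include <rmm/mr/device/pool_memory_resource.hpp>

#include <gtest/gtest.h>

#include <iostream>
#include <memory>
#include <vector>

TEST(CudfTest, LeftAntiJoinOOM) {
  // Wrap the current resource in a tiny pool (256 B initial, 2560 B maximum) so the join runs out of memory.
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource();
  auto pool_mr = std::make_shared<rmm::mr::pool_memory_resource<rmm::mr::device_memory_resource>>(mr, 256, 2560);
  rmm::mr::set_current_device_resource(pool_mr.get());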

  auto make_table = [](int32_t size, int32_t start) -> std::unique_ptr<cudf::table> {
    auto sequence_column = cudf::sequence(size, cudf::numeric_scalar<int32_t>(start));

    std::vector<std::unique_ptr<cudf::column>> columns;
    columns.push_back(std::move(sequence_column));
    return std::make_unique<cudf::table>(std::move(columns));
  };

  try {
    auto left = make_table(64, 0);
    auto right = make_table(128, 50);

    std::cerr << "left size: " << left->num_rows() << ", right size: " << right->num_rows() << "\n";
    std::unique_ptr<rmm::device_uvector<cudf::size_type>> left_indices =
        cudf::left_anti_join(left->view(), right->view());

    std::cerr << "done left_anti_join " << "\n";

  } catch(const std::exception& e) {
    std::cerr << "Caught exception: " << e.what() << "\n";
  }
}

Running this test terminates the process with SIGABRT instead of letting the catch block handle the std::bad_alloc exception:

left size: 64, right size: 128
terminate called after throwing an instance of 'rmm::out_of_memory'
  what():  std::bad_alloc: out_of_memory: RMM failure at:/home/alexander/envs/theseus_dev/include/rmm/mr/device/pool_memory_resource.hpp:313: Maximum pool size exceeded
Aborted (core dumped)

Expected behavior

The function should throw a std::bad_alloc exception which can be caught and handled gracefully, instead of terminating the program with a signal error.
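
To make the expected behavior concrete, here is a minimal, library-free sketch of graceful handling. Everything in it is hypothetical and only stands in for the real types: toy_out_of_memory mimics an OOM exception that derives from std::bad_alloc (as the what() string in the abort output suggests rmm::out_of_memory does), and toy_pool mimics a pool with a maximum size. It is not cuDF or RMM code.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>

// Hypothetical stand-in for an out-of-memory error that derives from std::bad_alloc.
struct toy_out_of_memory : std::bad_alloc {
  const char* what() const noexcept override
  {
    return "toy_out_of_memory: maximum pool size exceeded";
  }
};

// Hypothetical fixed-capacity pool: it only tracks byte counts and allocates nothing.
class toy_pool {
 public:
  explicit toy_pool(std::size_t max_bytes) : max_bytes_{max_bytes} {}

  // Throws toy_out_of_memory when the request does not fit in the remaining capacity.
  void reserve(std::size_t bytes)
  {
    if (bytes > max_bytes_ - used_) { throw toy_out_of_memory{}; }
    used_ += bytes;
  }

  std::size_t used() const noexcept { return used_; }

 private:
  std::size_t max_bytes_;
  std::size_t used_{0};
};

// Returns true if the reservation succeeded, or false if the OOM condition
// was caught; in the latter case the error text is stored in `message`.
bool try_reserve(toy_pool& pool, std::size_t bytes, std::string& message)
{
  try {
    pool.reserve(bytes);
    return true;
  } catch (const std::bad_alloc& e) {
    message = e.what();
    return false;
  }
}
```

With a 2560-byte toy_pool, try_reserve succeeds for 1024 bytes and returns false for a further 4096 bytes, leaving the pool usable. This is the kind of recovery the caller of cudf::left_anti_join would expect to be able to perform.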

Environment details

Method of cuDF install: source code v24.06.00 branch release

Additional context

After debugging the internal functions utilized in cudf::left_anti_join, I determined that the cudf::detail::contains call is failing.

https://github.com/rapidsai/cudf/blob/c83e5b3fdd7f9fe8a08c4f6874fbf847bba70c53/cpp/src/join/semi_join.cu#L70

https://github.com/rapidsai/cudf/blob/c83e5b3fdd7f9fe8a08c4f6874fbf847bba70c53/cpp/include/cudf/detail/search.hpp#L95

aocsa avatar Jun 18 '24 20:06 aocsa

This is, I think, a bug in cuco. The left anti join uses cudf::detail::contains, which uses a cuco::static_set whose implementation type is cuco::detail::open_addressing_impl. That object's constructor allocates space for storage but is marked noexcept, which is incorrect: when the allocation throws, the exception cannot leave the noexcept constructor, so std::terminate is called and the process aborts.
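
The effect of noexcept on an allocating constructor can be sketched without cuco. The types below (throwing_storage, noexcept_storage) and the always-failing failing_allocate helper are hypothetical illustrations, not the actual cuco classes. They only show the language rule: an exception thrown inside a noexcept constructor never reaches the caller's catch block.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>

// Hypothetical allocator stand-in that always fails, simulating an exhausted pool.
inline void* failing_allocate(std::size_t) { throw std::bad_alloc{}; }

// Constructor allocates and is NOT noexcept: the failure propagates to the caller.
struct throwing_storage {
  explicit throwing_storage(std::size_t n) : data_{failing_allocate(n)} {}
  void* data_;
};

// Constructor allocates but IS marked noexcept: if failing_allocate throws,
// the exception cannot leave the constructor and std::terminate is called.
struct noexcept_storage {
  explicit noexcept_storage(std::size_t n) noexcept : data_{failing_allocate(n)} {}
  void* data_;
};

// The compiler trusts the noexcept annotation, even though the body can throw.
static_assert(!std::is_nothrow_constructible<throwing_storage, std::size_t>::value,
              "throwing_storage may throw during construction");
static_assert(std::is_nothrow_constructible<noexcept_storage, std::size_t>::value,
              "noexcept_storage claims it never throws");

// Returns true when the allocation failure reaches the caller as a catchable
// exception. Swapping in noexcept_storage here would abort the process before
// the catch block is reached.
inline bool construction_failure_is_catchable()
{
  try {
    throwing_storage s{128};
    (void)s;
    return false;
  } catch (const std::bad_alloc&) {
    return true;
  }
}
```

Calling construction_failure_is_catchable() returns true for the non-noexcept type. The noexcept variant is deliberately never constructed here, because doing so would terminate the program in the same way the report describes.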

wence- avatar Jun 19 '24 10:06 wence-

cc @GregoryKimball

Unfortunately, this isn't something that can effectively be recovered from. I understand that the underlying issue is in cuCollections, but once there's a fix, would this be worth a 24.06.01 hotfix release with a patched cuCollections?

kkraus14 avatar Jun 19 '24 15:06 kkraus14

A solution to this issue is available in cuCollections here https://github.com/NVIDIA/cuCollections/commit/1f09fa9b8c5c846511589b76cae0d585c8cf965a.

The issue cannot be effectively recovered from, and it is worth considering a hotfix release with a patched cuCollections, as the fix is already available.

aocsa avatar Jun 21 '24 01:06 aocsa

The cuco fix has landed at ToT (top of tree). Unfortunately, that branch includes some breaking changes for RAPIDS (which have already been addressed for the 24.08 release).

I can provide a bugfix branch based on the branch used in 24.06, in case you consider it worth a hotfix release for RAPIDS.

sleeepyjack avatar Jun 21 '24 03:06 sleeepyjack

That would be good @sleeepyjack if you could! It would be helpful to get a hotfix release with this patched.

cryos avatar Jun 24 '24 20:06 cryos

I have a fix up in #16077.

vyasr avatar Jun 24 '24 23:06 vyasr