
[C++] Crashed at TempStack alloc when use Hashing32::HashBatch independently

Open ZhangHuiGui opened this issue 11 months ago • 4 comments

Describe the bug, including details regarding any error messages, version, and platform.

The issue is similar to https://github.com/apache/arrow/pull/40007, but they are different. I want to use the Hashing32::HashBatch API to produce a hash array for a batch. Although Hashing32 and Hashing64 are used in join-related code, they can also be used independently.

Like the code below:

  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
  const int batch_len = arr->length();
  arrow::compute::ExecBatch exec_batch({arr}, batch_len);
  auto ctx = arrow::compute::default_exec_context();
  arrow::util::TempVectorStack stack;
  ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t))); // I allocate only the stack size I need.

  std::vector<uint32_t> hashes(batch_len);
  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
      exec_batch, hashes.data(), temp_column_arrays,
      ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));

The crash stack in HashBatch is:

arrow::compute::Hashing32::HashBatch
  arrow::compute::Hashing32::HashMultiColumn
      arrow::util::TempVectorHolder<unsigned int>::TempVectorHolder
        arrow::util::TempVectorStack::alloc
          ARROW_DCHECK(top_ <= buffer_size_); // top_=4176, buffer_size_=160

The reason is the code below: https://github.com/apache/arrow/blob/7e286dd004a8fcf2de0f58615793338076741208/cpp/src/arrow/compute/key_hash.cc#L385-L387

The holder uses max_batch_size (which is 1024) as its num_elements, which is far more than the temp stack's initial buffer_size.

I know that HashBatch is currently only used in hash-join and related code. For joins, the input is already sliced at the upper level, ensuring that each batch's size is at most util::MiniBatch::kMiniBatchLength and the stack is large enough.

But it can be used independently. So maybe we could use num_rows rather than util::MiniBatch::kMiniBatchLength in the HashBatch-related APIs?

Component(s)

C++

ZhangHuiGui avatar Mar 09 '24 07:03 ZhangHuiGui

cc @kou, PTAL at this issue?

ZhangHuiGui avatar Mar 09 '24 07:03 ZhangHuiGui

Can you provide a buildable C++ code that reproduces this problem?

kou avatar Mar 09 '24 08:03 kou

> Can you provide a buildable C++ code that reproduces this problem?

Of course.

#include <arrow/compute/exec.h>
#include <arrow/compute/util.h>
#include <arrow/testing/gtest_util.h>
#include <arrow/testing/random.h>
#include <arrow/type_fwd.h>
#include <arrow/compute/light_array.h>
#include <arrow/compute/key_hash.h>
#include <arrow/util/async_util.h>
#include <arrow/util/future.h>
#include <arrow/util/task_group.h>
#include <arrow/util/thread_pool.h>
#include <arrow/util/logging.h>
#include <arrow/acero/options.h>
#include <arrow/compute/api_vector.h>
#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/builder.h>
#include <arrow/result.h>
#include <arrow/array/diff.h>

#include <iostream>
#include <mutex>
#include <thread>
#include <unordered_map>
#include "gtest/gtest.h"

TEST(HashBatch, BasicTest) {
  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
  const int batch_len = arr->length();
  arrow::compute::ExecBatch exec_batch({arr}, batch_len);
  auto ctx = arrow::compute::default_exec_context();
  arrow::util::TempVectorStack stack;
  ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)));

  std::vector<uint32_t> hashes(batch_len);
  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
      exec_batch, hashes.data(), temp_column_arrays,
      ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));

  for (int i = 0; i < batch_len; i++) {
    std::cout << hashes[i] << " ";
  }
}

cc @kou, do you think this problem needs to be solved?

ZhangHuiGui avatar Mar 09 '24 08:03 ZhangHuiGui

Sorry. I missed this.

Thanks. I could run the code:

diff --git a/cpp/src/arrow/compute/key_hash_test.cc b/cpp/src/arrow/compute/key_hash_test.cc
index c998df7169..ccfddaa645 100644
--- a/cpp/src/arrow/compute/key_hash_test.cc
+++ b/cpp/src/arrow/compute/key_hash_test.cc
@@ -311,5 +311,24 @@ TEST(VectorHash, FixedLengthTailByteSafety) {
   HashFixedLengthFrom(/*key_length=*/19, /*num_rows=*/64, /*start_row=*/63);
 }
 
+TEST(HashBatch, BasicTest) {
+  auto arr = arrow::ArrayFromJSON(arrow::int32(), "[9,2,6]");
+  const int batch_len = arr->length();
+  arrow::compute::ExecBatch exec_batch({arr}, batch_len);
+  auto ctx = arrow::compute::default_exec_context();
+  arrow::util::TempVectorStack stack;
+  ASSERT_OK(stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)));
+
+  std::vector<uint32_t> hashes(batch_len);
+  std::vector<arrow::compute::KeyColumnArray> temp_column_arrays;
+  ASSERT_OK(arrow::compute::Hashing32::HashBatch(
+      exec_batch, hashes.data(), temp_column_arrays,
+      ctx->cpu_info()->hardware_flags(), &stack, 0, batch_len));
+
+  for (int i = 0; i < batch_len; i++) {
+    std::cout << hashes[i] << " ";
+  }
+}
+
 }  // namespace compute
 }  // namespace arrow

In general, allocating only the required size is preferred. So using num_rows rather than util::MiniBatch::kMiniBatchLength in the HashBatch-related APIs may be better. (Sorry, I can't determine whether num_rows is the correct size for it right now.)

But it seems that stack.Init(ctx->memory_pool(), batch_len * sizeof(uint32_t)) isn't enough for this case, because HashMultiColumn() needs at least 3 allocations:

https://github.com/apache/arrow/blob/605f8a792c388afb2230b1f19e0f3e4df90d5abe/cpp/src/arrow/compute/key_hash.cc#L387-L395

And we also need 16 bytes of metadata per allocation:

https://github.com/apache/arrow/blob/605f8a792c388afb2230b1f19e0f3e4df90d5abe/cpp/src/arrow/compute/util.cc#L35-L44

Anyway, could you try this?

kou avatar Mar 11 '24 23:03 kou

Issue resolved by pull request 40484 https://github.com/apache/arrow/pull/40484

pitrou avatar Apr 02 '24 16:04 pitrou