
[BUG] Memory free leads to illegal memory access

Open jglaser opened this issue 3 years ago • 1 comments

Describe the bug

With 90 workers (16 GB GPU each), query 2 of the NVIDIA GPU BDB benchmark produces the following log entries and then crashes:

2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR of type rmm::bad_alloc in task::run. What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||

Could this be a race condition between a kernel consuming a cache allocation and the MemoryMonitor freeing that same allocation? Is there a lock in place to prevent this?

Steps/Code to reproduce bug

Run the GPU BDB benchmark (SF10K) on 16 GB GPUs with --rmm-managed-memory, BLAZING_ALLOCATOR_MODE=existing, and --memory-limit 45GB

Expected behavior

No crash

Environment overview (please complete the following information)

ppc64le, CUDA 11, BlazingSQL 0.19

jglaser, Jun 07 '21 00:06

I doubt that the memory monitor is the guilty party here, considering that it last freed memory about 30 seconds before the illegal memory access. But this is definitely a big problem, and I am looking into it.

wmalpica, Jun 11 '21 19:06
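
The roughly 30-second gap mentioned in the comment can be checked directly against the timestamps in the log. The following is a minimal sketch, not part of BlazingSQL: it assumes the pipe-delimited layout seen in the excerpt (timestamp in field 0, level in field 2, message in field 6), and the helper name `parse` is made up for illustration. The error message is shortened with `...` to keep the example readable.

```python
from datetime import datetime

# Three lines taken from the log excerpt in the issue; the error text is shortened.
LOG = """\
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: ...|||||
"""

def parse(line):
    """Split one log line into (timestamp, level, message).

    Assumes the field positions observed in the excerpt above.
    """
    fields = line.split("|")
    ts = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S.%f")
    return ts, fields[2], fields[6]

records = [parse(line) for line in LOG.splitlines() if line]

# Last time the MemoryMonitor reported a successful free.
last_free = max(ts for ts, _, msg in records if "successfully freed" in msg)
# Last kernel task creation logged before the failure.
last_kernel = max(ts for ts, _, msg in records if "Kernel tasks created" in msg)
# First error entry.
first_error = min(ts for ts, level, _ in records if level == "error")

free_to_error_s = (first_error - last_free).total_seconds()
kernel_to_error_s = (first_error - last_kernel).total_seconds()

print(f"free -> error:   {free_to_error_s:.3f} s")    # 29.053 s
print(f"kernel -> error: {kernel_to_error_s:.3f} s")  # 0.680 s
```

On these three lines the last MemoryMonitor free precedes the first error by about 29 seconds, while the Compute Aggregate kernel tasks were created less than a second before it. This only shows the ordering of log entries; it does not by itself rule out the MemoryMonitor or identify the cause.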