blazingsql
[BUG] Memory free leads to illegal memory access
Describe the bug
With 90 workers (16 GB GPU each), query 2 of the NVIDIA GPU BDB benchmark produces the following log entries and then crashes:
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:47:27.279|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor about to free memory from tasks|||||
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR of type rmm::bad_alloc in task::run. What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
2021-06-06 19:49:37.515|13|error||||ERROR in BlazingHostTable::get_gpu_table(). What: std::bad_alloc: CUDA error at: /sw/summit/ums/gen119/nvrapids_0.19_gcc_9.3.0/include/rmm/mr/device/managed_memory_resource.hpp:73: cudaErrorIllegalAddress an illegal memory access was encountered|||||
Could this be a race condition between a kernel consuming a cache allocation and the MemoryMonitor freeing that same allocation? Is there a lock in place to prevent this?
Steps/Code to reproduce bug
Run the GPU BDB benchmark (SF10K) on 16 GB GPUs with --rmm-managed-memory, BLAZING_ALLOCATOR_MODE=existing, and --memory-limit 45GB
Expected behavior
No crash
Environment overview
ppc64le, CUDA 11, BlazingSQL 0.19
Environment details
Additional context
I doubt that the memory monitor is the guilty party here, since it last freed memory about 30 seconds before the illegal memory access. But this is definitely a big problem and I am looking into it.
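The timing can be checked directly from the log excerpt in the report. Below is a minimal Python sketch that measures the gap between the last MemoryMonitor free and the first error. It assumes the pipe-delimited layout seen in those lines (timestamp in the first field, level in the third, message in the seventh). The helper names are hypothetical and not part of BlazingSQL, and the error message is shortened here.

```python
from datetime import datetime

# Three lines taken from the log in the report; the error text is shortened.
LOG = """\
2021-06-06 19:49:08.461|13|info|498234689|||MemoryMonitor successfully freed memory from tasks|||||
2021-06-06 19:49:36.834|13|debug|498234689|8|8|Compute Aggregate Kernel tasks created|495373|kernel_id|8||
2021-06-06 19:49:37.514|13|error||||ERROR in BlazingHostTable::get_gpu_table(). (message shortened)|||||
"""


def parse_line(line):
    """Split one log line into (timestamp, level, message).

    Assumed layout: field 0 is the timestamp, field 2 the level,
    field 6 the message. This is inferred from the excerpt only.
    """
    fields = line.split("|")
    timestamp = datetime.strptime(fields[0], "%Y-%m-%d %H:%M:%S.%f")
    return timestamp, fields[2], fields[6]


def gap_before_first_error(log_text):
    """Seconds between the last MemoryMonitor free and the first error.

    Returns None if there is no error or no free before it.
    """
    last_free = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        timestamp, level, message = parse_line(line)
        if level == "error":
            if last_free is None:
                return None
            return (timestamp - last_free).total_seconds()
        if "MemoryMonitor successfully freed" in message:
            last_free = timestamp
    return None


if __name__ == "__main__":
    print(f"{gap_before_first_error(LOG):.3f} s")  # prints "29.053 s"
```

On this excerpt the gap is about 29 seconds, which matches the estimate above. Note that a long gap alone does not rule out a use-after-free if the freed buffer is only touched later.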