OpenMLDB Core when enabling UnsafeRowOpt with multiple threads in Yarn cluster


Open tobegit3hub opened this issue 1 year ago • 1 comments

We get a core dump when enabling UnsafeRowOpt with multiple threads.

Aug 05 '22 02:08 tobegit3hub

The byte array is corrupted, and we read an unexpected offset from it.

Aug 09 '22 11:08 tobegit3hub

We can simplify the case: it happens when we use distinct_count on string columns. Here is the simplest case.

select
    distinct_count(`publisher_id`) over flattenRequest_userId_eventTime_0s_2d_1000 as flattenRequest_publisher_id_window_unique_count_60
from
    `flattenRequest`
window
    flattenRequest_userId_eventTime_0s_2d_1000 as (partition by `userId` order by `eventTime` rows_range between 172799999 preceding and 0s preceding MAXSIZE 1000)

We also found that the implementation of distinct_count, which uses std::unordered_set, is not thread-safe.
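To illustrate the thread-safety concern, here is a minimal sketch that assumes several threads insert into one shared set. `GuardedDistinctCounter` and `CountDistinctConcurrently` are hypothetical names, not OpenMLDB's implementation. The only fact relied on is that concurrent writes to one `std::unordered_set` without synchronization are a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical helper, NOT OpenMLDB's distinct_count: a distinct counter
// whose std::unordered_set is guarded by a mutex so that several threads
// may insert into it concurrently.
class GuardedDistinctCounter {
 public:
    void Add(const std::string& value) {
        std::lock_guard<std::mutex> lock(mu_);
        seen_.insert(value);
    }

    std::size_t Count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return seen_.size();
    }

 private:
    mutable std::mutex mu_;
    std::unordered_set<std::string> seen_;
};

// Starts `num_threads` writers that each insert every element of `values`,
// then returns the number of distinct values seen.
std::size_t CountDistinctConcurrently(const std::vector<std::string>& values,
                                      int num_threads) {
    assert(num_threads > 0);
    GuardedDistinctCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, &values] {
            for (const auto& v : values) counter.Add(v);
        });
    }
    for (auto& w : workers) w.join();
    return counter.Count();
}
```

Giving each thread its own set and merging the sets at the end would avoid lock contention, at the cost of extra memory.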


Aug 11 '22 03:08 tobegit3hub

It is not related to distinct_count. Other UDAFs such as count may not crash but can return incorrect results.

This issue can be reproduced with a large window, multiple threads, and enough data. We added some debug code and found that the row pointer is corrupted or invalid when accessing previous rows in the window.

The cause may be the row pointers held in the buffered window list. We pass the Spark DirectByteBuffer as the row pointer, and the JVM may release it after the current row's computation. We have to find another way to manage the memory of a Spark DirectByteBuffer that may still be needed when computing later rows.
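One possible way to decouple the window from JVM-managed memory is to copy each row's bytes into storage the native side owns. This is a sketch under that assumption, not the fix adopted in OpenMLDB. `OwningRowWindow` and its methods are hypothetical names, and the approach trades an extra copy per row for pointer safety.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical window buffer, NOT OpenMLDB's actual class. It copies each
// row's bytes on insertion, so earlier rows stay readable even after the
// caller (for example the JVM) frees or reuses the source buffer.
class OwningRowWindow {
 public:
    explicit OwningRowWindow(std::size_t max_size) : max_size_(max_size) {
        assert(max_size > 0);
    }

    // Copies `size` bytes starting at `data`; the pointer is not retained.
    void Push(const std::uint8_t* data, std::size_t size) {
        rows_.emplace_front(data, data + size);
        if (rows_.size() > max_size_) rows_.pop_back();  // evict the oldest row
    }

    // Index 0 is the most recent row; larger indexes are older rows.
    const std::vector<std::uint8_t>& At(std::size_t index) const {
        return rows_.at(index);
    }

    std::size_t Size() const { return rows_.size(); }

 private:
    std::size_t max_size_;
    std::deque<std::vector<std::uint8_t>> rows_;
};
```

A caller would `Push` the bytes of each incoming row and read earlier rows through `At`. An alternative that avoids the copy would be to keep the source buffer alive on the JVM side for as long as the window references it.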

Aug 16 '22 04:08 tobegit3hub
