Core dump when enabling UnsafeRowOpt with multiple threads in a YARN cluster

Open tobegit3hub opened this issue 1 year ago • 1 comments

Get the core dump when enabling UnsafeRowOpt with multiple threads.


tobegit3hub avatar Aug 05 '22 02:08 tobegit3hub

The byte array is corrupted, and we read an unexpected offset from it.


tobegit3hub avatar Aug 09 '22 11:08 tobegit3hub

We can simplify the case: it happens when we use distinct_count on string columns. Here is the simplest reproduction.

select
    distinct_count(`publisher_id`) over flattenRequest_userId_eventTime_0s_2d_1000 as flattenRequest_publisher_id_window_unique_count_60
from
    `flattenRequest`
window
    flattenRequest_userId_eventTime_0s_2d_1000 as (partition by `userId` order by `eventTime` rows_range between 172799999 preceding and 0s preceding MAXSIZE 1000)

We also found that the implementation of distinct_count, which uses std::unordered_set, is not thread-safe.
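
For illustration only, here is a minimal sketch of how a std::unordered_set could be guarded if one instance really were shared between threads. The class name LockedDistinctCount and its methods are hypothetical and are not OpenMLDB's implementation; a follow-up comment in this thread also concludes that distinct_count is not the root cause, so treat this as background on the thread-safety point rather than as the fix.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical helper, not part of OpenMLDB.
// Concurrent inserts into a std::unordered_set without synchronization
// are a data race, so every access goes through one mutex.
class LockedDistinctCount {
 public:
  // Records a value; duplicates are ignored by the set.
  void Add(const std::string& value) {
    std::lock_guard<std::mutex> lock(mu_);
    seen_.insert(value);
  }

  // Returns the number of distinct values seen so far.
  std::size_t Count() const {
    std::lock_guard<std::mutex> lock(mu_);
    return seen_.size();
  }

 private:
  mutable std::mutex mu_;  // mutable so Count() can lock in a const method
  std::unordered_set<std::string> seen_;
};
```

If each window keeps its own set and is only touched by one thread, no lock is needed at all, which is usually the cheaper design.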


tobegit3hub avatar Aug 11 '22 03:08 tobegit3hub

The issue is not specific to distinct_count. Other UDAFs such as count may not crash, but they can still return incorrect results.

The issue can be reproduced with a large window, multiple threads, and enough data. We added some debug code and found that the row pointer is corrupt or invalid when accessing previous rows in the window.

The cause may be the row pointers stored in the buffered window list. We pass the Spark DirectByteBuffer as the row pointer, and the JVM may release it after the current row's computation finishes. We have to find another way to manage the memory of a Spark DirectByteBuffer that may still be needed later for other rows' computations.
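
One possible direction, sketched under assumptions: have the window copy each row's bytes into memory it owns instead of storing the borrowed pointer. The class OwningRowWindow and its methods below are hypothetical names for illustration, not OpenMLDB's actual window classes, and the eviction rule is a simple row-count limit rather than the real rows_range logic.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical window buffer, not OpenMLDB's API.
// It owns a private copy of every row, so rows stay valid even if the
// caller's buffer (e.g. a Spark DirectByteBuffer) is freed or reused.
class OwningRowWindow {
 public:
  explicit OwningRowWindow(std::size_t max_size) : max_size_(max_size) {}

  // Copies `size` bytes from the borrowed pointer into the window.
  // When the window is full, the oldest row is evicted first.
  void Push(const std::uint8_t* borrowed, std::size_t size) {
    if (max_size_ == 0) return;
    if (rows_.size() == max_size_) rows_.pop_front();
    rows_.emplace_back(borrowed, borrowed + size);
  }

  // Returns the i-th oldest row; throws std::out_of_range if i is invalid.
  const std::vector<std::uint8_t>& At(std::size_t i) const {
    return rows_.at(i);
  }

  std::size_t Size() const { return rows_.size(); }

 private:
  std::size_t max_size_;
  std::deque<std::vector<std::uint8_t>> rows_;
};
```

The trade-off is an extra copy per row, which costs memory and CPU for large windows. An alternative would be to keep a JVM-side reference to each buffer alive for as long as the window holds its pointer.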

tobegit3hub avatar Aug 16 '22 04:08 tobegit3hub