OpenMLDB
Core dump when enabling UnsafeRowOpt with multiple threads in a YARN cluster
We get a core dump when UnsafeRowOpt is enabled and multiple threads are used.
![119469f63205147a96a6be8a655cdb44](https://user-images.githubusercontent.com/2715000/182990259-329d983a-c24f-4aea-aec0-72d3ebad1733.png)
The byte array is corrupted, so an unexpected offset is read from it.
![image](https://user-images.githubusercontent.com/2715000/183637232-d6241e66-ec71-48b1-a84a-fc3767d67176.png)
We can simplify the case: it happens when we use distinct_count on string columns. Here is the simplest case.

    select
    distinct_count(`publisher_id`) over flattenRequest_userId_eventTime_0s_2d_1000 as flattenRequest_publisher_id_window_unique_count_60
    from
    `flattenRequest`
    window
    flattenRequest_userId_eventTime_0s_2d_1000 as (partition by `userId` order by `eventTime` rows_range between 172799999 preceding and 0s preceding MAXSIZE 1000)

We also found that the implementation of distinct_count, which uses std::unordered_set, is not thread-safe. However, the issue is not specific to distinct_count: other UDAFs such as count may return incorrect results, even though they may not crash.
This issue can be reproduced with a large window, multiple threads, and enough data. We added some debug code and found that the row pointer is corrupted or invalid when accessing previous rows in the window.
The cause may be the row pointers in the buffered window list. We pass the Spark DirectByteBuffer as the row pointer, and it may be released by the JVM after the current row's computation. We have to find another way to manage the memory of a Spark DirectByteBuffer that may still be needed by a later row's computation.
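
One possible direction, shown here only as a minimal sketch under assumptions: copy each row's bytes into memory owned by the window when the row is pushed, so later rows never dereference a pointer into a DirectByteBuffer that the JVM may already have released. `OwningWindowBuffer`, `Push`, `At`, and `Size` are hypothetical names used for illustration and are not OpenMLDB's actual window classes; the extra copy per row also has a cost that would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical window buffer that owns a private copy of every row it holds.
// It is not OpenMLDB's real window implementation, only an illustration.
class OwningWindowBuffer {
 public:
  explicit OwningWindowBuffer(std::size_t max_size) : max_size_(max_size) {}

  // Copies `size` bytes starting at `row` into window-owned storage.
  // After this returns, the caller (for example the JVM side) may release
  // or reuse the source buffer without affecting rows kept in the window.
  void Push(const std::int8_t* row, std::size_t size) {
    assert(row != nullptr || size == 0);
    rows_.emplace_back(row, row + size);
    // Evict the oldest row once the window exceeds its maximum size.
    if (rows_.size() > max_size_) {
      rows_.pop_front();
    }
  }

  std::size_t Size() const { return rows_.size(); }

  // Returns the i-th buffered row, oldest first.
  const std::vector<std::int8_t>& At(std::size_t i) const { return rows_[i]; }

 private:
  std::size_t max_size_;
  std::deque<std::vector<std::int8_t>> rows_;
};
```

With this layout, overwriting or freeing the source bytes after `Push` leaves the buffered rows unchanged, which is the property the borrowed row pointers appear to lack. Whether copying is acceptable, or whether the buffers should instead be kept alive on the JVM side until the window evicts them, is an open design choice.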