risingwave
Bug(compaction): Unable to trigger split in time, when barrier latency is high
Describe the bug
In Hummock, the decision to split a compaction group is based on each table's flush (write) throughput. https://github.com/risingwavelabs/risingwave/blob/41f4ad55c636836fc9c7f7860ada535e26dbd6ca/src/meta/src/hummock/manager/mod.rs#L2597
To reduce the effect of jitter, we introduced a window (window_size) over these statistics, and a new sample is added to the window at each commit_epoch. https://github.com/risingwavelabs/risingwave/blob/41f4ad55c636836fc9c7f7860ada535e26dbd6ca/src/meta/src/hummock/manager/mod.rs#L1779
Recently, we found that when a single barrier carries a large amount of data, barrier latency rises and commit_epoch runs less often. The statistics window is then not updated in time, so the split cannot be triggered in time.
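To make the timing problem concrete, here is a minimal sketch of a per-commit statistics window. It is not the code in `mod.rs`: `ThroughputWindow`, `record_commit`, `should_split`, `seconds_until_decision`, and the "every sample above the threshold" rule are all hypothetical names and assumptions chosen for illustration. The only points taken from the description above are that one sample is added per commit_epoch and that the window has a fixed window_size.

```rust
use std::collections::VecDeque;

/// Hypothetical per-table window of flushed bytes, one sample per commit_epoch.
struct ThroughputWindow {
    window_size: usize,
    samples: VecDeque<u64>,
}

impl ThroughputWindow {
    fn new(window_size: usize) -> Self {
        Self {
            window_size,
            samples: VecDeque::with_capacity(window_size),
        }
    }

    /// Record the bytes flushed for this table in one committed epoch,
    /// evicting the oldest sample once the window is full.
    fn record_commit(&mut self, bytes: u64) {
        if self.samples.len() == self.window_size {
            self.samples.pop_front();
        }
        self.samples.push_back(bytes);
    }

    /// Assumed rule: split only when the window is full and every sample
    /// exceeds the threshold. The real condition may differ.
    fn should_split(&self, threshold: u64) -> bool {
        self.samples.len() == self.window_size
            && self.samples.iter().all(|&b| b > threshold)
    }
}

/// Wall-clock seconds needed to fill an empty window when a commit
/// happens once every `commit_interval_secs` seconds.
fn seconds_until_decision(window_size: u64, commit_interval_secs: u64) -> u64 {
    window_size * commit_interval_secs
}
```

Under these assumptions, a window of 3 samples fills after 3 seconds when commits arrive every second, but only after 180 seconds when barrier latency stretches the commit interval to 60 seconds. The window counts commits rather than elapsed time, which is why the split decision lags exactly when write pressure is highest.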
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
No response
Additional context
No response
I'm assuming that the write amplification within cg2 / cg3 is still due to the data misalignment factor. It doesn't seem reasonable to perform a split directly during the new table creation or recovery phase. (We don't support merge at the moment).
I prefer to do some data analysis in the flush phase and perform a split on the SST to promote boundary alignment.
@Little-Wallace @zwang28 @hzxa21
> I prefer to do some data analysis in the flush phase and perform a split on the SST to promote boundary alignment.
By split you mean putting data related to specific table ids in separate SSTs, not splitting compaction group, right?
If that is the case, is this a permanent change (applied to all future data related to these tables) or a temporary change (only applied to data related to these tables in some period)?
This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.