Recover real-time consumption when consuming segment stuck

Open Jackie-Jiang opened this issue 3 years ago • 4 comments

One possible solution is to manually delete the consuming segment and let the controller create a new one during real-time segment validation. However, PinotLLCRealtimeSegmentManager.ensureAllPartitionsConsuming() does not cover this scenario yet.

Jackie-Jiang avatar Apr 27 '21 20:04 Jackie-Jiang

Another way is to manually set the consuming segment to the OFFLINE state (if it is not already so).

mcvsubbu avatar Apr 27 '21 22:04 mcvsubbu

@Jackie-Jiang what was the exact scenario when consumption got "stuck"? As of now, we retry a few times, and if the stream keeps throwing an exception, we automatically mark the state as OFFLINE in the ideal state. If all replicas are marked OFFLINE, automatic recovery happens through the periodic job. If only some replicas are OFFLINE, the others are allowed to complete the segment, and eventually all replicas have a copy of the completed segment.
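For illustration, the behaviour described above can be modelled as a small decision function. This is only a toy sketch, not Pinot's actual code: the `Action` enum and `recovery_action` function are hypothetical names, and the input is assumed to be a plain mapping of server name to ideal-state value for a single consuming segment.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical labels for the three outcomes described in the comment.
    NONE = "no recovery needed"
    WAIT_FOR_PEERS = "remaining replicas complete the segment"
    RECREATE = "periodic job recovers the partition"


def recovery_action(replica_states):
    """Pick a recovery outcome for one consuming segment.

    replica_states: dict of server name -> state string
    (for example "CONSUMING" or "OFFLINE").
    """
    states = list(replica_states.values())
    if not states:
        raise ValueError("segment has no replicas")
    offline = sum(1 for state in states if state == "OFFLINE")
    if offline == 0:
        # No replica gave up on the stream.
        return Action.NONE
    if offline == len(states):
        # Every replica is OFFLINE: only the periodic job can recover.
        return Action.RECREATE
    # Some replicas are still consuming and can complete the segment.
    return Action.WAIT_FOR_PEERS
```

A segment with one OFFLINE replica and one CONSUMING replica maps to `Action.WAIT_FOR_PEERS`; only when every replica is OFFLINE does the periodic job have to step in.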

By "stuck", did you mean bad data? The only way to recover from bad data as of now is to let the periodic job keep retrying until the bad offset is retained out of the underlying stream, at which point it will pick a valid offset and continue consumption. That is a long time to wait in production use cases. Is this the scenario you were referring to?

mcvsubbu avatar Nov 16 '21 17:11 mcvsubbu

@mcvsubbu I don't remember the exact scenario when this ticket was created. But I do think we should handle the case of a partition not having a consuming segment. It could happen in the following cases:

  • Consuming segment manually deleted accidentally
  • Schema evolution to recreate the consuming segment
  • Bootstrap a realtime table with immutable segments (e.g. cloning a realtime table)
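All three cases leave a partition with no consuming segment. A rough way to detect that condition from a snapshot of the ideal state is sketched below. It assumes segment names follow the low-level consumer pattern `table__partition__sequence__timestamp` and that the ideal state is available as a nested dict; the function name and input shape are hypothetical, not part of Pinot.

```python
def partitions_without_consuming_segment(ideal_state, expected_partitions):
    """Return the sorted partition ids that have no CONSUMING segment.

    ideal_state: dict of segment name -> {server name: state string}.
    expected_partitions: iterable of partition ids the stream exposes.
    """
    consuming = set()
    for segment_name, replica_states in ideal_state.items():
        parts = segment_name.split("__")
        # Assumed name layout: table__partition__sequence__timestamp.
        if len(parts) != 4 or not parts[1].isdigit():
            continue  # skip names that do not match the assumed pattern
        if "CONSUMING" in replica_states.values():
            consuming.add(int(parts[1]))
    return sorted(set(expected_partitions) - consuming)
```

Any partition returned by this check is one that the validation logic would need to create a new consuming segment for.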

Jackie-Jiang avatar Nov 17 '21 22:11 Jackie-Jiang

I believe that with the pause/resume feature, specifically the consumeFrom option of the resume endpoint, we can now address the requirements of this issue. @Jackie-Jiang what do you think? Should we close this issue?

sajjad-moradi avatar Aug 30 '22 23:08 sajjad-moradi

@sajjad-moradi Yes, I think the pause/resume can be used to address all 3 cases mentioned above. Thanks for adding the great feature!

Jackie-Jiang avatar Aug 31 '22 21:08 Jackie-Jiang
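
For readers landing on this thread, here is a minimal sketch of how the pause/resume requests might be assembled. The endpoint paths (`/tables/{table}/pauseConsumption` and `/tables/{table}/resumeConsumption`), the use of POST, and the `smallest` / `largest` values for `consumeFrom` are assumptions to verify against your controller's Swagger page; the helper names are hypothetical. The code only builds URLs and sends nothing.

```python
from urllib.parse import quote, urlencode


def _table_url(controller, table, action):
    # Assumed path layout: <controller>/tables/<table>/<action>
    return f"{controller.rstrip('/')}/tables/{quote(table, safe='')}/{action}"


def pause_url(controller, table):
    """URL to POST to in order to pause consumption (assumed endpoint)."""
    return _table_url(controller, table, "pauseConsumption")


def resume_url(controller, table, consume_from=None):
    """URL to POST to in order to resume consumption (assumed endpoint).

    consume_from: optional offset criterion, assumed to accept values
    such as "smallest" or "largest"; omitted from the URL when None.
    """
    url = _table_url(controller, table, "resumeConsumption")
    if consume_from is not None:
        url += "?" + urlencode({"consumeFrom": consume_from})
    return url
```

Pausing first and then resuming with an explicit `consumeFrom` is the sequence the comments above rely on to recreate consuming segments.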