risingwave Tracking: improve the robustness of recovery

Open wenym1 opened this issue 3 years ago • 2 comments

Our current recovery logic is not robust.

We have multiple concurrent in-flight barriers and trigger recovery when any of them fails. Recovery starts clearing the in-memory state of Hummock without waiting for the other concurrent barriers to finish, which causes those barriers' attempts to sync data to panic.

Related issues are:

[ ] #5353
[ ] #5392

Both CN and meta should be modified.

The CN should fail all running sync tasks when any sync task fails, and then wait for the state cleaning triggered by meta.
The meta barrier manager should have a clearer state machine to handle both recovery and normal operation.
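
The two changes above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SyncTracker`, `BarrierManagerState`, `BarrierEvent`, and all of their methods are hypothetical names that do not come from the RisingWave codebase, and the real sync path is asynchronous while this sketch is synchronous.

```rust
use std::collections::BTreeSet;

/// Hypothetical CN-side tracker for in-flight sync tasks, keyed by epoch.
#[derive(Default)]
pub struct SyncTracker {
    pending: BTreeSet<u64>,
    /// Set once any sync fails; reset only by the state clean that meta triggers.
    poisoned: bool,
}

impl SyncTracker {
    /// Register a new sync task. Rejected while waiting for the state clean.
    pub fn start_sync(&mut self, epoch: u64) -> Result<(), String> {
        if self.poisoned {
            return Err(format!("epoch {}: waiting for state clean", epoch));
        }
        self.pending.insert(epoch);
        Ok(())
    }

    /// One sync failed: fail every pending sync with it and return
    /// the affected epochs in ascending order.
    pub fn fail_sync(&mut self, epoch: u64) -> Vec<u64> {
        self.poisoned = true;
        self.pending.insert(epoch);
        std::mem::take(&mut self.pending).into_iter().collect()
    }

    /// Called when meta triggers state cleaning. No sync task is still
    /// running at this point, so nothing can race with the clean-up.
    pub fn clear_state(&mut self) {
        assert!(self.pending.is_empty());
        self.poisoned = false;
    }
}

/// Hypothetical explicit states for the meta barrier manager.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BarrierManagerState {
    Running,
    Recovering,
}

/// Hypothetical events that drive the state machine.
pub enum BarrierEvent {
    Collected,
    Failed,
    RecoveryDone,
}

impl BarrierManagerState {
    /// Pure transition function. Late results from barriers injected
    /// before the failure leave the state unchanged.
    pub fn on_event(self, event: BarrierEvent) -> Self {
        match (self, event) {
            (Self::Running, BarrierEvent::Failed) => Self::Recovering,
            (Self::Recovering, BarrierEvent::RecoveryDone) => Self::Running,
            (state, _) => state,
        }
    }
}
```

With this shape, a failure of any epoch drains every pending sync at once, and new syncs are refused until `clear_state` runs, so the state clean never overlaps with a running sync. Keeping the meta transitions in one pure function makes it explicit that barrier results arriving during recovery are ignored.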

Sep 16 '22 07:09 wenym1

The recovery mechanism is being refactored. See https://github.com/risingwavelabs/risingwave/pull/5396#issuecomment-1249395039 . Do we need to wait for that before working on this issue?

Sep 19 '22 04:09 hzxa21

> The recovery mechanism is being refactored. See https://github.com/risingwavelabs/risingwave/pull/5396#issuecomment-1249395039 . Do we need to wait for that before working on this issue?

I think we can start working on the storage side. The meta side can wait for the refactor.

Sep 19 '22 05:09 wenym1

After #5444, do we still need this or is it done?

Nov 15 '22 14:11 hzxa21

It seems that few recovery-related issues are occurring now. We can close this issue.

Nov 16 '22 03:11 wenym1
