Tracking: improve the robustness of recovery

Open wenym1 opened this issue 3 years ago • 2 comments

Our current recovery logic is not robust.

We have multiple concurrent in-flight barriers, and recovery is triggered when any one of them fails. Recovery starts clearing the in-memory state of Hummock without waiting for the other concurrent barriers to finish, so when those barriers later try to sync their data, they panic.

Related issues are:

  • [ ] #5353
  • [ ] #5392

Both the CN and meta should be modified.

  • The CN should fail all running sync tasks when any sync task fails, and then wait for the state cleaning triggered by meta.
  • The meta barrier manager should have a clearer state machine to handle the recovery and normal logic.
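To make the CN-side proposal concrete, here is a minimal sketch of the "fail all running syncs, then wait for state cleaning" behavior. It is illustrative only and assumes nothing about the real code: `SyncTracker`, `SyncState`, and every method name are hypothetical and are not taken from the RisingWave codebase.

```rust
use std::collections::BTreeMap;

/// Hypothetical state of one in-flight sync task on a compute node.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SyncState {
    Running,
    Done,
    Failed(String),
}

/// Hypothetical tracker for the sync tasks of concurrent in-flight barriers,
/// keyed by epoch.
#[derive(Debug, Default)]
struct SyncTracker {
    tasks: BTreeMap<u64, SyncState>,
    /// Set once any sync fails; reset only by `clear_state`.
    awaiting_clear: bool,
}

impl SyncTracker {
    /// Register a sync task. Rejected while waiting for state cleaning.
    fn start_sync(&mut self, epoch: u64) -> Result<(), String> {
        if self.awaiting_clear {
            return Err(format!("epoch {epoch}: waiting for state cleaning"));
        }
        self.tasks.insert(epoch, SyncState::Running);
        Ok(())
    }

    /// Mark a running sync as finished; tasks in any other state are untouched.
    fn finish_sync(&mut self, epoch: u64) {
        if let Some(state) = self.tasks.get_mut(&epoch) {
            if *state == SyncState::Running {
                *state = SyncState::Done;
            }
        }
    }

    /// Record the failure of `epoch` and fail every other task that is still
    /// running, so none of them later touches state that is about to be cleared.
    fn fail_sync(&mut self, epoch: u64, reason: &str) {
        self.awaiting_clear = true;
        for (e, state) in self.tasks.iter_mut() {
            if *e == epoch {
                *state = SyncState::Failed(reason.to_string());
            } else if *state == SyncState::Running {
                *state = SyncState::Failed(format!("aborted: epoch {epoch} failed"));
            }
        }
    }

    /// Called when meta triggers state cleaning during recovery.
    fn clear_state(&mut self) {
        self.tasks.clear();
        self.awaiting_clear = false;
    }

    /// Look up the current state of the sync task for `epoch`, if any.
    fn state(&self, epoch: u64) -> Option<&SyncState> {
        self.tasks.get(&epoch)
    }
}
```

In this sketch, once any sync fails, no new sync can start until `clear_state` runs, which is one way to keep the in-memory state from being cleared underneath a task that is still syncing.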

wenym1, Sep 16 '22 07:09

The recovery mechanism is being refactored; see https://github.com/risingwavelabs/risingwave/pull/5396#issuecomment-1249395039 for details. Do we need to wait for that before working on this issue?

hzxa21, Sep 19 '22 04:09

> The recovery mechanism is being refactored; see https://github.com/risingwavelabs/risingwave/pull/5396#issuecomment-1249395039 for details. Do we need to wait for that before working on this issue?

I think we can start working on the storage side. The meta side can wait for the refactor.

wenym1, Sep 19 '22 05:09

After #5444, do we still need this or is it done?

hzxa21, Nov 15 '22 14:11

It seems that few recovery-related issues are occurring now, so we can close this issue.

wenym1, Nov 16 '22 03:11