Tracking: improve the robustness of recovery
Our current recovery logic is not robust.
We have multiple concurrent in-flight barriers, and recovery is triggered when any one of them fails. Recovery then starts clearing Hummock's in-memory state without waiting for the other concurrent barriers to finish, so when those barriers attempt to sync their data, they panic.
Related issues are:
- [ ] #5353
- [ ] #5392
Both the CN and meta need to be modified.
- The CN should fail all running sync tasks when any sync task fails, and then wait for the state cleaning triggered by meta.
- The meta barrier manager should have a clearer state machine that handles both the recovery logic and the normal logic.
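
To make the CN-side proposal concrete, here is a minimal sketch of the intended behavior. All names in it (`SyncTracker`, `SyncStatus`, `start_sync`, `finish_sync`, `clear_state`) are hypothetical and do not correspond to RisingWave's actual types. It is also synchronous and keeps state in a plain map, which is a simplification of how sync tasks really run.

```rust
use std::collections::BTreeMap;

/// Hypothetical status of one sync task, keyed by epoch.
#[derive(Debug, Clone, PartialEq, Eq)]
enum SyncStatus {
    Running,
    Succeeded,
    Failed(String),
}

/// Hypothetical tracker for the in-flight sync tasks on one compute node.
#[derive(Debug, Default)]
struct SyncTracker {
    tasks: BTreeMap<u64, SyncStatus>,
    /// Set on the first failure; reset only by `clear_state`.
    awaiting_clear: bool,
}

impl SyncTracker {
    /// Registers a new sync task. Rejected while a state clear is pending.
    fn start_sync(&mut self, epoch: u64) -> Result<(), String> {
        if self.awaiting_clear {
            return Err(format!("epoch {}: waiting for state clear", epoch));
        }
        self.tasks.insert(epoch, SyncStatus::Running);
        Ok(())
    }

    /// Records the outcome of a sync task. A failure fails every task that
    /// is still running, so none of them touches state that is about to be
    /// cleared.
    fn finish_sync(&mut self, epoch: u64, result: Result<(), String>) {
        match result {
            Ok(()) => {
                // A late success for an already failed task is ignored.
                if let Some(status) = self.tasks.get_mut(&epoch) {
                    if *status == SyncStatus::Running {
                        *status = SyncStatus::Succeeded;
                    }
                }
            }
            Err(reason) => {
                self.awaiting_clear = true;
                for (e, status) in self.tasks.iter_mut() {
                    if *status == SyncStatus::Running {
                        let msg = if *e == epoch {
                            reason.clone()
                        } else {
                            format!("aborted: epoch {} failed", epoch)
                        };
                        *status = SyncStatus::Failed(msg);
                    }
                }
            }
        }
    }

    /// Called when meta triggers state cleaning during recovery.
    fn clear_state(&mut self) {
        self.tasks.clear();
        self.awaiting_clear = false;
    }

    /// Returns the recorded status of the task for `epoch`, if any.
    fn status(&self, epoch: u64) -> Option<&SyncStatus> {
        self.tasks.get(&epoch)
    }
}
```

For example, if epochs 1, 2 and 3 are syncing and epoch 2 fails, epoch 3 is marked failed as well, and any new `start_sync` call is rejected until `clear_state` runs. The point of the design is that no sync task can race with the state clearing.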
The recovery mechanism is being refactored. See https://github.com/risingwavelabs/risingwave/pull/5396#issuecomment-1249395039 Do we need to wait for that before working on this issue?
I think we can start working on the storage side. The meta side can wait for the refactor.
After #5444, do we still need this or is it done?
It seems that few recovery-related issues are occurring now. We can close this issue.