[IMPROVEMENT] Improve the replica deletion workflow
Is your improvement request related to a feature? Please describe
As mentioned in this comment, the current replica removal workflow is:
- Longhorn manager directly stops the running replica process
- The engine process detects that the replica is unavailable, sets its mode to ERR, then reports this to the Longhorn manager
- Longhorn manager removes the replica record from the engine spec (then status) after receiving the report, and asks the engine process to stop monitoring the replica.
Describe the solution you'd like
The removal workflow for a running replica could be the reverse of the volume attachment flow. It would also be better for Longhorn manager to control every step rather than rely on the engine process's report:
- Longhorn manager asks the engine process to stop tracking/monitoring the running replica process.
- Longhorn manager waits for the tracking to stop, then deletes the replica process
Not sure if this is applicable or makes the whole workflow easier. What do you think? @joshimoo @PhanLe1010 @innobead
With your proposal, we can also avoid some misleading error messages that appear when the engine panics because it cannot connect to its replicas.
Hey team! Please add your planning poker estimate with Zenhub @derekbit @ejweber @PhanLe1010