manticoresearch
Auto-bootstrapping an all-down cluster
Proposal:
For production use, a cluster can greatly improve scalability, but we have come across several unexpected power-off events: the power goes down and comes back a few seconds or more later.
Each time this happens, all nodes go down at once. Without manual intervention, the whole application becomes unusable or behaves unexpectedly.
Even with manual intervention, it's not easy! The solutions in the manual are not much help.
Firstly, by the time we notice the event, the power may already have been restored and the cluster may have split into multiple independent nodes, without clients knowing about it.
Secondly, we need to check grastate.dat to see whether any node has safe_to_bootstrap set, but in most cases none of the nodes has it after a sudden power-off. We then need to compare the seqno values, but this is not safe because of the first reason: unexpected data bugs could occur.
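To illustrate the manual check described above, here is a minimal Python sketch that reads the contents of a grastate.dat file and reports the two fields we look at. The sample text follows the usual Galera saved-state layout; the UUID in it is made up, and the helper names (`parse_grastate`, `is_safe_to_bootstrap`, `seqno`) are hypothetical, not part of Manticore.

```python
from typing import Dict


def parse_grastate(text: str) -> Dict[str, str]:
    """Parse the 'key: value' lines of a grastate.dat file, skipping comments."""
    state: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def is_safe_to_bootstrap(state: Dict[str, str]) -> bool:
    """True only if the file explicitly marks this node as safe to bootstrap."""
    return state.get("safe_to_bootstrap") == "1"


def seqno(state: Dict[str, str]) -> int:
    """Return the saved sequence number, or -1 if it is missing."""
    return int(state.get("seqno", "-1"))


# Illustrative file contents (UUID is invented), as typically seen
# after an unclean shutdown: no safe_to_bootstrap flag, no usable seqno.
sample = """# GALERA saved state
version: 2.1
uuid:    6a1f102a-13a3-11e7-b710-b2876418a643
seqno:   -1
safe_to_bootstrap: 0
"""

state = parse_grastate(sample)
print(is_safe_to_bootstrap(state), seqno(state))
```

With a file like this on every node, neither field tells us which node to bootstrap from, which is exactly the situation we keep running into.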
So a sudden power-off event is a real pain in the ass for the whole application. It's almost impossible for us to have someone standing by to manually check which node has the latest data for a cluster; this has to be done internally by the Manticore nodes themselves, communicating with each other to negotiate a valid bootstrap node.
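As a rough idea of what such a negotiation could decide, here is a Python sketch of the selection rule only. It is not how Manticore works today; `NodeState`, `choose_bootstrap_node`, and the node names are all hypothetical, and it assumes each node can somehow report its cluster UUID and last committed seqno to the others.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass(frozen=True)
class NodeState:
    uuid: str   # cluster state UUID reported by the node
    seqno: int  # last committed sequence number, -1 if unknown


def choose_bootstrap_node(expected: Set[str],
                          reports: Dict[str, NodeState]) -> Optional[str]:
    """Pick the node to bootstrap from, or None if it is not safe to decide."""
    # Wait until every expected node has reported, so that no node
    # bootstraps on its own and forms an independent cluster.
    if set(reports) != expected:
        return None
    # All nodes must agree on the cluster UUID.
    if len({s.uuid for s in reports.values()}) != 1:
        return None
    # Refuse to decide if any node does not know its position.
    if any(s.seqno < 0 for s in reports.values()):
        return None
    # Highest seqno wins; ties are broken by node name so that
    # every node reaches the same answer.
    return min(reports, key=lambda name: (-reports[name].seqno, name))


nodes = {"node-a", "node-b", "node-c"}
reports = {
    "node-a": NodeState("cluster-1", 10),
    "node-b": NodeState("cluster-1", 12),
    "node-c": NodeState("cluster-1", 12),
}
print(choose_bootstrap_node(nodes, reports))
```

The point of returning None in the unsafe cases is that the cluster should stay down and wait for a human rather than guess, which is safer than what can happen today when nodes come back on their own.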
So auto-bootstrapping an all-down cluster is very much needed. With this feature, we would dare to actually use Manticore in a production environment.
Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.
- [ ] Implementation completed
- [ ] Tests developed
- [ ] Documentation updated
- [ ] Documentation reviewed
- [ ] Changelog updated
- [x] OpenAPI YAML updated and issue created to rebuild clients