cluster: prompt for instance restart
What problem does this PR solve?
When restarting TiDB nodes in rapid succession and/or concurrently this can cause disruption for applications due to:
- Cold caches
- Loadbalancers that need to do health checks
- Transaction retries for aborted transactions
What is changed and how it works?
This adds a --restart-timeout option to tiup cluster upgrade.
This then causes tiup to wait after instance restarts for either this timeout or a key press.
This allows the person that does the upgrade to verify the host that was restarted has become healthy before continuing.
This changes the info that is shown at the start to show the concurrency to make this more obvious to users.
Check List
Tests
- Manual test (add detailed scripts or steps below)
Code changes
- Has exported function/method change
Side effects
- Increased code complexity
Related changes
- Need to update the documentation
Release notes:
A `--restart-timeout` option was added to allow more control over restart speed
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign bb7133 for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
With a --restart-timeout 2m:
# ps -C tidb-server -o cmd,stime
CMD STIME
bin/tidb-server -P 4000 --s 12:00
bin/tidb-server -P 4001 --s 12:02
bin/tidb-server -P 4002 --s 12:04
/cc @xhebox @breezewish
Fix CI plz @dveeden
[LGTM Timeline notifier]
Timeline:
New changes are detected. LGTM label has been removed.
/retest