tiup cluster: prompt for instance restart

What problem does this PR solve?

Restarting TiDB nodes in rapid succession or concurrently can disrupt applications because of:

Cold caches
Load balancers that need to do health checks
Transaction retries for aborted transactions

What is changed and how it works?

This adds a `--restart-timeout` option to `tiup cluster upgrade`.

After each instance restart, tiup waits until either this timeout expires or a key is pressed.

This lets the person running the upgrade verify that the restarted host has become healthy before continuing.

The info shown at the start of the upgrade now also includes the concurrency, to make it more obvious to users.

Check List

Tests

Manual test (add detailed scripts or steps below)

Code changes

Has exported function/method change

Side effects

Increased code complexity

Related changes

Need to update the documentation

Release notes:

A `--restart-timeout` option was added to `tiup cluster upgrade` to allow more control over restart speed.

Mar 06 '25 10:03 dveeden

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign bb7133 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

Mar 06 '25 10:03 ti-chi-bot[bot]

With `--restart-timeout 2m`, the instances are restarted two minutes apart (see STIME):

# ps -C tidb-server -o cmd,stime
CMD                         STIME
bin/tidb-server -P 4000 --s 12:00
bin/tidb-server -P 4001 --s 12:02
bin/tidb-server -P 4002 --s 12:04

Mar 06 '25 11:03 dveeden

/cc @xhebox @breezewish

Mar 11 '25 08:03 dveeden

Fix CI plz @dveeden

Apr 28 '25 03:04 xhebox

[LGTM Timeline notifier]

Timeline:

2025-04-28 03:17:34.635732446 +0000 UTC m=+843998.447522819: :ballot_box_with_check: agreed by xhebox.
2025-04-28 05:14:46.116685074 +0000 UTC m=+851029.928475455: :heavy_multiplication_x::repeat: reset by dveeden.

Apr 28 '25 05:04 ti-chi-bot[bot]

New changes are detected. LGTM label has been removed.

Apr 28 '25 05:04 ti-chi-bot[bot]

/retest

Apr 28 '25 05:04 dveeden