cluster: prompt for instance restart

Open dveeden opened this issue 9 months ago • 7 comments

What problem does this PR solve?

Restarting TiDB nodes in rapid succession or concurrently can disrupt applications because of:

  • Cold caches
  • Load balancers that need to run health checks
  • Transaction retries for aborted transactions

What is changed and how it works?

This adds a `--restart-timeout` option to `tiup cluster upgrade`.

After each instance restart, tiup then waits until either this timeout expires or a key is pressed.

This lets the operator performing the upgrade verify that the restarted host has become healthy before continuing.

This also changes the info shown at the start to include the concurrency, so that it is more obvious to users.

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Code changes

  • Has exported function/method change

Side effects

  • Increased code complexity

Related changes

  • Need to update the documentation

Release notes:

A `--restart-timeout` option was added to allow more control over restart speed

dveeden avatar Mar 06 '25 10:03 dveeden

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign bb7133 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

ti-chi-bot[bot] avatar Mar 06 '25 10:03 ti-chi-bot[bot]

With `--restart-timeout 2m`, the tidb-server instances start two minutes apart:

# ps -C tidb-server -o cmd,stime
CMD                         STIME
bin/tidb-server -P 4000 --s 12:00
bin/tidb-server -P 4001 --s 12:02
bin/tidb-server -P 4002 --s 12:04

dveeden avatar Mar 06 '25 11:03 dveeden

/cc @xhebox @breezewish

dveeden avatar Mar 11 '25 08:03 dveeden

Fix CI plz @dveeden

xhebox avatar Apr 28 '25 03:04 xhebox

[LGTM Timeline notifier]

Timeline:

  • 2025-04-28 03:17:34.635732446 +0000 UTC m=+843998.447522819: :ballot_box_with_check: agreed by xhebox.
  • 2025-04-28 05:14:46.116685074 +0000 UTC m=+851029.928475455: :heavy_multiplication_x::repeat: reset by dveeden.

ti-chi-bot[bot] avatar Apr 28 '25 05:04 ti-chi-bot[bot]

New changes are detected. LGTM label has been removed.

ti-chi-bot[bot] avatar Apr 28 '25 05:04 ti-chi-bot[bot]

/retest

dveeden avatar Apr 28 '25 05:04 dveeden