tiflow icon indicating copy to clipboard operation
tiflow copied to clipboard

*(dm): retry statement execution on invalid connections in some scenarios (#12078)

Open cgtz opened this issue 6 months ago • 8 comments

This is a manual cherry-pick of #12078

What problem does this PR solve?

This PR tries to address a case where, when a TiDB machine is restarted, stale connections will live on in the connection pool. When DM attempts to use these connections, it will receive an "Invalid Connection" error, which will cause the task to pause and increase replication lag while waiting to reschedule it.

Issue Number: close #11805

What is changed and how it works?

When we get an invalid connection error on a begin statement, we can safely retry after resetting the connection. The reset connection call can also fail due to similar reasons or due to a load balancer layer refusing the connection, so we can also do a retry with backoff here.

Check List

Tests

Tested on our organization's deployment of DM.

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Retry when the DM syncer receives an invalid connection error while at the beginning of a transaction to avoid full task pauses that increase replication lag

cgtz avatar May 15 '25 16:05 cgtz

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] avatar May 15 '25 16:05 ti-chi-bot[bot]

This cherry pick PR is for a release branch and has not yet been approved by triage owners. Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be approved by the approvers firstly.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] avatar May 15 '25 16:05 ti-chi-bot[bot]

Hi @cgtz. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] avatar May 15 '25 16:05 ti-chi-bot[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign lance6716 for approval. For more information see the Code Review Process. Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot[bot] avatar May 15 '25 16:05 ti-chi-bot[bot]

@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

Code Review

This pull request effectively addresses the issue of stale connections in DM causing task pauses by introducing a retry mechanism for BEGIN statements on "Invalid Connection" errors and for the connection reset operation itself. The changes are well-targeted, and the introduction of a specific error code for BEGIN failures (ErrDBExecuteFailedBegin) improves error clarity. Logging has also been enhanced for better observability during retry attempts.

The core logic for retrying appears sound, and the handling of different retryable scenarios for connection resets (including connection refused) is a good improvement. The updates to test scripts to accommodate potentially longer retry durations are noted and seem reasonable for ensuring test stability with the new resilience features.

Summary of Findings

  • Logging Consistency: In dm/syncer/dbconn/db.go, within the retryableFn function, when logging an error after a failed connection reset attempt (around line 273), log.ShortError(resetErr) is used. Other parts of the PR have updated similar logging to use zap.Error(resetErr) for structured logging. For consistency, this instance could also be updated. This was not commented on directly as it's considered low severity.
  • Log Level for Retry Attempts: In dm/syncer/dbconn/db.go, the resetConnRetryableFn logs at Error level when retrying a connection reset. This might be too noisy for transient, recoverable errors. A Warn level is suggested for better operational clarity (see review comment).

Merge Readiness

The pull request introduces valuable resiliency improvements. The core logic is sound. There is one medium-severity suggestion regarding logging practices in dm/syncer/dbconn/db.go that would enhance operational clarity. Addressing this point would be beneficial before merging.

As an AI, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers before merging.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] avatar May 15 '25 16:05 ti-chi-bot[bot]

@lance6716 @GMHDBJD , here is the new cherry pick PR for DM invalid connection retries. Please take a look. The mac os build seems to be failing because of an outdated github actions/cache version and not related to this change.

cgtz avatar May 15 '25 16:05 cgtz

/cc @OliverS929

lance6716 avatar May 16 '25 01:05 lance6716

@lance6716: GitHub didn't allow me to request PR reviews from the following users: OliverS929.

Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @OliverS929

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ti-chi-bot[bot] avatar May 16 '25 01:05 ti-chi-bot[bot]