tiflow
tiflow copied to clipboard
*(dm): retry statement execution on invalid connections in some scenarios (#12078)
This is a manual cherry-pick of #12078
What problem does this PR solve?
This PR tries to address a case where, when a TiDB machine is restarted, stale connections will live on in the connection pool. When DM attempts to use these connections, it will receive an "Invalid Connection" error, which will cause the task to pause and increase replication lag while waiting to reschedule it.
Issue Number: close #11805
What is changed and how it works?
When we get an invalid connection error on a begin statement, we can safely retry after resetting the connection. The reset connection call can also fail due to similar reasons or due to a load balancer layer refusing the connection, so we can also do a retry with backoff here.
Check List
Tests
Tested on our organization's deployment of DM.
- Unit test
- Integration test
- Manual test (add detailed scripts or steps below)
- No code
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note
Retry when the DM syncer receives an invalid connection error while at the beginning of a transaction to avoid full task pauses that increase replication lag
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.
To merge this cherry pick:
- It must be approved by the approvers firstly.
- AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi @cgtz. Thanks for your PR.
I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign lance6716 for approval. For more information see the Code Review Process. Please ensure that each of them provides their approval before proceeding.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@gemini-code-assist[bot]: adding LGTM is restricted to approvers and reviewers in OWNERS files.
In response to this:
Code Review
This pull request effectively addresses the issue of stale connections in DM causing task pauses by introducing a retry mechanism for
BEGINstatements on "Invalid Connection" errors and for the connection reset operation itself. The changes are well-targeted, and the introduction of a specific error code forBEGINfailures (ErrDBExecuteFailedBegin) improves error clarity. Logging has also been enhanced for better observability during retry attempts.The core logic for retrying appears sound, and the handling of different retryable scenarios for connection resets (including
connection refused) is a good improvement. The updates to test scripts to accommodate potentially longer retry durations are noted and seem reasonable for ensuring test stability with the new resilience features.Summary of Findings
- Logging Consistency: In
dm/syncer/dbconn/db.go, within theretryableFnfunction, when logging an error after a failed connection reset attempt (around line 273),log.ShortError(resetErr)is used. Other parts of the PR have updated similar logging to usezap.Error(resetErr)for structured logging. For consistency, this instance could also be updated. This was not commented on directly as it's considered low severity.- Log Level for Retry Attempts: In
dm/syncer/dbconn/db.go, theresetConnRetryableFnlogs atErrorlevel when retrying a connection reset. This might be too noisy for transient, recoverable errors. AWarnlevel is suggested for better operational clarity (see review comment).Merge Readiness
The pull request introduces valuable resiliency improvements. The core logic is sound. There is one medium-severity suggestion regarding logging practices in
dm/syncer/dbconn/db.gothat would enhance operational clarity. Addressing this point would be beneficial before merging.As an AI, I am not authorized to approve pull requests. Please ensure further review and approval from authorized maintainers before merging.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@lance6716 @GMHDBJD , here is the new cherry pick PR for DM invalid connection retries. Please take a look. The mac os build seems to be failing because of an outdated github actions/cache version and not related to this change.
/cc @OliverS929
@lance6716: GitHub didn't allow me to request PR reviews from the following users: OliverS929.
Note that only pingcap members and repo collaborators can review this PR, and authors cannot review their own PRs.
In response to this:
/cc @OliverS929
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.