Lightning: add retry if transaction failed while fetching task metas
What problem does this PR solve?
Issue Number: #53042
Problem Summary: The transaction may encounter an error message stating "Error 1205: Lock wait timeout exceeded; try restarting transaction". This error can cause the import job to fail. The import job does not always start from the checkpoint, as it depends on the stage at which the failure occurred. If the transaction fails after the import/ingestion is completed, the job will not start from the checkpoint. As a result, the entire bulk load job needs to be restarted, which is a costly operation and can potentially violate our SLOs.
What changed and how does it work?
Add retry with back off if transaction fails. It is going to retry up to 5 times with maxbackoff of 30 seconds and baseline backoff of 1 second.
Check List
Tests
- [ ] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test
- [ ] I checked and no code files have been changed.
Side effects
- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility
Documentation
- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
None
[FORMAT CHECKER NOTIFICATION]
Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.
:open_book: For more info, you can check the "Contribute Code" section in the development guide.
Notice: To remove the do-not-merge/needs-tests-checked label, please finished the tests then check the finished items in description.
For example:
Tests
- [x] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No code
:open_book: For more info, you can check the "Contribute Code" section in the development guide.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign okjiang for approval. For more information see the Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Hi @mittalrishabh. Thanks for your PR.
I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/ok-to-test
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 73.1546%. Comparing base (
b421b72) to head (1023dcf). Report is 982 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #53043 +/- ##
================================================
+ Coverage 72.0329% 73.1546% +1.1216%
================================================
Files 1499 1508 +9
Lines 431222 437740 +6518
================================================
+ Hits 310622 320227 +9605
+ Misses 101291 97388 -3903
- Partials 19309 20125 +816
| Flag | Coverage Δ | |
|---|---|---|
| integration | 29.0494% <ø> (?) |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Components | Coverage Δ | |
|---|---|---|
| dumpling | 53.9957% <ø> (ø) |
|
| parser | ∅ <ø> (∅) |
|
| br | 46.3671% <ø> (+8.0212%) |
:arrow_up: |
/test build
Hi, I check the issue @mittalrishabh is reporting, I think we already have a retry mechanism here, all we need to do is to add the ErrLockWaitTimeout to here so the error can be retried. What do you think @lance6716
Hi, I check the issue @mittalrishabh is reporting, I think we already have a retry mechanism here, all we need to do is to add the
ErrLockWaitTimeoutto here so the error can be retried. What do you think @lance6716
Yes I think make ErrLockWaitTimeout a retryable error is easier.
@mittalrishabh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| idc-jenkins-ci-tidb/check_dev | 1023dcf4f929777d1c8f5cad71fffe8c67f27731 | link | true | /test check-dev |
| idc-jenkins-ci-tidb/unit-test | 1023dcf4f929777d1c8f5cad71fffe8c67f27731 | link | true | /test unit-test |
| idc-jenkins-ci-tidb/build | 1023dcf4f929777d1c8f5cad71fffe8c67f27731 | link | true | /test build |
| pull-lightning-integration-test | 1023dcf4f929777d1c8f5cad71fffe8c67f27731 | link | true | /test pull-lightning-integration-test |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
replaced by https://github.com/pingcap/tidb/pull/55670