tidb icon indicating copy to clipboard operation
tidb copied to clipboard

Lightning: add retry if transaction failed while fetching task metas

Open mittalrishabh opened this issue 1 year ago • 9 comments

What problem does this PR solve?

Issue Number: #53042

Problem Summary: The transaction may encounter an error message stating "Error 1205: Lock wait timeout exceeded; try restarting transaction". This error can cause the import job to fail. The import job does not always start from the checkpoint, as it depends on the stage at which the failure occurred. If the transaction fails after the import/ingestion is completed, the job will not start from the checkpoint. As a result, the entire bulk load job needs to be restarted, which is a costly operation and can potentially violate our SLOs.

What changed and how does it work?

Add retry with back off if transaction fails. It is going to retry up to 5 times with maxbackoff of 30 seconds and baseline backoff of 1 second.

Check List

Tests

  • [ ] Unit test
  • [ ] Integration test
  • [ ] Manual test (add detailed scripts or steps below)
  • [ ] No need to test
    • [ ] I checked and no code files have been changed.

Side effects

  • [ ] Performance regression: Consumes more CPU
  • [ ] Performance regression: Consumes more Memory
  • [ ] Breaking backward compatibility

Documentation

  • [ ] Affects user behaviors
  • [ ] Contains syntax changes
  • [ ] Contains variable changes
  • [ ] Contains experimental features
  • [ ] Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

mittalrishabh avatar May 06 '24 23:05 mittalrishabh

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

:open_book: For more info, you can check the "Contribute Code" section in the development guide.


Notice: To remove the do-not-merge/needs-tests-checked label, please finished the tests then check the finished items in description.

For example:

Tests

  • [x] Unit test
  • [ ] Integration test
  • [ ] Manual test (add detailed scripts or steps below)
  • [ ] No code

:open_book: For more info, you can check the "Contribute Code" section in the development guide.

ti-chi-bot[bot] avatar May 06 '24 23:05 ti-chi-bot[bot]

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign okjiang for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot[bot] avatar May 06 '24 23:05 ti-chi-bot[bot]

Hi @mittalrishabh. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ti-chi-bot[bot] avatar May 06 '24 23:05 ti-chi-bot[bot]

/ok-to-test

lance6716 avatar May 07 '24 07:05 lance6716

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 73.1546%. Comparing base (b421b72) to head (1023dcf). Report is 982 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #53043        +/-   ##
================================================
+ Coverage   72.0329%   73.1546%   +1.1216%     
================================================
  Files          1499       1508         +9     
  Lines        431222     437740      +6518     
================================================
+ Hits         310622     320227      +9605     
+ Misses       101291      97388      -3903     
- Partials      19309      20125       +816     
Flag Coverage Δ
integration 29.0494% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.9957% <ø> (ø)
parser ∅ <ø> (∅)
br 46.3671% <ø> (+8.0212%) :arrow_up:

codecov[bot] avatar May 07 '24 07:05 codecov[bot]

/test build

lance6716 avatar May 08 '24 01:05 lance6716

Hi, I check the issue @mittalrishabh is reporting, I think we already have a retry mechanism here, all we need to do is to add the ErrLockWaitTimeout to here so the error can be retried. What do you think @lance6716

guoshouyan avatar Aug 27 '24 17:08 guoshouyan

Hi, I check the issue @mittalrishabh is reporting, I think we already have a retry mechanism here, all we need to do is to add the ErrLockWaitTimeout to here so the error can be retried. What do you think @lance6716

Yes I think make ErrLockWaitTimeout a retryable error is easier.

lance6716 avatar Aug 28 '24 03:08 lance6716

@mittalrishabh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
idc-jenkins-ci-tidb/check_dev 1023dcf4f929777d1c8f5cad71fffe8c67f27731 link true /test check-dev
idc-jenkins-ci-tidb/unit-test 1023dcf4f929777d1c8f5cad71fffe8c67f27731 link true /test unit-test
idc-jenkins-ci-tidb/build 1023dcf4f929777d1c8f5cad71fffe8c67f27731 link true /test build
pull-lightning-integration-test 1023dcf4f929777d1c8f5cad71fffe8c67f27731 link true /test pull-lightning-integration-test

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

ti-chi-bot[bot] avatar Sep 04 '24 05:09 ti-chi-bot[bot]

replaced by https://github.com/pingcap/tidb/pull/55670

lance6716 avatar Sep 13 '24 05:09 lance6716