util/dbterror: refine reorg retryable errors
What problem does this PR solve?
Issue Number: close #52805
Problem Summary:
[2024/04/20 07:11:41.700 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570130333532FF35373637FF383237FF2D36393932FF3132FF33383733362DFF32FF30393239363931FFFF3835392D30393837FFFF35343532313237FF2DFF373432313939FF3330FF3138332D33FF363238FF33323834FF3839362DFF393439FF3433363535FF3439FF312D38393330FF37FF3633373839382DFFFF3133313031393538FFFF3631302D313337FF37FF333239333633FF3200FE0380000000FF024DDEEB00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570130333532FF35373637FF383237FF2D36393932FF3132FF33383733362DFF32FF30393239363931FFFF3835392D30393837FFFF35343532313237FF2DFF373432313939FF3330FF3138332D33FF363238FF33323834FF3839362DFF393439FF3433363535FF3439FF312D38393330FF37FF3633373839382DFFFF3133313031393538FFFF3631302D313337FF37FF333239333633FF3200FE0380000000FF024DDEEB00000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570130343637FF32353932FF313937FF2D39393039FF3639FF33323933372DFF30FF31323038373634FFFF3038392D37303539FFFF32373837343433FF2DFF343339303438FF3233FF3033372D36FF393739FF30393934FF3733382DFF303932FF3732353530FF3531FF352D32313035FF31FF3738353837352DFFFF3633303237353838FFFF3137352D373631FF35FF353832333132FF3800FE0380000000FF03C80D0600000000FC]
[2024/04/20 07:11:41.700 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570130393736FF34323134FF383637FF2D30303139FF3639FF39343932302DFF33FF37313032303831FFFF3936332D36353937FFFF35333033323136FF2DFF333933383837FF3438FF3232352D36FF353534FF33333235FF3932392DFF393430FF3136303632FF3036FF342D39363136FF33FF3439343637302DFFFF3734353331303131FFFF3631332D313638FF31FF333138333532FF3700FE0380000000FF036789D400000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570130393736FF34323134FF383637FF2D30303139FF3639FF39343932302DFF33FF37313032303831FFFF3936332D36353937FFFF35333033323136FF2DFF333933383837FF3438FF3232352D36FF353534FF33333235FF3932392DFF393430FF3136303632FF3036FF342D39363136FF33FF3439343637302DFFFF3734353331303131FFFF3631332D313638FF31FF333138333532FF3700FE0380000000FF036789D400000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570131303534FF39343737FF353933FF2D33393439FF3235FF30323932362DFF33FF32313338363939FFFF3237312D30313037FFFF30353533393733FF2DFF333338303433FF3733FF3437362D31FF323032FF37343838FF3430362DFF363333FF3730313639FF3235FF382D38393834FF30FF3130393637342DFFFF3236373632353835FFFF3633302D343939FF35FF353332393339FF3600FE0380000000FF00CF6E1E00000000FC]
[2024/04/20 07:11:41.703 +08:00] [WARN] [local.go:1374] ["meet retryable error when writing to TiKV"] [error="peer 153040, store 4, region 153039, epoch conf_ver:287 version:10326 : EOF"] ["job stage"=regionScanned]
[2024/04/20 07:11:41.703 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570131393530FF33323235FF363437FF2D31303439FF3636FF35353334342DFF30FF30373731363936FFFF3039332D36303334FFFF36363739343034FF2DFF343938383639FF3531FF3838322D39FF303336FF30313537FF3630362DFF353434FF3036343438FF3437FF392D31333631FF39FF3736343336342DFFFF3738313037373134FFFF3435352D303633FF32FF383831313131FF3500FE0380000000FF02A8A03F00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570131393530FF33323235FF363437FF2D31303439FF3636FF35353334342DFF30FF30373731363936FFFF3039332D36303334FFFF36363739343034FF2DFF343938383639FF3531FF3838322D39FF303336FF30313537FF3630362DFF353434FF3036343438FF3437FF392D31333631FF39FF3736343336342DFFFF3738313037373134FFFF3435352D303633FF32FF383831313131FF3500FE0380000000FF02A8A03F00000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570132303331FF31323235FF333636FF2D30393236FF3630FF35383931322DFF30FF38353537333733FFFF3832382D39303835FFFF30363934343435FF2DFF353835383434FF3831FF3933352D30FF393534FF31313136FF3332362DFF333035FF3435303633FF3733FF332D39323236FF32FF3835313138352DFFFF3334373739303432FFFF3630352D303134FF39FF333634353038FF3500FE0380000000FF0144230600000000FC]
[2024/04/20 07:11:41.717 +08:00] [WARN] [local.go:1374] ["meet retryable error when writing to TiKV"] [error="peer 153068, store 4, region 153067, epoch conf_ver:287 version:10326 : EOF"] ["job stage"=regionScanned]
[2024/04/20 07:11:41.718 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132353430FF32383938FF363733FF2D36363436FF3433FF31353039392DFF31FF32303035383131FFFF3630342D35373038FFFF33373335353035FF2DFF313134313633FF3932FF3436302D33FF343535FF32383638FF3433352DFF393630FF3434313136FF3430FF312D39393930FF32FF3636343030342DFFFF3338313139333330FFFF3633342D323439FF39FF363334343938FF3900FE0380000000FF0219229700000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570132353430FF32383938FF363733FF2D36363436FF3433FF31353039392DFF31FF32303035383131FFFF3630342D35373038FFFF33373335353035FF2DFF313134313633FF3932FF3436302D33FF343535FF32383638FF3433352DFF393630FF3434313136FF3430FF312D39393930FF32FF3636343030342DFFFF3338313139333330FFFF3633342D323439FF39FF363334343938FF3900FE0380000000FF0219229700000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570132363139FF34373135FF343633FF2D39323036FF3737FF31343432302DFF32FF32373031383138FFFF3336372D38383937FFFF31313739343335FF2DFF303239363030FF3434FF3432352D38FF353835FF31303432FF3930352DFF343937FF3935383336FF3331FF302D34303332FF33FF3133323430382DFFFF3838393737373432FFFF3432362D313035FF33FF303831303132FF3800FE0380000000FF01BE445E00000000FC]
[2024/04/20 07:11:41.720 +08:00] [WARN] [local.go:1374] ["meet retryable error when writing to TiKV"] [error="peer 152960, store 4, region 152959, epoch conf_ver:293 version:10326 : EOF"] ["job stage"=regionScanned]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570130313935FF32363039FF323336FF2D31313032FF3639FF38333938372DFF38FF36303133373338FFFF3334372D32313235FFFF39383034383137FF2DFF383033323137FF3336FF3436322D36FF393032FF34333930FF3433332DFF303633FF3934353431FF3430FF382D38393435FF38FF3231323538332DFFFF3637383137373430FFFF3034302D333134FF34FF363037343834FF3200FE0380000000FF01006B1D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570130313935FF32363039FF323336FF2D31313032FF3639FF38333938372DFF38FF36303133373338FFFF3334372D32313235FFFF39383034383137FF2DFF383033323137FF3336FF3436322D36FF393032FF34333930FF3433332DFF303633FF3934353431FF3430FF382D38393435FF38FF3231323538332DFFFF3637383137373430FFFF3034302D333134FF34FF363037343834FF3200FE0380000000FF01006B1D00000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570130323733FF36343633FF343336FF2D30333539FF3437FF39363635332DFF37FF33303237373738FFFF3437342D38323531FFFF36373138393933FF2DFF323834383636FF3932FF3235382D31FF343437FF33313835FF3435302DFF323532FF3537373335FF3537FF342D32343437FF36FF3633363638372DFFFF3131363938313230FFFF3139392D393434FF30FF353039303832FF3600FE0380000000FF03F6F9AA00000000FC]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [local.go:1582] ["do import meets error"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [backend.go:354] ["import failed"] [engineTag=sbtest1:87] [engineUUID=e267486f-a714-5042-9074-57ca82545a76] [retryCnt=0] [takeTime=1.354089568s] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [engine.go:143] ["[ddl-ingest] ingest data into storage error"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] ["job ID"=586] ["index ID"=87]
[2024/04/20 07:11:41.721 +08:00] [WARN] [index.go:940] ["[ddl] lightning import error"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"]
[2024/04/20 07:11:41.721 +08:00] [WARN] [terror.go:242] ["Unknown error class"] [class=BR]
What changed and how does it work?
Make "ErrPDBatchScanRegion" retryable.
Check List
Tests
- [ ] Unit test
- [ ] Integration test
- [x] Manual test (add detailed scripts or steps below)
- [ ] No need to test
- [ ] I checked and no code files have been changed.
Side effects
- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility
Documentation
- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.
None
Hi @tangenta. Thanks for your PR.
PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Codecov Report
:x: Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 74.2885%. Comparing base (f304f04) to head (28c1d2d).
:warning: Report is 3396 commits behind head on master.
Additional details and impacted files
@@ Coverage Diff @@
## master #52808 +/- ##
================================================
+ Coverage 72.4716% 74.2885% +1.8168%
================================================
Files 1491 1533 +42
Lines 429012 450600 +21588
================================================
+ Hits 310912 334744 +23832
+ Misses 98878 94936 -3942
- Partials 19222 20920 +1698
| Flag | Coverage Δ | |
|---|---|---|
| integration | 50.4724% <88.0000%> (?) | |
| unit | 72.0235% <68.0000%> (+0.0512%) | :arrow_up: |
Flags with carried forward coverage won't be shown. Click here to find out more.
| Components | Coverage Δ | |
|---|---|---|
| dumpling | 78.1914% <ø> (∅) | |
| parser | ∅ <ø> (∅) | |
| br | 47.5259% <ø> (+5.4104%) | :arrow_up: |
/retest
@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/retest
[LGTM Timeline notifier]
Timeline:
- 2024-04-23 06:29:40.468212715 +0000 UTC m=+68937.208115625: :ballot_box_with_check: agreed by ywqzzy.
- 2024-04-23 06:43:04.454488589 +0000 UTC m=+69741.194391501: :ballot_box_with_check: agreed by wjhuang2016.
/retest
@Benjamin2037: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
In response to this:
/retest
/hold
I think the root cause is that DDL canceled the context for the subtask but didn't cancel the context used to save the subtask's status. The two should fail at the same time.
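One way to read "fail at the same time", sketched here under assumptions: if both contexts are derived from one shared parent, canceling the parent cancels both. The names `subtaskCtx` and `statusCtx` are hypothetical and are not taken from the DDL code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// contextsCanceledTogether derives two hypothetical contexts from a shared
// parent, cancels the parent, and reports whether both observed cancellation.
func contextsCanceledTogether() bool {
	parent, cancel := context.WithCancel(context.Background())

	// Hypothetical stand-ins for the "run subtask" and "save status" contexts.
	subtaskCtx, cancelSubtask := context.WithCancel(parent)
	defer cancelSubtask()
	statusCtx, cancelStatus := context.WithCancel(parent)
	defer cancelStatus()

	// Canceling the shared parent propagates to every derived context.
	cancel()
	<-subtaskCtx.Done()
	<-statusCtx.Done()

	return errors.Is(subtaskCtx.Err(), context.Canceled) &&
		errors.Is(statusCtx.Err(), context.Canceled)
}

func main() {
	fmt.Println(contextsCanceledTogether())
}
```

Whether the status-saving path should share the subtask's cancellation is a design decision for the DDL owners; the sketch only shows the mechanism.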
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Benjamin2037, wjhuang2016, ywqzzy
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [Benjamin2037,wjhuang2016,ywqzzy]
- ~~pkg/ddl/OWNERS~~ [Benjamin2037,wjhuang2016,ywqzzy]
- ~~pkg/lightning/OWNERS~~ [Benjamin2037]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
PR needs rebase.
@tangenta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-br-integration-test | 28c1d2d0123cb9eb62733fd700753b7c281d59e9 | link | true | /test pull-br-integration-test |
| pull-lightning-integration-test | 28c1d2d0123cb9eb62733fd700753b7c281d59e9 | link | true | /test pull-lightning-integration-test |
| pull-unit-test-ddlv1 | 28c1d2d0123cb9eb62733fd700753b7c281d59e9 | link | true | /test pull-unit-test-ddlv1 |
| pull-integration-e2e-test | 28c1d2d0123cb9eb62733fd700753b7c281d59e9 | link | true | /test pull-integration-e2e-test |
| pull-integration-realcluster-test-next-gen | 28c1d2d0123cb9eb62733fd700753b7c281d59e9 | link | true | /test pull-integration-realcluster-test-next-gen |
| pull-unit-test-next-gen | 28c1d2d0123cb9eb62733fd700753b7c281d59e9 | link | true | /test pull-unit-test-next-gen |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.