util/dbterror: refine reorg retryable errors

Open tangenta opened this issue 1 year ago • 11 comments

What problem does this PR solve?

Issue Number: close #52805

Problem Summary:

[2024/04/20 07:11:41.700 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570130333532FF35373637FF383237FF2D36393932FF3132FF33383733362DFF32FF30393239363931FFFF3835392D30393837FFFF35343532313237FF2DFF373432313939FF3330FF3138332D33FF363238FF33323834FF3839362DFF393439FF3433363535FF3439FF312D38393330FF37FF3633373839382DFFFF3133313031393538FFFF3631302D313337FF37FF333239333633FF3200FE0380000000FF024DDEEB00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570130333532FF35373637FF383237FF2D36393932FF3132FF33383733362DFF32FF30393239363931FFFF3835392D30393837FFFF35343532313237FF2DFF373432313939FF3330FF3138332D33FF363238FF33323834FF3839362DFF393439FF3433363535FF3439FF312D38393330FF37FF3633373839382DFFFF3133313031393538FFFF3631302D313337FF37FF333239333633FF3200FE0380000000FF024DDEEB00000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570130343637FF32353932FF313937FF2D39393039FF3639FF33323933372DFF30FF31323038373634FFFF3038392D37303539FFFF32373837343433FF2DFF343339303438FF3233FF3033372D36FF393739FF30393934FF3733382DFF303932FF3732353530FF3531FF352D32313035FF31FF3738353837352DFFFF3633303237353838FFFF3137352D373631FF35FF353832333132FF3800FE0380000000FF03C80D0600000000FC]
[2024/04/20 07:11:41.700 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570130393736FF34323134FF383637FF2D30303139FF3639FF39343932302DFF33FF37313032303831FFFF3936332D36353937FFFF35333033323136FF2DFF333933383837FF3438FF3232352D36FF353534FF33333235FF3932392DFF393430FF3136303632FF3036FF342D39363136FF33FF3439343637302DFFFF3734353331303131FFFF3631332D313638FF31FF333138333532FF3700FE0380000000FF036789D400000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570130393736FF34323134FF383637FF2D30303139FF3639FF39343932302DFF33FF37313032303831FFFF3936332D36353937FFFF35333033323136FF2DFF333933383837FF3438FF3232352D36FF353534FF33333235FF3932392DFF393430FF3136303632FF3036FF342D39363136FF33FF3439343637302DFFFF3734353331303131FFFF3631332D313638FF31FF333138333532FF3700FE0380000000FF036789D400000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570131303534FF39343737FF353933FF2D33393439FF3235FF30323932362DFF33FF32313338363939FFFF3237312D30313037FFFF30353533393733FF2DFF333338303433FF3733FF3437362D31FF323032FF37343838FF3430362DFF363333FF3730313639FF3235FF382D38393834FF30FF3130393637342DFFFF3236373632353835FFFF3633302D343939FF35FF353332393339FF3600FE0380000000FF00CF6E1E00000000FC]
[2024/04/20 07:11:41.703 +08:00] [WARN] [local.go:1374] ["meet retryable error when writing to TiKV"] [error="peer 153040, store 4, region 153039, epoch conf_ver:287 version:10326 : EOF"] ["job stage"=regionScanned]
[2024/04/20 07:11:41.703 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570131393530FF33323235FF363437FF2D31303439FF3636FF35353334342DFF30FF30373731363936FFFF3039332D36303334FFFF36363739343034FF2DFF343938383639FF3531FF3838322D39FF303336FF30313537FF3630362DFF353434FF3036343438FF3437FF392D31333631FF39FF3736343336342DFFFF3738313037373134FFFF3435352D303633FF32FF383831313131FF3500FE0380000000FF02A8A03F00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570131393530FF33323235FF363437FF2D31303439FF3636FF35353334342DFF30FF30373731363936FFFF3039332D36303334FFFF36363739343034FF2DFF343938383639FF3531FF3838322D39FF303336FF30313537FF3630362DFF353434FF3036343438FF3437FF392D31333631FF39FF3736343336342DFFFF3738313037373134FFFF3435352D303633FF32FF383831313131FF3500FE0380000000FF02A8A03F00000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570132303331FF31323235FF333636FF2D30393236FF3630FF35383931322DFF30FF38353537333733FFFF3832382D39303835FFFF30363934343435FF2DFF353835383434FF3831FF3933352D30FF393534FF31313136FF3332362DFF333035FF3435303633FF3733FF332D39323236FF32FF3835313138352DFFFF3334373739303432FFFF3630352D303134FF39FF333634353038FF3500FE0380000000FF0144230600000000FC]
[2024/04/20 07:11:41.717 +08:00] [WARN] [local.go:1374] ["meet retryable error when writing to TiKV"] [error="peer 153068, store 4, region 153067, epoch conf_ver:287 version:10326 : EOF"] ["job stage"=regionScanned]
[2024/04/20 07:11:41.718 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132353430FF32383938FF363733FF2D36363436FF3433FF31353039392DFF31FF32303035383131FFFF3630342D35373038FFFF33373335353035FF2DFF313134313633FF3932FF3436302D33FF343535FF32383638FF3433352DFF393630FF3434313136FF3430FF312D39393930FF32FF3636343030342DFFFF3338313139333330FFFF3633342D323439FF39FF363334343938FF3900FE0380000000FF0219229700000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570132353430FF32383938FF363733FF2D36363436FF3433FF31353039392DFF31FF32303035383131FFFF3630342D35373038FFFF33373335353035FF2DFF313134313633FF3932FF3436302D33FF343535FF32383638FF3433352DFF393630FF3434313136FF3430FF312D39393930FF32FF3636343030342DFFFF3338313139333330FFFF3633342D323439FF39FF363334343938FF3900FE0380000000FF0219229700000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570132363139FF34373135FF343633FF2D39323036FF3737FF31343432302DFF32FF32373031383138FFFF3336372D38383937FFFF31313739343335FF2DFF303239363030FF3434FF3432352D38FF353835FF31303432FF3930352DFF343937FF3935383336FF3331FF302D34303332FF33FF3133323430382DFFFF3838393737373432FFFF3432362D313035FF33FF303831303132FF3800FE0380000000FF01BE445E00000000FC]
[2024/04/20 07:11:41.720 +08:00] [WARN] [local.go:1374] ["meet retryable error when writing to TiKV"] [error="peer 152960, store 4, region 152959, epoch conf_ver:293 version:10326 : EOF"] ["job stage"=regionScanned]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [local.go:1231] ["scan region failed"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570130313935FF32363039FF323336FF2D31313032FF3639FF38333938372DFF38FF36303133373338FFFF3334372D32313235FFFF39383034383137FF2DFF383033323137FF3336FF3436322D36FF393032FF34333930FF3433332DFF303633FF3934353431FF3430FF382D38393435FF38FF3231323538332DFFFF3637383137373430FFFF3034302D333134FF34FF363037343834FF3200FE0380000000FF01006B1D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] [region_len=0] [startKey=7480000000000000FFF55F698000000000FF0000570130313935FF32363039FF323336FF2D31313032FF3639FF38333938372DFF38FF36303133373338FFFF3334372D32313235FFFF39383034383137FF2DFF383033323137FF3336FF3436322D36FF393032FF34333930FF3433332DFF303633FF3934353431FF3430FF382D38393435FF38FF3231323538332DFFFF3637383137373430FFFF3034302D333134FF34FF363037343834FF3200FE0380000000FF01006B1D00000000FB] [endKey=7480000000000000FFF55F698000000000FF0000570130323733FF36343633FF343336FF2D30333539FF3437FF39363635332DFF37FF33303237373738FFFF3437342D38323531FFFF36373138393933FF2DFF323834383636FF3932FF3235382D31FF343437FF33313835FF3435302DFF323532FF3537373335FF3537FF342D32343437FF36FF3633363638372DFFFF3131363938313230FFFF3139392D393434FF30FF353039303832FF3600FE0380000000FF03F6F9AA00000000FC]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [local.go:1582] ["do import meets error"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [backend.go:354] ["import failed"] [engineTag=sbtest1:87] [engineUUID=e267486f-a714-5042-9074-57ca82545a76] [retryCnt=0] [takeTime=1.354089568s] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"]
[2024/04/20 07:11:41.721 +08:00] [ERROR] [engine.go:143] ["[ddl-ingest] ingest data into storage error"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"] ["job ID"=586] ["index ID"=87]
[2024/04/20 07:11:41.721 +08:00] [WARN] [index.go:940] ["[ddl] lightning import error"] [error="scan regions from start-key:7480000000000000FFF55F698000000000FF0000570132313039FF37353737FF363839FF2D35373339FF3233FF35313334312DFF30FF37383038313338FFFF3733332D30363238FFFF34323835373036FF2DFF383830383432FF3436FF3238332D39FF343331FF33363332FF3330302DFF313831FF3635313430FF3530FF322D33323537FF35FF3730303530382DFFFF3639343936353336FFFF3837372D323231FF33FF393139393835FF3300FE0380000000FF0214DF5D00000000FB, err: rpc error: code = Canceled desc = context canceled: [BR:PD:ErrPDBatchScanRegion]batch scan region"]
[2024/04/20 07:11:41.721 +08:00] [WARN] [terror.go:242] ["Unknown error class"] [class=BR]

What changed and how does it work?

Make "ErrPDBatchScanRegion" retryable.

Check List

Tests

  • [ ] Unit test
  • [ ] Integration test
  • [x] Manual test (add detailed scripts or steps below)
  • [ ] No need to test
    • [ ] I checked and no code files have been changed.

Side effects

  • [ ] Performance regression: Consumes more CPU
  • [ ] Performance regression: Consumes more Memory
  • [ ] Breaking backward compatibility

Documentation

  • [ ] Affects user behaviors
  • [ ] Contains syntax changes
  • [ ] Contains variable changes
  • [ ] Contains experimental features
  • [ ] Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

tangenta avatar Apr 22 '24 08:04 tangenta

Hi @tangenta. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

tiprow[bot] avatar Apr 22 '24 08:04 tiprow[bot]

Codecov Report

:x: Patch coverage is 96.00000% with 1 line in your changes missing coverage. Please review. :white_check_mark: Project coverage is 74.2885%. Comparing base (f304f04) to head (28c1d2d). :warning: Report is 3396 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #52808        +/-   ##
================================================
+ Coverage   72.4716%   74.2885%   +1.8168%     
================================================
  Files          1491       1533        +42     
  Lines        429012     450600     +21588     
================================================
+ Hits         310912     334744     +23832     
+ Misses        98878      94936      -3942     
- Partials      19222      20920      +1698     
Flag Coverage Δ
integration 50.4724% <88.0000%> (?)
unit 72.0235% <68.0000%> (+0.0512%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 78.1914% <ø> (∅)
parser ∅ <ø> (∅)
br 47.5259% <ø> (+5.4104%) :arrow_up:

codecov[bot] avatar Apr 22 '24 09:04 codecov[bot]

/retest

tangenta avatar Apr 23 '24 06:04 tangenta

@tangenta: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

tiprow[bot] avatar Apr 23 '24 06:04 tiprow[bot]

[LGTM Timeline notifier]

Timeline:

  • 2024-04-23 06:29:40.468212715 +0000 UTC m=+68937.208115625: :ballot_box_with_check: agreed by ywqzzy.
  • 2024-04-23 06:43:04.454488589 +0000 UTC m=+69741.194391501: :ballot_box_with_check: agreed by wjhuang2016.

ti-chi-bot[bot] avatar Apr 23 '24 06:04 ti-chi-bot[bot]

/retest

Benjamin2037 avatar Apr 25 '24 14:04 Benjamin2037

@Benjamin2037: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

tiprow[bot] avatar Apr 25 '24 14:04 tiprow[bot]

/hold

I think the root cause is that DDL canceled the context for the subtask but didn't cancel the context used to save the subtask's status. They should fail at the same time.

lance6716 avatar May 09 '24 07:05 lance6716

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Benjamin2037, wjhuang2016, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

ti-chi-bot[bot] avatar May 11 '24 02:05 ti-chi-bot[bot]

PR needs rebase.

ti-chi-bot[bot] avatar Jul 05 '24 22:07 ti-chi-bot[bot]

@tangenta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-br-integration-test 28c1d2d0123cb9eb62733fd700753b7c281d59e9 link true /test pull-br-integration-test
pull-lightning-integration-test 28c1d2d0123cb9eb62733fd700753b7c281d59e9 link true /test pull-lightning-integration-test
pull-unit-test-ddlv1 28c1d2d0123cb9eb62733fd700753b7c281d59e9 link true /test pull-unit-test-ddlv1
pull-integration-e2e-test 28c1d2d0123cb9eb62733fd700753b7c281d59e9 link true /test pull-integration-e2e-test
pull-integration-realcluster-test-next-gen 28c1d2d0123cb9eb62733fd700753b7c281d59e9 link true /test pull-integration-realcluster-test-next-gen
pull-unit-test-next-gen 28c1d2d0123cb9eb62733fd700753b7c281d59e9 link true /test pull-unit-test-next-gen

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

ti-chi-bot[bot] avatar Sep 26 '25 16:09 ti-chi-bot[bot]