br
br copied to clipboard
Raw KV Backup is failing due to "Resolve lock timeout"
- What did you do?
We are trying to do a full backup of our raw kv store into our ceph instance using this command:
br backup raw --cf default --pd "http://127.0.0.1:2379" --s3.provider "ceph" --s3.endpoint "http://ceph-internal:7480" -s "s3://tikv-backups/<date>" --send-credentials-to-tikv=true --s3.region "none" --ratelimit 3 --log-file /var/log/tikv-backup.log --start 00 --end ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff
Our TikV database holds around 70 TB of data and a few rather large key value pairs (around 40 MB).
- What did you expect to see?
We expected the backup to run without errors.
- What did you see instead?
The backup fails with error="[tikv:9004]Resolve lock timeout"]
- What version of BR and TiDB/TiKV/PD are you using?
$ br -V
Release Version: v4.0.15
Git Commit Hash: 2ca9fa25f15aec061e281ad84f77dacf622ac9c3
Git Branch: heads/refs/tags/v4.0.15
Go Version: go1.13
UTC Build Time: 2021-09-23 09:48:49
Race Enabled: false
$ pd-server -V
Release Version: v4.0.15
Edition: Community
Git Commit Hash: 605dc238b9a9e54491a5e6b0843f30de33421402
Git Branch: heads/refs/tags/v4.0.15
UTC Build Time: 2021-09-23 09:48:07
$ tikv-server -V
TiKV
Release Version: 4.0.15
Edition: Community
Git Commit Hash: 0a0482dee0a0cf135f56b9f90b412ea641ec3ae2
Git Commit Branch: heads/refs/tags/v4.0.15
UTC Build Time: 2021-09-13 03:10:53
Rust Version: rustc 1.42.0-nightly (0de96d37f 2019-12-19)
Enable Features: jemalloc mem-profiling portable sse protobuf-codec
Profile: dist_release
- Operation logs
relevant br.log entries:
[2022/05/04 17:38:19.479 +00:00] [INFO] [client.go:896] ["try backup"] [startKey=74727864623B00FFFF3A14D6BEBABB6DFBC0128B656504E14837E614744D15CA7981586163F77B0939A37AA73064E1F9836986F096618B026ADCA0F1C7EB0CA9C2A075A4418DA9] [endKey=74727864623B01F16A2706A2140D46779368D753B72975EA2D98D1A9067185FF6DE94931CC446C] [storeID=649575725] ["retry time"=0]
[2022/05/04 17:38:24.135 +00:00] [WARN] [client.go:859] ["reset the connection in handleFineGrained"] [storeID=649575725]
[2022/05/04 17:38:24.135 +00:00] [INFO] [conn.go:277] ["Reset backup client"] [storeID=649575725]
[2022/05/04 17:38:54.136 +00:00] [WARN] [conn.go:291] ["failed to reset grpc connection, retry it"] ["retry time"=0] [error="[BR:Common:ErrFailedToConnect]failed to make connection to store 649575725"]
[2022/05/04 17:39:09.362 +00:00] [INFO] [client.go:896] ["try backup"] [startKey=74727864623B00FFFF3A14D6BEBABB6DFBC0128B656504E14837E614744D15CA7981586163F77B0939A37AA73064E1F9836986F096618B026ADCA0F1C7EB0CA9C2A075A4418DA9] [endKey=74727864623B01F16A2706A2140D46779368D753B72975EA2D98D1A9067185FF6DE94931CC446C] [storeID=649575725] ["retry time"=1]
[2022/05/04 17:39:09.401 +00:00] [INFO] [client.go:951] ["backup streaming finish"] [StoreID=649575725] ["retry time"=1]
[2022/05/04 17:39:09.415 +00:00] [INFO] [client.go:713] ["handle fine grained"] [backoffMs=10000]
[2022/05/04 17:39:09.438 +00:00] [WARN] [backoff.go:329] ["txnLock backoffer.maxSleep 80000ms is exceeded, errors:\n[BR:Common:ErrUnknown]internal error at 2022-05-04T17:37:18.643773397Z\n[BR:Common:ErrUnknown]internal error at 2022-05-04T17:38:16.265197102Z\n[BR:Common:ErrUnknown]internal error at 2022-05-04T17:39:09.415374193Z"]
[2022/05/04 17:39:09.448 +00:00] [INFO] [client.go:511] ["backup range finished"] [take=51h16m53.518837989s]
[2022/05/04 17:39:09.670 +00:00] [INFO] [collector.go:195] ["Raw backup failed summary"] [total-ranges=1] [ranges-succeed=0] [ranges-failed=1] [backup-total-regions=1121414] [unit-name="range start:00 end:ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff"] [error="[tikv:9004]Resolve lock timeout"] [errorVerbose="[tikv:9004]Resolve lock timeout\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/br/pkg/backup.(*Client).fineGrainedBackup\n\tgithub.com/pingcap/br@/pkg/backup/client.go:718\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:544\ngithub.com/pingcap/br/pkg/task.RunBackupRaw\n\tgithub.com/pingcap/br@/pkg/task/backup_raw.go:202\nmain.runBackupRawCommand\n\tgithub.com/pingcap/br@/cmd/br/backup.go:44\nmain.newRawBackupCommand.func1\n\tgithub.com/pingcap/br@/cmd/br/backup.go:136\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:887\nmain.main\n\tgithub.com/pingcap/br@/cmd/br/main.go:56\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357"]
[2022/05/04 17:39:09.714 +00:00] [ERROR] [backup.go:45] ["failed to backup raw kv"] [error="[tikv:9004]Resolve lock timeout"] [errorVerbose="[tikv:9004]Resolve lock timeout\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/br/pkg/backup.(*Client).fineGrainedBackup\n\tgithub.com/pingcap/br@/pkg/backup/client.go:718\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:544\ngithub.com/pingcap/br/pkg/task.RunBackupRaw\n\tgithub.com/pingcap/br@/pkg/task/backup_raw.go:202\nmain.runBackupRawCommand\n\tgithub.com/pingcap/br@/cmd/br/backup.go:44\nmain.newRawBackupCommand.func1\n\tgithub.com/pingcap/br@/cmd/br/backup.go:136\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:887\nmain.main\n\tgithub.com/pingcap/br@/cmd/br/main.go:56\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357"] [stack="main.runBackupRawCommand\n\tgithub.com/pingcap/br@/cmd/br/backup.go:45\nmain.newRawBackupCommand.func1\n\tgithub.com/pingcap/br@/cmd/br/backup.go:136\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:887\nmain.main\n\tgithub.com/pingcap/br@/cmd/br/main.go:56\nruntime.main\n\truntime/proc.go:203"]
[2022/05/04 17:39:09.730 +00:00] [ERROR] [main.go:58] ["br failed"] [error="[tikv:9004]Resolve lock timeout"] [errorVerbose="[tikv:9004]Resolve lock timeout\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/br/pkg/backup.(*Client).fineGrainedBackup\n\tgithub.com/pingcap/br@/pkg/backup/client.go:718\ngithub.com/pingcap/br/pkg/backup.(*Client).BackupRange\n\tgithub.com/pingcap/br@/pkg/backup/client.go:544\ngithub.com/pingcap/br/pkg/task.RunBackupRaw\n\tgithub.com/pingcap/br@/pkg/task/backup_raw.go:202\nmain.runBackupRawCommand\n\tgithub.com/pingcap/br@/cmd/br/backup.go:44\nmain.newRawBackupCommand.func1\n\tgithub.com/pingcap/br@/cmd/br/backup.go:136\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/[email protected]/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/[email protected]/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/[email protected]/command.go:887\nmain.main\n\tgithub.com/pingcap/br@/cmd/br/main.go:56\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357"] [stack="main.main\n\tgithub.com/pingcap/br@/cmd/br/main.go:58\nruntime.main\n\truntime/proc.go:203"]
-
Configuration of the cluster and the task
-
Screenshot/exported-PDF of Grafana dashboard or metrics' graph in Prometheus if possible