tidb-lightning
Error when importing: region 46 is not fully replicated
Bug Report
Please answer these questions before submitting your issue. Thanks!
- What did you do? If possible, provide a recipe for reproducing the error.
I have a shell script that imports a ~80 GiB sample dataset into TiDB using Lightning for testing. It uses an unsupported configuration of 1x TiDB, 1x TiKV, and 1x PD:
rm -rf data/pd
rm -rf data/tikv
./bin/tidb-server --path="127.0.0.1:2379" -store tikv &
./bin/pd-server --data-dir=data/pd --log-file=logs/pd.log &
./bin/tikv-server --pd="127.0.0.1:2379" --data-dir=data/tikv -A 127.0.0.1:20165 --log-file=logs/tikv.log &
./bin/tikv-importer --import-dir tmp --log-level info &
./bin/tidb-lightning -d data-sets/ontime-data -importer localhost:20160
- What did you expect to see?
It was working without error until recently. Now it repeatedly reports errors about a region not being fully replicated (of course it isn't: it's a single TiKV server). It still manages to complete, though, and all the data is there.
So technically these should be classed as warnings, not errors. It would also be a nice-to-have if it learned from PD that I intentionally keep only one copy of the data.
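For a deliberately single-replica cluster like this one, PD can be told that one copy per region is intentional, so a one-peer region counts as fully replicated. A minimal sketch using PD's standard configuration format:

```toml
# pd.toml -- declare that a single copy of each region is intentional
[replication]
max-replicas = 1
```

The same setting can also be changed on a running cluster with `pd-ctl config set max-replicas 1`.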
- What did you see instead?
[2020/01/10 13:13:29.352 -07:00] [INFO] [restore.go:1792] ["restore file completed"] [table=`ontime`.`ontime`] [engineNumber=0] [fileIndex=313] [path=/mnt/evo860/data-sets/ontime-data/ontime.ontime.00001.sql:0] [readDur=11.14167766s] [encodeDur=35.341429547s] [deliverDur=8.332825462s] [checksum="{cksum=10892752876034267670,size=365583895,kvs=587544}"] [takeTime=36.233777719s] []
[2020/01/10 13:13:29.352 -07:00] [INFO] [restore.go:928] ["encode kv data and write completed"] [table=`ontime`.`ontime`] [engineNumber=0] [read=80539338461] [written=114560022511] [takeTime=15m55.575505508s] []
[2020/01/10 13:13:29.352 -07:00] [INFO] [backend.go:266] ["engine close start"] [engineTag=`ontime`.`ontime`:0] [engineUUID=9d2626e1-dca4-5efb-a07a-8336c493a62a]
[2020/01/10 13:13:29.467 -07:00] [INFO] [kv_importer.rs:103] ["close engine"] [engine="EngineFile { uuid: 9d2626e1-dca4-5efb-a07a-8336c493a62a, path: EnginePath { save: \"/mnt/evo860/tmp/9d2626e1-dca4-5efb-a07a-8336c493a62a\", temp: \"/mnt/evo860/tmp/.temp/9d2626e1-dca4-5efb-a07a-8336c493a62a\" } }"]
[2020/01/10 13:13:29.467 -07:00] [INFO] [backend.go:268] ["engine close completed"] [engineTag=`ontime`.`ontime`:0] [engineUUID=9d2626e1-dca4-5efb-a07a-8336c493a62a] [takeTime=114.822851ms] []
[2020/01/10 13:13:29.467 -07:00] [INFO] [restore.go:784] ["restore engine completed"] [table=`ontime`.`ontime`] [engineNumber=0] [takeTime=15m55.690475619s] []
[2020/01/10 13:13:29.467 -07:00] [INFO] [restore.go:1403] ["import and cleanup engine start"] [engineTag=`ontime`.`ontime`:0] [engineUUID=9d2626e1-dca4-5efb-a07a-8336c493a62a]
[2020/01/10 13:13:29.467 -07:00] [INFO] [backend.go:280] ["import start"] [engineTag=`ontime`.`ontime`:0] [engineUUID=9d2626e1-dca4-5efb-a07a-8336c493a62a] [retryCnt=0]
[2020/01/10 13:13:29.468 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=127.0.0.1:2379]
[2020/01/10 13:13:29.468 -07:00] [INFO] [<unknown>] ["New connected subchannel at 0x7efb9c84b150 for subchannel 0x7efb9c42f000"]
[2020/01/10 13:13:29.471 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:29.471 -07:00] [INFO] [<unknown>] ["New connected subchannel at 0x7efb9c84b120 for subchannel 0x7efb9c42f400"]
[2020/01/10 13:13:29.472 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:29.472 -07:00] [INFO] [<unknown>] ["New connected subchannel at 0x7efb9c84b0c0 for subchannel 0x7efb9c42f800"]
[2020/01/10 13:13:29.472 -07:00] [INFO] [util.rs:455] ["connected to PD leader"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:29.472 -07:00] [INFO] [util.rs:384] ["all PD endpoints are consistent"] [endpoints="[\"127.0.0.1:2379\"]"]
[2020/01/10 13:13:32.195 -07:00] [INFO] [import.rs:56] [start] [tag="[ImportJob 9d2626e1-dca4-5efb-a07a-8336c493a62a]"]
[2020/01/10 13:13:32.195 -07:00] [INFO] [prepare.rs:76] [start] [tag="[PrepareJob 9d2626e1-dca4-5efb-a07a-8336c493a62a]"]
[2020/01/10 13:13:32.206 -07:00] [INFO] [prepare.rs:80] ["get size properties"] [size=117503282223] [tag="[PrepareJob 9d2626e1-dca4-5efb-a07a-8336c493a62a]"]
[2020/01/10 13:13:32.206 -07:00] [INFO] [prepare.rs:172] [start] [range="RangeInfo { range: end: 7480000000000000FF2F5F728000000000FF2B0C2C0000000000FAFFFFFFFFA1E728D2, size: 536942611 }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:0]"]
[2020/01/10 13:13:32.206 -07:00] [INFO] [prepare.rs:190] [prepare] [takes=394.169µs] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:0]"]
[2020/01/10 13:13:32.206 -07:00] [INFO] [prepare.rs:172] [start] [range="RangeInfo { range: start: 7480000000000000FF2F5F728000000000FF2B0C2C0000000000FAFFFFFFFFA1E728D2 end: 7480000000000000FF2F5F728000000000FF6C0C880000000000FAFFFFFFFFA1E728D2, size: 536909914 }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:1]"]
[2020/01/10 13:13:32.207 -07:00] [INFO] [<unknown>] ["New connected subchannel at 0x7ef4bf44a150 for subchannel 0x7efb9c42fc00"]
[2020/01/10 13:13:32.218 -07:00] [INFO] [prepare.rs:288] [split] [at=7480000000000000FF2F5F728000000000FF6C0C880000000000FAFFFFFFFFA1E728D2] [region="RegionInfo { region: id: 2 start_key: 7480000000000000FF2F00000000000000F8 region_epoch { conf_ver: 1 version: 22 } peers { id: 3 store_id: 1 }, leader: Some(id: 3 store_id: 1) }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:1]"]
[2020/01/10 13:13:32.237 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.237 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:32.237 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:32.238 -07:00] [INFO] [util.rs:455] ["connected to PD leader"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:32.238 -07:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."]
[2020/01/10 13:13:32.238 -07:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=1.235315ms]
[2020/01/10 13:13:32.238 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.238 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.238 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.239 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.239 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.239 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.239 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.239 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.239 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 46 is not fully replicated\") }))"]
[2020/01/10 13:13:32.240 -07:00] [WARN] [prepare.rs:313] ["scatter region failed"] [err="[/rust/git/checkouts/tikv-71ea2042335c4528/2e0379c/components/pd_client/src/util.rs:334]: fail to request"] [region=46] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:1]"]
[2020/01/10 13:13:33.240 -07:00] [INFO] [prepare.rs:190] [prepare] [takes=1.033672901s] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:1]"]
[2020/01/10 13:13:33.240 -07:00] [INFO] [prepare.rs:172] [start] [range="RangeInfo { range: start: 7480000000000000FF2F5F728000000000FF6C0C880000000000FAFFFFFFFFA1E728D2 end: 7480000000000000FF2F5F728000000000FF92A2540000000000FAFFFFFFFFA1E728D2, size: 536913135 }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:2]"]
[2020/01/10 13:13:33.253 -07:00] [INFO] [prepare.rs:288] [split] [at=7480000000000000FF2F5F728000000000FF92A2540000000000FAFFFFFFFFA1E728D2] [region="RegionInfo { region: id: 2 start_key: 7480000000000000FF2F5F728000000000FF6C0C880000000000FA region_epoch { conf_ver: 1 version: 23 } peers { id: 3 store_id: 1 }, leader: Some(id: 3 store_id: 1) }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:2]"]
[2020/01/10 13:13:33.270 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.270 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:33.271 -07:00] [INFO] [util.rs:396] ["connecting to PD endpoint"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:33.271 -07:00] [INFO] [util.rs:455] ["connected to PD leader"] [endpoints=http://127.0.0.1:2379]
[2020/01/10 13:13:33.271 -07:00] [INFO] [util.rs:175] ["heartbeat sender and receiver are stale, refreshing ..."]
[2020/01/10 13:13:33.271 -07:00] [WARN] [util.rs:194] ["updating PD client done"] [spend=1.108416ms]
[2020/01/10 13:13:33.271 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.272 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.272 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.272 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.272 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.272 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.272 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.273 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.273 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 48 is not fully replicated\") }))"]
[2020/01/10 13:13:33.273 -07:00] [WARN] [prepare.rs:313] ["scatter region failed"] [err="[/rust/git/checkouts/tikv-71ea2042335c4528/2e0379c/components/pd_client/src/util.rs:334]: fail to request"] [region=48] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:2]"]
[2020/01/10 13:13:34.273 -07:00] [INFO] [prepare.rs:190] [prepare] [takes=1.033316869s] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:2]"]
[2020/01/10 13:13:34.273 -07:00] [INFO] [prepare.rs:172] [start] [range="RangeInfo { range: start: 7480000000000000FF2F5F728000000000FF92A2540000000000FAFFFFFFFFA1E728D2 end: 7480000000000000FF2F5F728000000000FFB960C50000000000FAFFFFFFFFA1E728D2, size: 536909786 }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:3]"]
[2020/01/10 13:13:34.284 -07:00] [INFO] [prepare.rs:288] [split] [at=7480000000000000FF2F5F728000000000FFB960C50000000000FAFFFFFFFFA1E728D2] [region="RegionInfo { region: id: 2 start_key: 7480000000000000FF2F5F728000000000FF92A2540000000000FA region_epoch { conf_ver: 1 version: 24 } peers { id: 3 store_id: 1 }, leader: Some(id: 3 store_id: 1) }"] [tag="[PrepareRangeJob 9d2626e1-dca4-5efb-a07a-8336c493a62a:3]"]
[2020/01/10 13:13:34.301 -07:00] [ERROR] [util.rs:326] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: RpcStatusCode(2), details: Some(\"region 50 is not fully replicated\") }))"]
- Versions of the cluster
TiDB-Lightning version (run tidb-lightning -V):
./tidb-lightning -V
Release Version: v3.0.7-4-g42b7585
Git Commit Hash: 42b7585afb40a90b46b60fab203b1e8c99c56a3f
Git Branch: master
UTC Build Time: 2020-01-10 06:17:20
Go Version: go version go1.12 linux/amd64
TiKV-Importer version (run tikv-importer -V):
./tikv-importer -V
TiKV Importer 4.0.0-alpha
TiKV version (run tikv-server -V):
./tikv-server -V
TiKV
Release Version: 4.0.0-alpha
Git Commit Hash: f61c09bb3943a5e1cd594993fa6886ed661e461e
Git Commit Branch: master
UTC Build Time: 2020-01-08 01:36:23
Rust Version: rustc 1.42.0-nightly (0de96d37f 2019-12-19)
Enable Features: jemalloc portable sse protobuf-codec
Profile: dist_release
TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-alpha-1334-g07e642c92
Git Commit Hash: 07e642c9230ccb7c1537b27442f1fe8433e65f8a
Git Branch: master
UTC Build Time: 2020-01-08 08:32:04
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)
Other interesting information (system version, hardware config, etc):
I can reproduce this on two different servers. Both have 16 cores; one has an NVMe disk with 64 GiB RAM, and this one has a SATA disk and 32 GiB RAM.
Thanks for the report. The error means that scattering failed. It is not fatal, but the cluster will not be balanced initially.
PTAL @disksing The "not fully replicated" error for region scatter was introduced in pingcap/pd#1897, where the original error condition len(peers) != max_replicas was replaced by the stricter len(learners) == 0 && len(peers) != max_replicas. So there are some left-over learners.
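The distinction between the two conditions can be sketched as follows. This is my simplified reading of the change described above, not PD's actual code; the `Region` type and both function names are hypothetical.

```go
package main

import "fmt"

// Hypothetical, simplified model of PD's "fully replicated" check.
// Names here are illustrative, not PD's real API.
type Region struct {
	Voters   int // normal (voter) peers
	Learners int // learner peers still catching up
}

// Before pingcap/pd#1897: only the total peer count was compared.
func replicatedOld(r Region, maxReplicas int) bool {
	return r.Voters+r.Learners == maxReplicas
}

// After pingcap/pd#1897: any left-over learner also blocks scattering.
func replicatedNew(r Region, maxReplicas int) bool {
	return r.Learners == 0 && r.Voters+r.Learners == maxReplicas
}

func main() {
	// A region with a left-over learner on a max-replicas=3 cluster:
	r := Region{Voters: 2, Learners: 1}
	fmt.Println(replicatedOld(r, 3)) // true: 3 peers in total
	fmt.Println(replicatedNew(r, 3)) // false: a learner is still present
}
```

Under this reading, a region that previously passed the check can now fail it purely because a learner has not yet been promoted or removed, which matches the "left-over learners" explanation.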
@wwar I suppose the PD version is also latest? Have you updated PD before the errors happened?
That is correct. I update fairly frequently from daily builds.
Is this bug still reproducible? Shall we close this issue?
Still there against master today:
[2020/08/24 22:09:36.280 -06:00] [WARN] [util.rs:199] ["updating PD client done"] [spend=1.290002ms]
[2020/08/24 22:09:36.280 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.280 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.280 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.281 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.281 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.281 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.281 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.281 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
[2020/08/24 22:09:36.281 -06:00] [ERROR] [util.rs:332] ["request failed"] [err="Grpc(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"region 244 is not fully replicated\") }))"]
Since this is a non-fatal error and expected behavior, I think it would be OK to change the error level to WARN to make it less confusing? @kennytm
@glorv that code is used not just by sst_importer but by the entire TiKV code base when interacting with PD. I think you'd better ask the PD team (sig-scheduling) about this.
@nolouch PTAL. https://github.com/tikv/tikv/blob/84466febe406809aab936197ec897b6adac36d70/components/pd_client/src/util.rs#L315 Since this is a non-fatal error (because it is inside a retry loop), it would be better to change this log from ERROR level to WARN level to make it less confusing.