tidb-lightning icon indicating copy to clipboard operation
tidb-lightning copied to clipboard

tidb-lightning does not wait for tikv-importer to start correctly

Open wwar opened this issue 5 years ago • 0 comments

Feature Request

Is your feature request related to a problem? Please describe:

If tidb-lightning is started too quickly after tikv-importer, it will lead to a connection error like this one:

[2020/01/25 14:29:54.578 -07:00] [INFO] [lightning.go:194] ["load data source start"]
[2020/01/25 14:29:54.581 -07:00] [INFO] [lightning.go:197] ["load data source completed"] [takeTime=3.600785ms] []
[2020/01/25 14:29:54.582 -07:00] [INFO] [restore.go:251] ["the whole procedure start"]
[2020/01/25 14:29:54.587 -07:00] [INFO] [restore.go:289] ["restore table schema start"] [db=ontime]
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:40] ["Welcome to TiKV Importer."]
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:42] []
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:42] ["Release Version:   4.0.0-beta"]
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:42] ["Git Commit Hash:   53734730d7885ea3ade4ee92e433132d8e6c0f02"]
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:42] ["Git Commit Branch: release-4.0"]
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:42] ["UTC Build Time:    2020-01-17 02:37:07"]
[2020/01/25 14:29:54.635 -07:00] [INFO] [tikv-importer.rs:42] ["Rust Version:      rustc 1.42.0-nightly (0de96d37f 2019-12-19)"]
[2020/01/25 14:29:54.636 -07:00] [INFO] [tikv-importer.rs:44] []
[2020/01/25 14:29:54.636 -07:00] [WARN] [lib.rs:528] ["environment variable `TZ` is missing, using `/etc/localtime`"]
[2020/01/25 14:29:54.639 -07:00] [INFO] [tikv-importer.rs:163] ["import server started"]
[2020/01/25 14:29:54.690 -07:00] [INFO] [tidb.go:99] ["create tables start"] [db=ontime]
[2020/01/25 14:29:54.851 -07:00] [INFO] [tidb.go:117] ["create tables completed"] [db=ontime] [takeTime=161.385782ms] []
[2020/01/25 14:29:54.851 -07:00] [INFO] [restore.go:297] ["restore table schema completed"] [db=ontime] [takeTime=264.193394ms] []
[2020/01/25 14:29:54.856 -07:00] [INFO] [restore.go:551] ["restore all tables data start"]
[2020/01/25 14:29:54.856 -07:00] [INFO] [restore.go:572] ["restore table start"] [table=`ontime`.`ontime`]
[2020/01/25 14:29:54.856 -07:00] [INFO] [restore.go:1314] ["load engines and files start"] [table=`ontime`.`ontime`]
[2020/01/25 14:29:54.857 -07:00] [INFO] [restore.go:1343] ["load engines and files completed"] [table=`ontime`.`ontime`] [enginesCnt=2] [filesCnt=314] [takeTime=954.206µs] []
[2020/01/25 14:29:54.858 -07:00] [ERROR] [restore.go:575] ["restore table failed"] [table=`ontime`.`ontime`] [takeTime=1.396899ms] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused\""]
[2020/01/25 14:29:54.858 -07:00] [ERROR] [restore.go:670] ["restore all tables data failed"] [takeTime=1.688492ms] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused\""]
[2020/01/25 14:29:54.858 -07:00] [ERROR] [restore.go:266] ["run failed"] [step=2] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused\""]
Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused"
[2020/01/25 14:29:54.858 -07:00] [ERROR] [restore.go:272] ["the whole procedure failed"] [takeTime=275.872115ms] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused\""]
[2020/01/25 14:29:54.858 -07:00] [WARN] [restore.go:445] ["stopping periodic actions"] [error="context canceled"]
[2020/01/25 14:29:54.858 -07:00] [ERROR] [main.go:59] ["tidb lightning encountered error"] [error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused\""] [errorVerbose="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:8287: connect: connection refused\"\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/pkg/mod/github.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/pkg/mod/github.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*importer).OpenEngine\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/src/github.com/pingcap/tidb-lightning/lightning/backend/importer.go:111\ngithub.com/pingcap/tidb-lightning/lightning/backend.Backend.OpenEngine\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/src/github.com/pingcap/tidb-lightning/lightning/backend/backend.go:180\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).restoreEngines\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:740\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).restoreTable\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:714\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*RestoreController).restoreTables.func1\n\t/home/jenkins/agent/workspace/release_tidb_4.0/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:574\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"]

Describe the feature you'd like:

My workaround is to start lightning with a sleep call in between launching components:

#!/bin/sh
mysql -e 'drop database if exists ontime'

killall -9 tikv-importer
killall -9 tidb-lightning

BASE=/mnt/evo860

cd $BASE
rm -rf $BASE/tmp && mkdir -p $BASE/tmp
./bin/tikv-importer --import-dir $BASE/tmp --log-level info -A 0.0.0.0:8287 &
sleep 5  
./bin/tidb-lightning -d $BASE/data-sets/ontime-data -importer localhost:8287

tidb-server and tikv-server do not require this sleep call. They can be started slightly before a pd-server, and will retry their connectivity. This is useful for distributed systems since it is hard to guarantee exact order of startup.

Describe alternatives you've considered:

N/A

Teachability, Documentation, Adoption, Optimization:

Same behavior as tidb-server and tikv-server.

wwar avatar Jan 25 '20 21:01 wwar