tidb lightning: daily run finds "Lightning is stuck" for hours
[git-hash=b9a31b231a7d9a64da81cb071b3db26fcb55cc38]
Lightning config (toml):
[lightning]
level = "info"
check-requirements = false
status-addr = ':8289'
file = "/tmp/tidb-lightning_1719543717.log"
[tikv-importer]
backend = "local"
incremental-import = false
sorted-kv-dir = "/tiup/sorted-kv-dir"
[tidb]
# Information of the target cluster
port = 4000
user = "root"
password = ""
host = "tidb-1-peer"
status-port = 10080
pd-addr = "pd-peer:2379"
[mydumper]
no-schema = true
data-source-dir = 's3://tmp/test?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http://minio-peer:9000'
[mydumper.csv]
header = false
[checkpoint]
# Whether to enable checkpoints.
enable = true
driver = "file"
[post-restore]
checksum = true
According to the goroutine dump, we are stuck at
https://github.com/pingcap/tidb/blob/b9a31b231a7d9a64da81cb071b3db26fcb55cc38/br/pkg/restore/split/split.go#L96-L109
which is busy waiting on
https://github.com/pingcap/tidb/blob/b9a31b231a7d9a64da81cb071b3db26fcb55cc38/br/pkg/utils/retry.go#L239-L244
There are no error logs, and the select does not seem to have been waiting for more than 1 minute.
I suspect this condition is erroneously triggered
https://github.com/pingcap/tidb/blob/b9a31b231a7d9a64da81cb071b3db26fcb55cc38/br/pkg/restore/split/split.go#L132-L136
which puts the retry loop into an infinite loop.
This is caused by the PD API update in https://github.com/pingcap/tidb/pull/54153, while the test cluster is using an old version of PD. Lightning does not print the error details, and this needs to be improved.
We are still discussing what the expected fix should be.