
tidb lightning: daily run finds "Lightning is stuck" for hours

Open shaoxiqian opened this issue 1 year ago • 2 comments

[git-hash=b9a31b231a7d9a64da81cb071b3db26fcb55cc38]

toml


        [lightning]
        level = "info"
        check-requirements = false
        status-addr = ':8289'
        file = "/tmp/tidb-lightning_1719543717.log"

        [tikv-importer]
        backend = "local"
        incremental-import = false
        sorted-kv-dir = "/tiup/sorted-kv-dir"

        [tidb]
        # Information of the target cluster
        port = 4000
        user = "root"
        password = ""
        host = "tidb-1-peer"
        status-port = 10080
        pd-addr = "pd-peer:2379"

        [mydumper]
        no-schema = true
        data-source-dir = 's3://tmp/test?access-key=minioadmin&secret-access-key=minioadmin&endpoint=http://minio-peer:9000'
        [mydumper.csv]
        header = false

        [checkpoint]
        # Whether to enable checkpoints.
        enable = true
        driver = "file"

        [post-restore]
        checksum = true


shaoxiqian, Jun 28 '24 08:06

According to the goroutine stack, we are stuck at

https://github.com/pingcap/tidb/blob/b9a31b231a7d9a64da81cb071b3db26fcb55cc38/br/pkg/restore/split/split.go#L96-L109

which is busy-waiting on

https://github.com/pingcap/tidb/blob/b9a31b231a7d9a64da81cb071b3db26fcb55cc38/br/pkg/utils/retry.go#L239-L244

There are no error logs, and the select does not seem to wait for more than 1 minute.

I suspect this condition is erroneously triggered

https://github.com/pingcap/tidb/blob/b9a31b231a7d9a64da81cb071b3db26fcb55cc38/br/pkg/restore/split/split.go#L132-L136

which puts the retry loop into an infinite loop.

kennytm, Jun 28 '24 09:06

This is caused by the PD API update in https://github.com/pingcap/tidb/pull/54153, while the test cluster is still running an old version of PD. Lightning does not print the error details, and this needs to be improved.

We are still discussing what the expected fix should be.

lance6716, Jul 01 '24 07:07