tikv icon indicating copy to clipboard operation
tikv copied to clipboard

Massive amounts of `failed to update max timestamp for region` logs

Open youjiali1995 opened this issue 3 years ago • 1 comments

Bug Report

What version of TiKV are you using?

v6.1.1

What operating system and CPU are you using?

doesn't matter

Steps to reproduce

Kill PD leader. There is no backoff for tso request. If it takes much time to update PD client, there will be massive amounts of failed to update max timestamp for region logs.

[2022/09/15 12:31:29.984 +00:00] [WARN] [pd.rs:868] ["failed to update max timestamp for region [1:17:0]: Other(\"[components/pd_client/src/tso.rs:94]: TimestampRequest channel is closed\")"]
... 910k line
[2022/09/15 12:31:36.301 +00:00] [WARN] [pd.rs:868] ["failed to update max timestamp for region [1:17:0]: Other(\"[components/pd_client/src/tso.rs:94]: TimestampRequest channel is closed\")"]
[2022/09/15 12:31:36.301 +00:00] [ERROR] [client.rs:629] ["failed to send heartbeat"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"Connection timed out\", details: [] })))"]
[2022/09/15 12:31:36.301 +00:00] [INFO] [util.rs:231] ["buckets sender and receiver are stale, refreshing ..."]
[2022/09/15 12:31:36.301 +00:00] [INFO] [util.rs:258] ["update pd client"] [via=] [leader=http://node-1.node-peer.jepsen-tps-1261653-1-351:2379] [prev_via=] [prev_leader=http://node-1.node-peer.jepsen-tps-1261653-1-351:2379]
[2022/09/15 12:31:36.301 +00:00] [INFO] [util.rs:385] ["trying to update PD client done"] [spend=6.009792062s]

youjiali1995 avatar Sep 16 '22 07:09 youjiali1995

A backoff may increase the recovery time. I think we can add a function to the pd client that will check its liveness state. like:

while txn_ext.max_ts_sync_status.load(Ordering::SeqCst) == initial_status {
    match pd_client.get_tso().await {
        Ok(ts) => { ...; break; }
        Err(e) => {
            warn!(
                "failed to update max timestamp for region {}: {:?}",
                region_id, e
            );
            pd_client.wait_until_connected().await;
        }
    }
}

sticnarf avatar Sep 16 '22 07:09 sticnarf

Or how about making it a metrics to check for max timestamp gap, so that logs don't need to print here.

BusyJay avatar Nov 22 '22 19:11 BusyJay