Massive amounts of `failed to update max timestamp for region` logs
Bug Report
What version of TiKV are you using?
v6.1.1
What operating system and CPU are you using?
doesn't matter
Steps to reproduce
Kill the PD leader. There is no backoff for TSO requests, so if updating the PD client takes a long time, TiKV emits massive amounts of `failed to update max timestamp for region` logs:
```
[2022/09/15 12:31:29.984 +00:00] [WARN] [pd.rs:868] ["failed to update max timestamp for region [1:17:0]: Other(\"[components/pd_client/src/tso.rs:94]: TimestampRequest channel is closed\")"]
... (~910k similar lines omitted) ...
[2022/09/15 12:31:36.301 +00:00] [WARN] [pd.rs:868] ["failed to update max timestamp for region [1:17:0]: Other(\"[components/pd_client/src/tso.rs:94]: TimestampRequest channel is closed\")"]
[2022/09/15 12:31:36.301 +00:00] [ERROR] [client.rs:629] ["failed to send heartbeat"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: \"Connection timed out\", details: [] })))"]
[2022/09/15 12:31:36.301 +00:00] [INFO] [util.rs:231] ["buckets sender and receiver are stale, refreshing ..."]
[2022/09/15 12:31:36.301 +00:00] [INFO] [util.rs:258] ["update pd client"] [via=] [leader=http://node-1.node-peer.jepsen-tps-1261653-1-351:2379] [prev_via=] [prev_leader=http://node-1.node-peer.jepsen-tps-1261653-1-351:2379]
[2022/09/15 12:31:36.301 +00:00] [INFO] [util.rs:385] ["trying to update PD client done"] [spend=6.009792062s]
```
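The obvious mitigation would be to retry the TSO request with a backoff. A minimal sketch of what that could look like (the loop shape, the constants, and the use of `tokio::time::sleep` are my assumptions for illustration, not TiKV's actual timer facilities):

```rust
use std::time::Duration;

// Hypothetical retry loop: back off exponentially between failed TSO
// attempts instead of retrying immediately in a tight loop.
const MAX_BACKOFF: Duration = Duration::from_secs(1);

let mut backoff = Duration::from_millis(10);
let ts = loop {
    match pd_client.get_tso().await {
        Ok(ts) => break ts,
        Err(e) => {
            warn!("TSO request failed, retrying in {:?}: {:?}", backoff, e);
            tokio::time::sleep(backoff).await;
            backoff = (backoff * 2).min(MAX_BACKOFF);
        }
    }
};
```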
A backoff like this may increase recovery time, though. Instead, I think we can add a function to the PD client that checks its liveness state, like:
```rust
// Keep retrying until the region's max ts sync status changes.
while txn_ext.max_ts_sync_status.load(Ordering::SeqCst) == initial_status {
    match pd_client.get_tso().await {
        Ok(ts) => {
            // ... update the max timestamp with `ts` ...
            break;
        }
        Err(e) => {
            warn!(
                "failed to update max timestamp for region {}: {:?}",
                region_id, e
            );
            // Proposed new API: block here until the PD client has
            // reconnected, instead of retrying (and logging) in a busy loop.
            pd_client.wait_until_connected().await;
        }
    }
}
```
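For illustration, `wait_until_connected` could be built on a watch channel that the client's reconnect logic updates. This is only a sketch under my own assumptions (the `Liveness` struct, its field names, and the wiring into the PD client are all hypothetical):

```rust
use tokio::sync::watch;

// Hypothetical liveness state owned by the PD client. The reconnect
// path sends `true` on the sender half; the disconnect path sends `false`.
struct Liveness {
    connected: watch::Receiver<bool>,
}

impl Liveness {
    // Resolves immediately if already connected, otherwise parks the
    // caller until the reconnect path flips the flag.
    async fn wait_until_connected(&self) {
        let mut rx = self.connected.clone();
        let _ = rx.wait_for(|connected| *connected).await;
    }
}
```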
Alternatively, how about exposing the max timestamp gap as a metric, so that logs don't need to be printed here?
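A gauge along these lines would be enough to alert on; the metric name and help text below are made up for illustration:

```rust
use lazy_static::lazy_static;
use prometheus::{register_int_gauge, IntGauge};

lazy_static! {
    // Hypothetical gauge: seconds since the last successful max
    // timestamp update, so monitoring can alert on a growing gap
    // instead of the log flooding with warnings.
    static ref MAX_TS_SYNC_GAP: IntGauge = register_int_gauge!(
        "tikv_max_ts_sync_gap_seconds",
        "Seconds since the last successful max timestamp update"
    )
    .unwrap();
}
```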