Outlier point in TiKV resolved ts lag metrics
What did you do?
Run TiCDC replication with a large number of regions.
What did you expect to see?
No response
What did you see instead?

- The replication works normally, without any large lag

- But the resolved ts lag duration percentile in the TiKV metrics has an outlier point, showing a 2.3 hour lag in resolved ts
I checked the CDC_RESOLVED_TS_GAP_HISTOGRAM metric in TiKV. Its buckets are exponential_buckets(0.001, 2.0, 24),
so the largest bucket bound is 0.001 * 2^(24-1) = 8388.608s = 2.33h, which matches the outlier. I suspect a zero resolved_ts is counted when calculating the resolved ts gap.
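To make the arithmetic above concrete, here is a rough Python sketch of the bucket layout and of what a zero resolved_ts would do to the gap. It is not the TiKV code: `resolved_ts_gap_seconds` and `finite_bucket_for` are hypothetical helpers, and the gap formula (wall clock minus the physical part of the TSO, which keeps 18 logical bits in the low end) is only an assumption about how the gap might be computed.

```python
import math

def exponential_buckets(start, factor, count):
    """Upper bounds start, start*factor, ... (count values), as named in the issue."""
    bounds = []
    bound = start
    for _ in range(count):
        bounds.append(bound)
        bound *= factor
    return bounds

TSO_LOGICAL_BITS = 18  # TSO layout: physical milliseconds in the high bits, 18 logical bits

def resolved_ts_gap_seconds(now_ms, resolved_ts):
    """Hypothetical gap: wall clock minus the physical part of resolved_ts (an assumption)."""
    physical_ms = resolved_ts >> TSO_LOGICAL_BITS
    return (now_ms - physical_ms) / 1000.0

def finite_bucket_for(value, bounds):
    """Smallest finite upper bound that holds value, or None if only +Inf can."""
    for bound in bounds:
        if value <= bound:
            return bound
    return None

bounds = exponential_buckets(0.001, 2.0, 24)
top = bounds[-1]  # 0.001 * 2^23 seconds

now_ms = 1_640_000_000_000                      # arbitrary wall-clock time, in ms
healthy_ts = (now_ms - 1_500) << TSO_LOGICAL_BITS  # resolved ts 1.5s behind the clock
healthy_gap = resolved_ts_gap_seconds(now_ms, healthy_ts)
zero_gap = resolved_ts_gap_seconds(now_ms, 0)   # resolved_ts == 0

print("top finite bucket (s):", top, "hours:", top / 3600)
print("healthy gap bucket:", finite_bucket_for(healthy_gap, bounds))
print("zero resolved_ts gap bucket:", finite_bucket_for(zero_gap, bounds))
```

Under these assumptions, a zero resolved_ts gives a gap of decades, which only the +Inf bucket can hold. When a quantile lands in the +Inf bucket, Prometheus `histogram_quantile` returns the upper bound of the highest finite bucket, which would explain a plotted point at exactly 2.33h.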
Versions of the cluster
Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):
TiCDC version (execute cdc version):
Both v4.1.14 and v5.2.1.
Maybe we can try quantile_over_time?
See https://grafana.com/blog/2020/10/20/quick-tip-how-prometheus-can-make-visualizing-noisy-data-easier/
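As a rough illustration of why quantile_over_time could help here, the Python sketch below takes the median of a window of samples that contains a single spike. The sample values are made up for illustration, and the interpolation (linear, between neighbouring sorted samples) is my understanding of how Prometheus computes it, so treat it as an assumption rather than a reference implementation.

```python
import math

def quantile_over_time(q, samples):
    """q-quantile (0 <= q <= 1) of the samples in one window, with linear interpolation."""
    if not samples:
        return math.nan
    ordered = sorted(samples)
    rank = q * (len(ordered) - 1)
    lower = int(math.floor(rank))
    upper = min(len(ordered) - 1, lower + 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

# Made-up lag samples (seconds) in one window, with one outlier at the top bucket bound.
window = [1.2, 1.1, 1.3, 8388.608, 1.2, 1.0, 1.1]

print("max over window:   ", max(window))
print("median over window:", quantile_over_time(0.5, window))
```

The median ignores the single spike, while a max or a raw plot shows it. Note that this only smooths the panel; it would not fix a zero resolved_ts being observed in the first place.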
Reproduced this issue in v5.3.0
The test case is network loss on all TiKV nodes, from test-infra.
/unassign @overvenus @zhaoxinyu /assign @sdojjy
Close it since it's stale.