Outlier point in tikv resolved ts lag metrics

Open amyangfei opened this issue 4 years ago • 4 comments

What did you do?

Run TiCDC replication with a large number of regions.

What did you expect to see?

No response

What did you see instead?

screenshot-20211013-170229

  • The replication works normally, without any large lag

screenshot-20211013-170202

  • But the resolved ts lag duration percentile in the TiKV metrics has an outlier point, showing a 2.3 hour lag in resolved ts

I checked the CDC_RESOLVED_TS_GAP_HISTOGRAM metric in TiKV; its buckets are exponential_buckets(0.001, 2.0, 24)

The largest bucket bound is 0.001 * 2^(24-1) = 8388.608s = 2.33h, which matches the outlier, so I suspect a zero resolved_ts is counted when calculating the resolved ts gap
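To make the arithmetic concrete, here is a minimal Python sketch of the bucket layout. It is not TiKV code: `exponential_buckets` is re-implemented here for illustration, `reported_upper_bound` is a hypothetical helper, and the "zero resolved_ts" gap is modelled as an assumption (current Unix time minus zero). It relies on the fact that Prometheus `histogram_quantile` reports the highest finite bucket bound when the quantile lands in the +Inf bucket.

```python
def exponential_buckets(start, factor, count):
    """Upper bounds of `count` buckets: start, start*factor, start*factor^2, ..."""
    return [start * factor ** i for i in range(count)]


def reported_upper_bound(gap_s, bounds):
    """Hypothetical helper: the largest value a percentile estimate can show
    for an observation of `gap_s` seconds, given finite bucket bounds."""
    for bound in bounds:
        if gap_s <= bound:
            return bound
    # The observation lands in the implicit +Inf bucket, so the estimate
    # is capped at the highest finite bound.
    return bounds[-1]


buckets = exponential_buckets(0.001, 2.0, 24)
top_bound_s = buckets[-1]          # about 8388.608 seconds
top_bound_h = top_bound_s / 3600   # roughly 2.33 hours

# Assumption: a zero resolved_ts makes the gap roughly "now - 0",
# i.e. the current Unix time in seconds, far beyond every bucket.
bogus_gap_s = 1_634_115_749 - 0
print(top_bound_s, top_bound_h, reported_upper_bound(bogus_gap_s, buckets))
```

Under these assumptions, any absurdly large gap is displayed as exactly the top bucket bound, which would explain why the outlier sits at about 2.33h rather than at some arbitrary value.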

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

TiCDC version (execute cdc version):

Both v4.1.14 and v5.2.1

amyangfei avatar Oct 13 '21 09:10 amyangfei

Maybe we can try quantile_over_time?

See https://grafana.com/blog/2020/10/20/quick-tip-how-prometheus-can-make-visualizing-noisy-data-easier/
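A rough sketch of why a quantile over a time window could hide a single outlier sample. This is plain Python, not PromQL: `quantile` and `quantile_over_time` are re-implemented here for illustration only, and the lag series is synthetic (invented values with one spike at the top bucket bound), not data from this issue.

```python
import math


def quantile(q, values):
    """q-quantile with linear interpolation between the closest ranks."""
    ordered = sorted(values)
    if not ordered:
        return math.nan
    rank = q * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = min(lower + 1, len(ordered) - 1)
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def quantile_over_time(q, series, window):
    """Apply `quantile` to a trailing window of up to `window` samples."""
    return [
        quantile(q, series[max(0, i - window + 1): i + 1])
        for i in range(len(series))
    ]


# Synthetic lag samples in seconds; the spike mimics the outlier point.
lag = [2.0, 2.1, 1.9, 2.2, 8388.608, 2.0, 2.1, 1.8, 2.0, 2.3]
smoothed = quantile_over_time(0.5, lag, 5)
print(max(lag), max(smoothed))
```

With a five-sample window and the median, the single spike never becomes the reported value. The trade-off is that this only masks the symptom in the dashboard; the suspected zero resolved_ts would still be recorded in the histogram.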

overvenus avatar Nov 09 '21 08:11 overvenus

Reproduced this issue in v5.3.0

Tammyxia avatar Nov 11 '21 09:11 Tammyxia

GGOy6YYmtU The test case is network loss on all TiKV nodes, from test-infra.

Tammyxia avatar Nov 11 '21 10:11 Tammyxia

/unassign @overvenus @zhaoxinyu
/assign @sdojjy

nongfushanquan avatar Sep 22 '22 09:09 nongfushanquan

Close it since it's stale.

asddongmen avatar May 21 '24 02:05 asddongmen