The `ticdc_owner_status` is not removed when removing a failed changefeed

Open kennytm opened this issue 11 months ago • 7 comments

What did you do?

# upstream: 127.0.0.1:4000, downstream: 127.0.0.1:4001

cdc cli changefeed create --sink-uri 'mysql://root@127.0.0.1:4001/' -c testcdc3
mysql -u root -h 127.0.0.1 -P 4000 test -e 'create table a(id bigint primary key, val varchar(200)); insert into a values (1, "one"), (2, "two"), (3, "three");'

# force an error
mysql -u root -h 127.0.0.1 -P 4001 test -e 'drop table a;';
mysql -u root -h 127.0.0.1 -P 4000 test -e 'insert into a values (4, "four");'

# speed up "warning" → "failed", optional.
cdc cli unsafe delete-service-gc-safepoint

# wait until ticdc_owner_status changes from 6 to 2
curl -s http://127.0.0.1:8300/metrics | grep ticdc_owner_status

# remove the changefeed after it has failed
cdc cli changefeed remove -c testcdc3

# check the metrics again.
curl -s http://127.0.0.1:8300/metrics | grep ticdc_owner_status

What did you expect to see?

The series ticdc_owner_status{changefeed="testcdc3",namespace="default"} no longer exists

What did you see instead?

It remains at value 2 (failed)

This means a Prometheus Alert will keep firing for a changefeed that is already gone.
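
A scripted version of the check above can make the stale series easier to spot. The following is only a minimal sketch: the helper name `owner_status` is hypothetical (it is not part of TiCDC), and it assumes the metrics are in Prometheus text format with the value as the last field on the line.

```shell
# Hypothetical helper: print the value of ticdc_owner_status for one changefeed.
# Reads Prometheus text-format metrics on stdin and prints nothing if the
# series is absent. Assumes the sample value is the last field (no timestamp).
owner_status() {
  awk -v cf="$1" '
    index($0, "ticdc_owner_status{") == 1 &&
    index($0, "changefeed=\"" cf "\"") > 0 { print $NF }
  '
}

# Usage (assumes the cdc server from the steps above listens on 127.0.0.1:8300):
#   curl -s http://127.0.0.1:8300/metrics | owner_status testcdc3
# Empty output after `changefeed remove` would mean the series is gone.
```

In the failing case described here, the helper would still print a value after the changefeed has been removed.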

Versions of the cluster

v6.5.5

kennytm avatar Mar 12 '24 09:03 kennytm

~~While #10513 has not been merged to release-6.5 I don't think that PR has any effect on this issue. I haven't tested release-7.5 though.~~

EDIT: Not reproducible on v7.5.1.

kennytm avatar Mar 12 '24 09:03 kennytm

/severity moderate

fubinzh avatar Mar 13 '24 09:03 fubinzh

if there is a way to immediately fail a changefeed we could quickly check if the recently unstuck #10513 has fixed this 😉

kennytm avatar Mar 13 '24 10:03 kennytm

> if there is a way to immediately fail a changefeed we could quickly check if the recently unstuck #10513 has fixed this 😉

Yes, there is a way to do it.

  1. Set cdc server config gc-ttl to 1.
  2. Start cdc server with this config.
  3. Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream. (The default gc-life-time and gc-interval are both 10 minutes in upstream TiDB.)

By then the changefeed should have already failed.
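
These steps might look roughly like the following. This is a sketch under assumptions, not a verified procedure: the helper `write_fast_fail_config` and the file name `cdc-fast-fail.toml` are made up for illustration, `gc-ttl` is assumed to be a top-level key (in seconds) in the cdc server config file, and `$SINK_URI` stands in for whatever sink URI the changefeed uses.

```shell
# Hypothetical helper for step 1: write a cdc server config with gc-ttl = 1.
write_fast_fail_config() {
  printf 'gc-ttl = 1\n' > "$1"
}

# Steps 2 and 3 are shown as comments because they need a running cluster.
# Add whatever other flags your deployment already passes to these commands.
#   write_fast_fail_config cdc-fast-fail.toml
#   cdc server --config cdc-fast-fail.toml
#   cdc cli changefeed create --sink-uri "$SINK_URI" -c testcdc3
#   cdc cli changefeed pause -c testcdc3
#   # ...then wait about 30 minutes for upstream GC to advance.
```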

asddongmen avatar Mar 13 '24 12:03 asddongmen

> 3. Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.

i mean this is the step i'd like to skip 😅

kennytm avatar Mar 13 '24 21:03 kennytm

> > 3. Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.
>
> i mean this is the step i'd like to skip 😅

Maybe we can add an error injecting API to do it?

asddongmen avatar Mar 14 '24 02:03 asddongmen

> > > 3. Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.
> >
> > i mean this is the step i'd like to skip 😅
>
> Maybe we can add an error injecting API to do it?

Yeah. But not really high priority if you need to introduce another PR to get this.

kennytm avatar Mar 14 '24 15:03 kennytm

Duplicate of #10449, fixed by https://github.com/pingcap/tiflow/pull/10513

asddongmen avatar May 28 '24 07:05 asddongmen