tiflow The `ticdc_owner_status` is not removed when removing a failed changefeed

What did you do?

# upstream: 127.0.0.1:4000, downstream: 127.0.0.1:4001

cdc cli changefeed create --sink-uri 'mysql://root@127.0.0.1:4001/' -c testcdc3
mysql -u root -h 127.0.0.1 -P 4000 test -e 'create table a(id bigint primary key, val varchar(200)); insert into a values (1, "one"), (2, "two"), (3, "three");'

# force an error
mysql -u root -h 127.0.0.1 -P 4001 test -e 'drop table a;';
mysql -u root -h 127.0.0.1 -P 4000 test -e 'insert into a values (4, "four");'

# speed up "warning" → "failed", optional.
cdc cli unsafe delete-service-gc-safepoint

# wait until ticdc_owner_status changed from 6 to 2
curl -s http://127.0.0.1:8300/metrics | grep ticdc_owner_status

# remove the changefeed after it has failed
cdc cli changefeed remove -c testcdc3

# check the metrics again.
curl -s http://127.0.0.1:8300/metrics | grep ticdc_owner_status
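
The two metric checks above (waiting for the status to reach 2, then confirming the series is gone) can be scripted. This is a minimal sketch, assuming the metrics endpoint from the steps above and the usual Prometheus text format; `fetch_metrics`, `owner_status`, `wait_for_status` and `series_removed`, as well as the `METRICS_URL` and `POLL_INTERVAL` variables, are hypothetical helpers introduced here and not part of `cdc`:

```shell
#!/bin/sh
# Hypothetical helpers for the manual curl | grep checks; not part of cdc.

# Fetch the raw metrics text. The URL default matches the steps above.
fetch_metrics() {
    curl -s "${METRICS_URL:-http://127.0.0.1:8300/metrics}"
}

# Read metrics text on stdin and print the value of ticdc_owner_status
# for changefeed $1. Prints nothing when the series does not exist.
owner_status() {
    awk -v cf="changefeed=\"$1\"" \
        'index($0, "ticdc_owner_status{") == 1 && index($0, cf) > 0 { print $NF }'
}

# Poll until changefeed $1 reports status $2, at most $3 times (default 60).
# Returns 0 once the value is seen, 1 if the attempts run out.
wait_for_status() {
    cf=$1
    want=$2
    tries=${3:-60}
    while [ "$tries" -gt 0 ]; do
        [ "$(fetch_metrics | owner_status "$cf")" = "$want" ] && return 0
        tries=$((tries - 1))
        sleep "${POLL_INTERVAL:-5}"
    done
    return 1
}

# Read metrics text on stdin; succeed only if no ticdc_owner_status
# series for changefeed $1 remains.
series_removed() {
    [ -z "$(owner_status "$1")" ]
}
```

With these, the tail of the reproduction would look like `wait_for_status testcdc3 2`, then `cdc cli changefeed remove -c testcdc3`, then `fetch_metrics | series_removed testcdc3 || echo "series still exported"`. On v6.5.5 the last command is expected to print the message, since the series stays at 2.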

What did you expect to see?

The series ticdc_owner_status{changefeed="testcdc3",namespace="default"} no longer exists

What did you see instead?

It remains at value 2 (failed)

This means a Prometheus Alert will keep firing for a changefeed that is already gone.

Versions of the cluster

v6.5.5

Mar 12 '24 09:03 kennytm

~~While #10513 has not been merged to release-6.5 I don't think that PR has any effect on this issue. I haven't tested release-7.5 though.~~

EDIT: Not reproducible on v7.5.1.

Mar 12 '24 09:03 kennytm

/severity moderate

Mar 13 '24 09:03 fubinzh

if there is a way to immediately fail a changefeed we could quickly check if the recently unstuck #10513 has fixed this 😉

Mar 13 '24 10:03 kennytm

> if there is a way to immediately fail a changefeed we could quickly check if the recently unstuck #10513 has fixed this 😉

Yes, there is a way to do it.

1. Set the cdc server config gc-ttl to 1.
2. Start the cdc server with this config.
3. Create a changefeed and pause it, then wait about 30 minutes to make sure GC has advanced in upstream (the default gc-life-time and gc-interval are both 10 minutes in upstream TiDB).

The changefeed should have already failed by then.
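
Those steps could be scripted roughly as below. This is only a sketch under assumptions: the config file path, the `DRY_RUN` and `GC_WAIT_SECONDS` variables, and the helper names `run`, `write_cdc_config` and `fail_changefeed_fast` are all made up for illustration, and the final `changefeed query` is added here just to inspect the state.

```shell
#!/bin/sh
# Hypothetical helpers; only gc-ttl = 1, the pause and the ~30 minute wait
# come from the steps above.

# Print the command instead of running it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Write a cdc server config file at path $1 with gc-ttl (in seconds) set to 1.
write_cdc_config() {
    printf 'gc-ttl = 1\n' > "$1"
}

# $1: changefeed id, $2: sink URI.
# Assumes the cdc server is already running with the config written above.
fail_changefeed_fast() {
    run cdc cli changefeed create --sink-uri "$2" -c "$1"
    run cdc cli changefeed pause -c "$1"
    # Wait for upstream GC to advance (about 30 minutes with default settings).
    run sleep "${GC_WAIT_SECONDS:-1800}"
    # Inspect the changefeed state afterwards.
    run cdc cli changefeed query -c "$1"
}
```

For example, `write_cdc_config ./cdc.toml`, start the server with `cdc server --config ./cdc.toml` plus your usual flags, then `fail_changefeed_fast testcdc3 'mysql://root@127.0.0.1:4001/'`. Setting `DRY_RUN=1` first only prints the commands.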

Mar 13 '24 12:03 asddongmen

> Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.

i mean this is the step i'd like to skip 😅

Mar 13 '24 21:03 kennytm

> > Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.
>
> i mean this is the step i'd like to skip 😅

Maybe we can add an error injecting API to do it?

Mar 14 '24 02:03 asddongmen

> > > Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.
> >
> > i mean this is the step i'd like to skip 😅
>
> Maybe we can add an error injecting API to do it?

Yeah. But not really high priority if you need to introduce another PR to get this.

Mar 14 '24 15:03 kennytm

Duplicate of #10449, fixed by https://github.com/pingcap/tiflow/pull/10513

May 28 '24 07:05 asddongmen
