tiflow
The `ticdc_owner_status` is not removed when removing a failed changefeed
What did you do?
# upstream: 127.0.0.1:4000, downstream: 127.0.0.1:4001
cdc cli changefeed create --sink-uri 'mysql://root@127.0.0.1:4001/' -c testcdc3
mysql -u root -h 127.0.0.1 -P 4000 test -e 'create table a(id bigint primary key, val varchar(200)); insert into a values (1, "one"), (2, "two"), (3, "three");'
# force an error
mysql -u root -h 127.0.0.1 -P 4001 test -e 'drop table a;';
mysql -u root -h 127.0.0.1 -P 4000 test -e 'insert into a values (4, "four");'
# speed up "warning" → "failed", optional.
cdc cli unsafe delete-service-gc-safepoint
# wait until ticdc_owner_status changes from 6 to 2
curl -s http://127.0.0.1:8300/metrics | grep ticdc_owner_status
# remove the changefeed after it has failed
cdc cli changefeed remove -c testcdc3
# check the metrics again.
curl -s http://127.0.0.1:8300/metrics | grep ticdc_owner_status
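The two `grep` checks above can be made a bit more precise with a small helper. This is only a sketch: `owner_status` is a hypothetical function name, and it assumes the metrics endpoint emits plain Prometheus text lines of the form `name{labels} value` without a trailing timestamp.

```shell
# Hypothetical helper: read Prometheus text from stdin and print the value of
# ticdc_owner_status for one changefeed. Exits non-zero if the series is absent.
# $1: changefeed ID, $2: namespace (defaults to "default")
owner_status() {
  awk -v cf="$1" -v ns="${2:-default}" '
    index($0, "ticdc_owner_status{") == 1 &&
    index($0, "changefeed=\"" cf "\"") &&
    index($0, "namespace=\"" ns "\"") {
      print $NF      # last field is the sample value (no timestamp assumed)
      found = 1
    }
    END { exit found ? 0 : 1 }
  '
}

# Example use against the address from the steps above:
#   curl -s http://127.0.0.1:8300/metrics | owner_status testcdc3
```

Before the changefeed is removed this should print the current status value; after removal the expected behaviour is no output and a non-zero exit status, which is easy to check in a script.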
What did you expect to see?
The series `ticdc_owner_status{changefeed="testcdc3",namespace="default"}` no longer exists.
What did you see instead?
It remains at value 2 (failed)
This means a Prometheus Alert will keep firing for a changefeed that is already gone.
Versions of the cluster
v6.5.5
~~While #10513 has not been merged to release-6.5 I don't think that PR has any effect on this issue. I haven't tested release-7.5 though.~~
EDIT: Not reproducible on v7.5.1.
/severity moderate
if there is a way to immediately fail a changefeed we could quickly check if the recently unstuck #10513 has fixed this 😉
Yes, there is a way to do it.
- Set the cdc server config `gc-ttl` to `1`.
- Start the cdc server with this config.
- Create a changefeed and pause it, then wait about 30 minutes to make sure GC has advanced in the upstream (the default `gc-life-time` and `gc-interval` are both 10 minutes in the upstream TiDB).
The changefeed should have failed by then.
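For reference, the steps above could be scripted roughly as follows. This is a sketch rather than a tested procedure: `write_cdc_config` and `stall_changefeed` are hypothetical helper names, the config is assumed to be a TOML file with a top-level `gc-ttl` key (in seconds), and the 1800-second wait is just the "about 30 minutes" from the list.

```shell
# Hypothetical helper: write a minimal cdc server config with a tiny gc-ttl.
# $1: config file path, $2: gc-ttl in seconds (defaults to 1)
write_cdc_config() {
  printf 'gc-ttl = %s\n' "${2:-1}" > "$1"
}

# Hypothetical helper: create a changefeed, pause it, wait for upstream GC to
# advance, then print the changefeed so its state can be inspected.
# $1: changefeed ID, $2: sink URI, $3: wait in seconds (defaults to 1800)
stall_changefeed() {
  cdc cli changefeed create --sink-uri "$2" -c "$1" &&
  cdc cli changefeed pause -c "$1" &&
  sleep "${3:-1800}" &&
  cdc cli changefeed query -c "$1"
}

# Example use (the cdc server must be started with the generated config first):
#   write_cdc_config cdc.toml 1
#   cdc server --config cdc.toml &
#   stall_changefeed testcdc3 "$SINK_URI"
```

The long wait is still there, so this only automates the procedure; it does not make the changefeed fail any faster.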
- Create a changefeed and pause it, wait about 30 minutes to make sure GC is advanced in upstream.
i mean this is the step i'd like to skip 😅
Maybe we can add an error injecting API to do it?
Yeah. But not really high priority if you need to introduce another PR to get this.
Duplicate of #10449, fixed by https://github.com/pingcap/tiflow/pull/10513