kvclient(ticdc): fix the resolved ts lag increase caused by an incorrect store ID failing the store version check (#12172)

Open ti-chi-bot opened this issue 6 months ago • 3 comments

This is an automated cherry-pick of #12172

What problem does this PR solve?

Issue Number: close #12162

What is changed and how it works?

Before establishing a connection to TiKV, the KV client checks version compatibility by querying the store information from PD, using the store ID obtained from the RegionCache. The RegionCache may return a stale store ID, which causes the version check to fail.

This PR uses PD's GetAllStores method to fetch the information of all TiKV stores. Because no store ID has to be specified, a stale ID can no longer break the version check.

Check List

Tests

  • Manual test (add detailed scripts or steps below)
VERSION=v8.5.1

# 1. Start a playground
tiup playground $VERSION --db 1 --kv 4 --pd 1 --tiflash 0
# 2. Start CDC server (run it independently of the playground so we can monitor when it crashes)
tiup cdc:$VERSION server

# 3. Prepare some data
tiup bench tpcc prepare --warehouses 4 -T 8
# make sure every table is eligible
mysql -u root -h 127.0.0.1 -P 4000 -e 'alter table test.history add primary key (h_c_id, h_c_d_id, h_c_w_id);'

# 4. Scatter all regions
# (or just transfer-leader one certain region to 127.0.0.1:20160)
for region_id in $(tiup ctl:$VERSION pd region --jq '.regions[].id'); do tiup ctl:$VERSION pd operator add scatter-region $region_id; done

# 5. Start a changefeed
tiup cdc:$VERSION cli changefeed create -c test --sink-uri 'blackhole://'

# 6. Scale-in 127.0.0.1:20160 (*must* use :20160 in order for step 8 to reuse this port)
tiup playground scale-in --pid $(pgrep -f '127.0.0.1:20160')

# 7. Wait until the TiKV instance is removed
tiup playground display

# 8. Scale-out a tikv
tiup playground scale-out --kv 1

# 9. Scatter again
# (or just transfer-leader one certain region to 127.0.0.1:20160)
for region_id in $(tiup ctl:$VERSION pd region --jq '.regions[].id'); do tiup ctl:$VERSION pd operator add scatter-region $region_id; done

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Fix the resolved ts lag caused by using a stale store ID after scaling in and scaling out TiKV instances on the same IP address.

ti-chi-bot avatar Jun 18 '25 07:06 ti-chi-bot

@3AceShowHand This PR has conflicts, so I have put it on hold. Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

ti-chi-bot avatar Jun 18 '25 07:06 ti-chi-bot

/gemini review

3AceShowHand avatar Jul 02 '25 11:07 3AceShowHand

/retest

3AceShowHand avatar Jul 03 '25 03:07 3AceShowHand

/retest

3AceShowHand avatar Jul 03 '25 09:07 3AceShowHand

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

ti-chi-bot[bot] avatar Jul 03 '25 09:07 ti-chi-bot[bot]

/unhold

3AceShowHand avatar Jul 10 '25 09:07 3AceShowHand

/retest

3AceShowHand avatar Jul 11 '25 02:07 3AceShowHand

/retest

3AceShowHand avatar Jul 11 '25 05:07 3AceShowHand

/retest

3AceShowHand avatar Jul 11 '25 06:07 3AceShowHand

/retest

JQWong7 avatar Jul 11 '25 09:07 JQWong7