
unsafe recovery: some tables are always unavailable when alter tiflash replica after unsafe recovery finished with tiflash

Open Lily2025 opened this issue 7 months ago • 4 comments

Bug Report

What version of TiKV are you using?

TiKV Release Version: 8.5.2
Edition: Community
Git Commit Hash: https://github.com/tikv/tikv/commit/a150e4569fda1c64763fda297f4e09775759de4a
Git Commit Branch: HEAD
UTC Build Time: 2025-04-24 12:10:31
Rust Version: rustc 1.77.0-nightly (89e2160c4 2023-12-27)
Enable Features: memory-engine pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine trace-async-tasks openssl-vendored
Profile: dist_release

What operating system and CPU are you using?

8c/16g

Steps to reproduce

1. Deploy in DR Auto-Sync mode with 2 TiFlash replicas.
2. Run TPC-C.
3. Run unsafe remove-failed-stores (including one TiFlash store); it succeeds:

      "region 52973 demotes peers { id:52974 store_id:1 }, { id:52976 store_id:9 }, { id:52979 store_id:2 }, { id:52980 store_id:502 role:Learner }",
      "region 52927 demotes peers { id:52928 store_id:1 }, { id:52930 store_id:9 }, { id:52933 store_id:2 }",
      "region 52982 demotes peers { id:52983 store_id:1 }, { id:52985 store_id:9 }, { id:52988 store_id:2 }",
      "region 52989 demotes peers { id:52990 store_id:1 }, { id:52992 store_id:9 }, { id:52995 store_id:2 }",
      "region 52996 demotes peers { id:52997 store_id:1 }, { id:52999 store_id:9 }, { id:53002 store_id:2 }",
      "region 53003 demotes peers { id:53004 store_id:1 }, { id:53006 store_id:9 }, { id:53009 store_id:2 }",
      "region 53061 demotes peers { id:53062 store_id:1 }, { id:53064 store_id:9 }, { id:53067 store_id:2 }",
      "region 53038 demotes peers { id:53039 store_id:1 }, { id:53041 store_id:9 }, { id:53044 store_id:2 }",
      "region 53068 demotes peers { id:53069 store_id:1 }, { id:53071 store_id:9 }, { id:53074 store_id:2 }",
      "region 53010 demotes peers { id:53011 store_id:1 }, { id:53013 store_id:9 }, { id:53016 store_id:2 }",
      "region 53017 demotes peers { id:53018 store_id:1 }, { id:53020 store_id:9 }, { id:53023 store_id:2 }",
      "region 53024 demotes peers { id:53025 store_id:1 }, { id:53027 store_id:9 }, { id:53030 store_id:2 }",
      "region 53031 demotes peers { id:53032 store_id:1 }, { id:53034 store_id:9 }, { id:53037 store_id:2 }",
      "region 16 demotes peers { id:17 store_id:1 }, { id:28 store_id:9 }, { id:37 store_id:2 }",
      "region 18 demotes peers { id:19 store_id:1 }, { id:27 store_id:2 }, { id:499 store_id:9 }",
      "region 3 demotes peers { id:29 store_id:9 }, { id:39 store_id:2 }, { id:472 store_id:1 }"
    ],
    "store 8": [
      "region 567 demotes peers { id:568 store_id:1 }, { id:570 store_id:9 }, { id:573 store_id:2 }",
      "region 2100 demotes peers { id:2101 store_id:1 }, { id:2103 store_id:9 }, { id:2106 store_id:2 }",
      "region 53045 demotes peers { id:53046 store_id:1 }, { id:53048 store_id:9 }, { id:53051 store_id:2 }",
      "region 83 demotes peers { id:84 store_id:1 }, { id:86 store_id:9 }, { id:468 store_id:2 }"
    ]
  }
},
{
  "info": "Unsafe recovery Finished",
  "time": "2025-05-06 12:45:01.366",
  "details": [
    "affected table ids: 157, 158, 129, 156, 120, 127, 168, 174, 281474976710648, 124, 155, 176, 178, 172, 126, 125, 152, 170, 122, 123, 180, 121, 28, 153, 154, 26, 108, 128, 182",
    "no newly created empty regions"
  ]
}
]
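
The demotion entries in the recovery output follow a regular shape, so they can be scanned for learner peers (the TiFlash replicas) with a small script. This is only a sketch based on the lines quoted in this report: the helper name `parse_demotion` is hypothetical, and treating a peer with no `role:` field as a voter is an assumption.

```python
import re

# Patterns derived from the log lines quoted in this issue.
LINE_RE = re.compile(r"region (\d+) demotes peers (.*)")
PEER_RE = re.compile(r"\{ id:(\d+) store_id:(\d+)(?: role:(\w+))? \}")


def parse_demotion(line):
    """Return (region_id, peers) for one 'demotes peers' entry, or None."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    peers = [
        # Assumption: a peer printed without a role is a voter.
        {"peer_id": int(pid), "store_id": int(sid), "role": role or "Voter"}
        for pid, sid, role in PEER_RE.findall(m.group(2))
    ]
    return int(m.group(1)), peers


sample = (
    "region 52973 demotes peers { id:52974 store_id:1 }, "
    "{ id:52976 store_id:9 }, { id:52979 store_id:2 }, "
    "{ id:52980 store_id:502 role:Learner }"
)
region_id, peers = parse_demotion(sample)
learner_stores = [p["store_id"] for p in peers if p["role"] == "Learner"]
```

For the first entry above, this picks out store 502 as the only store holding a demoted learner peer.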

What did you expect?

Tables should become available again after altering the TiFlash replica.

What happened?

Some tables remain unavailable after altering the TiFlash replica.

Image

Lily2025, May 15 '25 09:05

/assign glorv

Lily2025, May 15 '25 09:05

/assign v01dstar

Lily2025, May 15 '25 09:05

/severity major

Lily2025, May 15 '25 09:05

NOTE: the issue is not about adding a TiFlash replica again after unsafe recovery; the problem is in unsafe recovery itself.

During unsafe recovery, PD directly tombstones any TiFlash replica whose applied index is higher than the max Raft index of all alive voters, but it does not remove that peer from the Raft state. After unsafe recovery, the Raft group therefore still contains these learner peers while their state in TiFlash is "tombstone", so they can no longer replicate Raft logs. Unsafe recovery relies on PD's "remove down peer" mechanism to remove such peers and add new replicas on demand. But when there is only 1 active TiFlash instance and the peer on it is tombstoned, PD can't find a new store on which to add a new learner, so the TiFlash replica can never be recovered. /cc @Connor1996 @v01dstar

glorv, May 15 '25 09:05
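
The stuck state described above can be illustrated with a toy model. This is a hedged sketch, not PD's actual scheduling code: `LearnerPeer`, `replacement_stores`, and the store and peer ids are all hypothetical, and it assumes a replacement learner has to be placed on a different store than the one holding the tombstoned peer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearnerPeer:
    peer_id: int
    store_id: int
    # Local state on the TiFlash store; the Raft group still lists the peer.
    tombstoned_locally: bool


def replacement_stores(learner, live_tiflash_stores):
    """Stores that could host a fresh learner for this region.

    Assumption: the replacement must go to a store other than the one
    that holds the tombstoned peer.
    """
    if not learner.tombstoned_locally:
        return []  # the peer is healthy, nothing to replace
    return sorted(s for s in live_tiflash_stores if s != learner.store_id)


# Hypothetical ids: the surviving TiFlash store is 600.
stuck = LearnerPeer(peer_id=1001, store_id=600, tombstoned_locally=True)
only_one_store = replacement_stores(stuck, {600})
two_stores = replacement_stores(stuck, {600, 601})
```

With a single live TiFlash store the candidate list is empty, which matches the report: the learner stays in the Raft group but never catches up. With a second live store a candidate exists.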