Slow message handling: `SignificantMsg(RaftLogGcFlushed)` takes 412ms
Development Task
[2025/06/11 02:37:11.909 +00:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {533554, 533553, 1242350} }, outgoing: Configuration { voters: {533552, 533553, 533554} } }, learners: {}, learners_next: {533552}, auto_leave: false }"] [raft_id=533552] [region_id=533551] [thread_id=105]
[2025/06/11 02:37:11.909 +00:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {533554, 533553, 1242350} }, outgoing: Configuration { voters: {} } }, learners: {533552}, learners_next: {}, auto_leave: false }"] [raft_id=533552] [region_id=533551] [thread_id=105]
[2025/06/11 02:37:11.909 +00:00] [INFO] [raft.rs:2660] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {533554, 533553, 1242350} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=533552] [region_id=533551] [thread_id=105]
[2025/06/11 02:37:11.909 +00:00] [INFO] [peer.rs:3935] ["delays destroy"] [reason=UnFlushLogGc] [merged_by_target=false] [peer_id=533552] [region_id=533551] [thread_id=105]
[2025/06/11 02:37:12.209 +00:00] [INFO] [peer.rs:3990] ["starts destroy"] [is_latest_initialized=true] [is_peer_initialized=true] [merged_by_target=false] [peer_id=533552] [region_id=533551] [thread_id=104]
[2025/06/11 02:37:12.209 +00:00] [INFO] [peer.rs:1328] ["begin to destroy"] [peer_id=533552] [region_id=533551] [thread_id=104]
[2025/06/11 02:37:12.209 +00:00] [INFO] [pd.rs:1811] ["remove peer statistic record in pd"] [region_id=533551] [thread_id=34]
[2025/06/11 02:37:12.209 +00:00] [INFO] [peer_storage.rs:1052] ["finish clear peer meta"] [takes=9.82µs] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=533551] [thread_id=104]
[2025/06/11 02:37:12.621 +00:00] [INFO] [peer.rs:1432] ["peer destroy itself"] [keep_data=false] [clean=true] [takes=411.969606ms] [peer_id=533552] [region_id=533551] [thread_id=104]
[2025/06/11 02:37:12.621 +00:00] [INFO] [router.rs:283] ["shutdown mailbox"] [region_id=533551] [thread_id=104]
[2025/06/11 02:37:12.621 +00:00] [INFO] [region.rs:650] ["register deleting data in range"] [end_key=7A7480000000000000FF825F728000000003FF5E2CFD0000000000FA] [start_key=7A7480000000000000FF825F728000000003FF5E0B8D0000000000FA] [region_id=533551] [thread_id=84]
[2025/06/11 02:37:12.621 +00:00] [WARN] [peer.rs:750] ["dbg slow peer message handling"] [elapsed=412.024847ms] [msg=SignificantMsg(RaftLogGcFlushed)] [peer_id=533552] [region_id=533551] [thread_id=104]
The log above is taken from a TiKV instance that runs 238k peers and manages 231k SST files.
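The final WARN line comes from the slow-message instrumentation in the custom branch. A minimal sketch of that idea, timing each peer message and warning when handling exceeds a threshold (the threshold value and function names here are assumptions, not TiKV's actual code):

```rust
use std::time::{Duration, Instant};

// Hypothetical threshold; the real branch may use a different value.
const SLOW_MSG_THRESHOLD: Duration = Duration::from_millis(100);

// Run a message handler and warn when it exceeds the threshold,
// mirroring the "dbg slow peer message handling" log line above.
fn handle_timed<F: FnOnce()>(msg_name: &str, handler: F) -> Duration {
    let start = Instant::now();
    handler();
    let elapsed = start.elapsed();
    if elapsed > SLOW_MSG_THRESHOLD {
        eprintln!(
            "[WARN] dbg slow peer message handling elapsed={:?} msg={}",
            elapsed, msg_name
        );
    }
    elapsed
}

fn main() {
    // Simulate a handler that blocks, like the ~412ms destroy path in the log.
    let took = handle_timed("SignificantMsg(RaftLogGcFlushed)", || {
        std::thread::sleep(Duration::from_millis(150));
    });
    assert!(took >= Duration::from_millis(150));
}
```

Here the handler's cost is inline on the caller's thread, which is exactly why a 412ms destroy blocks the raftstore thread that invoked it.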
Cluster Details
The TiKV binary is built from https://github.com/overvenus/tikv/tree/log-slow-message
TiKV configuration:
[coprocessor]
region-split-size = "2MiB"
[log]
[log.file]
filename = "/var/lib/tikv/log/tikv.log"
max-backups = 100
max-size = 300
[rocksdb]
max-manifest-file-size = "256MiB"
[rocksdb.defaultcf]
compaction-guard-max-output-file-size = "2MiB"
compaction-guard-min-output-file-size = "1MiB"
target-file-size-base = "1MiB"
[rocksdb.writecf]
compaction-guard-max-output-file-size = "2MiB"
compaction-guard-min-output-file-size = "1MiB"
target-file-size-base = "1MiB"
[server]
snap-min-ingest-size = "2KiB"
[storage]
reserve-space = "0MB"
Test scripts:
# Prepare
sysbench --db-driver=mysql --mysql-host=tc-tidb-peer --mysql-port=4000 --mysql-user=root --mysql-db=sbtest --tables=35 --table-size=40351404 --time=3600 --report-interval=3 --threads=20 --auto_inc=true --create_secondary=off --non_index_updates=10 --mysql-ignore-errors=1105 oltp_update_non_index prepare
# Run
sysbench --db-driver=mysql --mysql-host=tc-tidb-peer --mysql-port=4000 --mysql-user=root --mysql-db=sbtest --tables=35 --table-size=40351404 --time=3600 --report-interval=3 --threads=200 --auto_inc=true --create_secondary=off --non_index_updates=10 --mysql-ignore-errors=1105 oltp_update_non_index run
Metrics indicate peer destruction delays in v7.5.6, which slow down raftstore.propose processing. This issue has been mitigated by the fix in https://github.com/tikv/tikv/pull/18040.
Further improvement could be achieved by moving destroy_peer processing out of raftstore's critical path.
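One possible shape of that improvement, sketched below: instead of clearing peer data inline on the raftstore thread (which here held it for ~412ms), enqueue a task to a background worker so raftstore only pays the cost of a channel send. The `DestroyTask` type and worker structure are illustrative assumptions, not TiKV's actual implementation:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical task type; the real destroy path carries more state.
struct DestroyTask {
    region_id: u64,
    peer_id: u64,
}

// Spawn a worker that performs the heavy destroy work (tombstone write,
// range deletion) off the raftstore thread's critical path.
fn spawn_destroy_worker() -> (mpsc::Sender<DestroyTask>, thread::JoinHandle<Vec<u64>>) {
    let (tx, rx) = mpsc::channel::<DestroyTask>();
    let handle = thread::spawn(move || {
        let mut destroyed = Vec::new();
        for task in rx {
            // Stand-in for the expensive cleanup observed in the log.
            eprintln!("destroying peer {} of region {}", task.peer_id, task.region_id);
            destroyed.push(task.region_id);
        }
        destroyed
    });
    (tx, handle)
}

fn main() {
    let (tx, handle) = spawn_destroy_worker();
    // The raftstore thread only enqueues and returns immediately,
    // so propose processing is not blocked behind the destroy.
    tx.send(DestroyTask { region_id: 533551, peer_id: 533552 }).unwrap();
    drop(tx); // close the channel so the worker drains and exits
    let destroyed = handle.join().unwrap();
    assert_eq!(destroyed, vec![533551]);
}
```

The trade-off is that destroy completion becomes asynchronous, so any state that must not be observed until the peer is fully gone (e.g. region ranges being reused) would need explicit synchronization with the worker.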