tiup cluster prune (tombstone TiFlash) error
Bug Report
Please answer these questions before submitting your issue. Thanks!
- What did you do?
Topology of the tidb-m cluster:
[tidb@container ~]$ tiup cluster display tidb-m
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster display tidb-m
Cluster type: tidb
Cluster name: tidb-m
Cluster version: v6.2.0
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://172.16.0.62:2379/dashboard
Grafana URL: http://172.16.0.150:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
172.16.0.150:9093 alertmanager 172.16.0.150 9093/9094 linux/x86_64 Up /home/tidb/data/alertmanager-9093 /home/tidb/deploy/alertmanager-9093
172.16.0.150:3000 grafana 172.16.0.150 3000 linux/x86_64 Up - /home/tidb/deploy/grafana-3000
172.16.0.61:2379 pd 172.16.0.61 2379/2380 linux/x86_64 Up /home/tidb/data/pd-2379 /home/tidb/deploy/pd-2379
172.16.0.62:2379 pd 172.16.0.62 2379/2380 linux/x86_64 Up|UI /home/tidb/data/pd-2379 /home/tidb/deploy/pd-2379
172.16.0.63:2379 pd 172.16.0.63 2379/2380 linux/x86_64 Up|L /home/tidb/data/pd-2379 /home/tidb/deploy/pd-2379
172.16.0.150:9090 prometheus 172.16.0.150 9090/12020 linux/x86_64 Up /data/prometheus-9090 /home/tidb/deploy/prometheus-9090
172.16.0.150:4000 tidb 172.16.0.150 4000/10080 linux/x86_64 Up - /home/tidb/deploy/tidb-4000
172.16.0.150:4001 tidb 172.16.0.150 4001/10081 linux/x86_64 Up - /home/tidb/deploy/tidb-4001
172.16.0.64:9000 tiflash 172.16.0.64 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000
172.16.0.65:9000 tiflash 172.16.0.65 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000
172.16.0.66:9000 tiflash 172.16.0.66 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000
172.16.0.71:20160 tikv 172.16.0.71 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.72:20160 tikv 172.16.0.72 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.73:20160 tikv 172.16.0.73 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.74:20160 tikv 172.16.0.74 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.75:20160 tikv 172.16.0.75 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.76:20160 tikv 172.16.0.76 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
Total nodes: 17
Then scale in one of the TiFlash nodes:
[tidb@container ~]$ tiup cluster scale-in tidb-m --node 172.16.0.66:9000
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster scale-in tidb-m --node 172.16.0.66:9000
This operation will delete the 172.16.0.66:9000 nodes in tidb-m and all their data.
Do you want to continue? [y/N]:(default=N) y
The component [tiflash] will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...
- [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa.pub
- [Parallel] - UserSSH: user=tidb, host=172.16.0.72
- [Parallel] - UserSSH: user=tidb, host=172.16.0.62
- [Parallel] - UserSSH: user=tidb, host=172.16.0.73
- [Parallel] - UserSSH: user=tidb, host=172.16.0.74
- [Parallel] - UserSSH: user=tidb, host=172.16.0.75
- [Parallel] - UserSSH: user=tidb, host=172.16.0.61
- [Parallel] - UserSSH: user=tidb, host=172.16.0.76
- [Parallel] - UserSSH: user=tidb, host=172.16.0.71
- [Parallel] - UserSSH: user=tidb, host=172.16.0.64
- [Parallel] - UserSSH: user=tidb, host=172.16.0.65
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.66
- [Parallel] - UserSSH: user=tidb, host=172.16.0.63
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[172.16.0.66:9000] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] ShowUptime:false DisplayMode:default Operation:StartOperation}
The component `tiflash` will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
- [ Serial ] - UpdateMeta: cluster=tidb-m, deleted=''
- [ Serial ] - UpdateTopology: cluster=tidb-m
- Refresh instance configs
- Generate config pd -> 172.16.0.61:2379 ... Done
- Generate config pd -> 172.16.0.62:2379 ... Done
- Generate config pd -> 172.16.0.63:2379 ... Done
- Generate config tikv -> 172.16.0.71:20160 ... Done
- Generate config tikv -> 172.16.0.72:20160 ... Done
- Generate config tikv -> 172.16.0.73:20160 ... Done
- Generate config tikv -> 172.16.0.74:20160 ... Done
- Generate config tikv -> 172.16.0.75:20160 ... Done
- Generate config tikv -> 172.16.0.76:20160 ... Done
- Generate config tidb -> 172.16.0.150:4000 ... Done
- Generate config tidb -> 172.16.0.150:4001 ... Done
- Generate config tiflash -> 172.16.0.64:9000 ... Done
- Generate config tiflash -> 172.16.0.65:9000 ... Done
- Generate config prometheus -> 172.16.0.150:9090 ... Done
- Generate config grafana -> 172.16.0.150:3000 ... Done
- Generate config alertmanager -> 172.16.0.150:9093 ... Done
- Reload prometheus and grafana
- Reload prometheus -> 172.16.0.150:9090 ... Done
- Reload grafana -> 172.16.0.150:3000 ... Done
Scaled cluster `tidb-m` in successfully
This is the expected operation. The topology now shows:
[tidb@container ~]$ tiup cluster display tidb-m
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster display tidb-m
Cluster type: tidb
Cluster name: tidb-m
Cluster version: v6.2.0
Deploy user: tidb
SSH type: builtin
Dashboard URL: http://172.16.0.62:2379/dashboard
Grafana URL: http://172.16.0.150:3000
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
172.16.0.150:9093 alertmanager 172.16.0.150 9093/9094 linux/x86_64 Up /home/tidb/data/alertmanager-9093 /home/tidb/deploy/alertmanager-9093
172.16.0.150:3000 grafana 172.16.0.150 3000 linux/x86_64 Up - /home/tidb/deploy/grafana-3000
172.16.0.61:2379 pd 172.16.0.61 2379/2380 linux/x86_64 Up /home/tidb/data/pd-2379 /home/tidb/deploy/pd-2379
172.16.0.62:2379 pd 172.16.0.62 2379/2380 linux/x86_64 Up|UI /home/tidb/data/pd-2379 /home/tidb/deploy/pd-2379
172.16.0.63:2379 pd 172.16.0.63 2379/2380 linux/x86_64 Up|L /home/tidb/data/pd-2379 /home/tidb/deploy/pd-2379
172.16.0.150:9090 prometheus 172.16.0.150 9090/12020 linux/x86_64 Up /data/prometheus-9090 /home/tidb/deploy/prometheus-9090
172.16.0.150:4000 tidb 172.16.0.150 4000/10080 linux/x86_64 Up - /home/tidb/deploy/tidb-4000
172.16.0.150:4001 tidb 172.16.0.150 4001/10081 linux/x86_64 Up - /home/tidb/deploy/tidb-4001
172.16.0.64:9000 tiflash 172.16.0.64 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000
172.16.0.65:9000 tiflash 172.16.0.65 9000/8123/3930/20170/20292/8234 linux/x86_64 Up /home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000
172.16.0.66:9000 tiflash 172.16.0.66 9000/8123/3930/20170/20292/8234 linux/x86_64 Tombstone /home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000
172.16.0.71:20160 tikv 172.16.0.71 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.72:20160 tikv 172.16.0.72 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.73:20160 tikv 172.16.0.73 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.74:20160 tikv 172.16.0.74 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.75:20160 tikv 172.16.0.75 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
172.16.0.76:20160 tikv 172.16.0.76 20160/20180 linux/x86_64 Up /home/tidb/data/tikv-20160 /home/tidb/deploy/tikv-20160
Total nodes: 17
There are some nodes can be pruned:
Nodes: [172.16.0.66:3930]
You can destroy them with the command: tiup cluster prune tidb-m
We can see the Tombstone node.
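Not part of the original report, but for context: the Tombstone status shown by `tiup cluster display` reflects the store state kept in PD. Below is a minimal Go sketch of how that state could be cross-checked against PD directly, assuming PD's standard `/pd/api/v1/stores` endpoint and the `engine=tiflash` store label; only the fields needed for the check are decoded, and the PD address is taken from the cluster above.

```go
package main

// Sketch only: confirm that the scaled-in TiFlash store has reached the
// Tombstone state in PD before running `tiup cluster prune`.

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type pdStores struct {
	Stores []struct {
		Store struct {
			ID        uint64 `json:"id"`
			Address   string `json:"address"`
			StateName string `json:"state_name"`
			Labels    []struct {
				Key   string `json:"key"`
				Value string `json:"value"`
			} `json:"labels"`
		} `json:"store"`
	} `json:"stores"`
}

func main() {
	// PD endpoint taken from the cluster topology above.
	resp, err := http.Get("http://172.16.0.62:2379/pd/api/v1/stores")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out pdStores
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}

	for _, s := range out.Stores {
		for _, l := range s.Store.Labels {
			// TiFlash stores carry the engine=tiflash label; Tombstone is the
			// state that makes them eligible for `tiup cluster prune`.
			if l.Key == "engine" && l.Value == "tiflash" {
				fmt.Printf("tiflash store %d at %s: %s\n",
					s.Store.ID, s.Store.Address, s.Store.StateName)
			}
		}
	}
}
```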
Then run the prune command:
[tidb@container ~]$ tiup cluster prune tidb-m
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster prune tidb-m
- [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa.pub
- [Parallel] - UserSSH: user=tidb, host=172.16.0.72
- [Parallel] - UserSSH: user=tidb, host=172.16.0.61
- [Parallel] - UserSSH: user=tidb, host=172.16.0.73
- [Parallel] - UserSSH: user=tidb, host=172.16.0.62
- [Parallel] - UserSSH: user=tidb, host=172.16.0.75
- [Parallel] - UserSSH: user=tidb, host=172.16.0.76
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.74
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.64
- [Parallel] - UserSSH: user=tidb, host=172.16.0.65
- [Parallel] - UserSSH: user=tidb, host=172.16.0.66
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.150
- [Parallel] - UserSSH: user=tidb, host=172.16.0.63
- [Parallel] - UserSSH: user=tidb, host=172.16.0.71
- [ Serial ] - FindTomestoneNodes
Will destroy these nodes: [172.16.0.66:3930]
Do you confirm this action? [y/N]:(default=N) y
Start destroy Tombstone nodes: [172.16.0.66:3930] ...
- [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:true NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] ShowUptime:false DisplayMode:default Operation:StartOperation}
Stopping component tiflash
Stopping instance 172.16.0.66
Stop tiflash 172.16.0.66:9000 success
Destroying component tiflash
Destroying instance 172.16.0.66
Destroy 172.16.0.66 success
- Destroy tiflash paths: [/home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000/log /home/tidb/deploy/tiflash-9000 /etc/systemd/system/tiflash-9000.service]
Stopping component node_exporter
Stopping instance 172.16.0.66
Stop 172.16.0.66 success
Stopping component blackbox_exporter
Stopping instance 172.16.0.66
Stop 172.16.0.66 success
Destroying monitored 172.16.0.66
Destroying instance 172.16.0.66
Destroy monitored on 172.16.0.66 success
Delete public key 172.16.0.66
Delete public key 172.16.0.66 success
- [ Serial ] - UpdateMeta: cluster=tidb-m, deleted='172.16.0.66:3930'
- [ Serial ] - UpdateTopology: cluster=tidb-m
- Refresh instance configs
- Generate config pd -> 172.16.0.61:2379 ... Done
- Generate config pd -> 172.16.0.62:2379 ... Done
- Generate config pd -> 172.16.0.63:2379 ... Done
- Generate config tikv -> 172.16.0.71:20160 ... Done
- Generate config tikv -> 172.16.0.72:20160 ... Done
- Generate config tikv -> 172.16.0.73:20160 ... Done
- Generate config tikv -> 172.16.0.74:20160 ... Done
- Generate config tikv -> 172.16.0.75:20160 ... Done
- Generate config tikv -> 172.16.0.76:20160 ... Done
- Generate config tidb -> 172.16.0.150:4000 ... Done
- Generate config tidb -> 172.16.0.150:4001 ... Done
- Generate config tiflash -> 172.16.0.64:9000 ... Done
- Generate config tiflash -> 172.16.0.65:9000 ... Done
- Generate config tiflash -> 172.16.0.66:9000 ... Error
- Generate config prometheus -> 172.16.0.150:9090 ... Done
- Generate config grafana -> 172.16.0.150:3000 ... Done
- Generate config alertmanager -> 172.16.0.150:9093 ... Done
- Reload prometheus and grafana
- Reload prometheus -> 172.16.0.150:9090 ... Done
- Reload grafana -> 172.16.0.150:3000 ... Done
Destroy success
There was an error message:
- Generate config tiflash -> 172.16.0.66:9000 ... Error
- What did you expect to see?
Expected: 172.16.0.66:9000 has been scaled in, so it should not go through the "Generate config" step.
- What did you see instead?
This reproduces 100% of the time. I scaled in the 9000 port, but tiup deleted the 3930 port ("Will destroy these nodes: [172.16.0.66:3930]") and then generated config for the 9000 port.
- What version of TiUP are you using (`tiup --version`)?
1.10.2 tiup
Go Version: go1.18.3
Git Ref: v1.10.2
GitHash: 2de5b500c9fae6d418fa200ca150b8d5264d6b19
I found the same issue. Looking at the code at https://github.com/pingcap/tiup/blob/master/pkg/cluster/operation/destroy.go#L588C38-L588C54, TiFlash passes FlashServicePort as the ID, which is puzzling, yet the prune operation still succeeds.
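To make the port mix-up concrete, here is a minimal, hypothetical Go sketch (not tiup's actual code; the type and function names are invented) of the mismatch suggested by that line: the node prune reports is named with the TiFlash flash_service_port (3930), while the cluster topology keys the instance by host:tcp_port (9000). A plain comparison between the two never matches, which would explain why 172.16.0.66:9000 still shows up at the config-refresh step after its files were already destroyed.

```go
package main

import "fmt"

// Hypothetical, simplified view of the two ways the same TiFlash node is
// named. Field and function names are for illustration only; they are not
// tiup's real types.
type tiflashNode struct {
	Host             string
	TCPPort          int // 9000: the port used as the instance ID in the topology
	FlashServicePort int // 3930: the port that appears in the node list prune prints
}

// idInTopology is how the instance is keyed in the cluster metadata and in
// the "Generate config tiflash -> ..." step.
func idInTopology(n tiflashNode) string {
	return fmt.Sprintf("%s:%d", n.Host, n.TCPPort)
}

// idReportedByPrune mimics the ID seen in "Will destroy these nodes: [...]".
func idReportedByPrune(n tiflashNode) string {
	return fmt.Sprintf("%s:%d", n.Host, n.FlashServicePort)
}

func main() {
	node := tiflashNode{Host: "172.16.0.66", TCPPort: 9000, FlashServicePort: 3930}

	deleted := []string{idReportedByPrune(node)} // [172.16.0.66:3930]
	topoID := idInTopology(node)                 // 172.16.0.66:9000

	// A plain string match between the two IDs never succeeds, so the :9000
	// entry would survive in the metadata and still get a "Generate config"
	// attempt even though its deploy directory was already removed.
	keptInMeta := true
	for _, d := range deleted {
		if d == topoID {
			keptInMeta = false
		}
	}
	fmt.Printf("deleted=%v topologyID=%s keptInMeta=%v\n", deleted, topoID, keptInMeta)
}
```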