
tiup cluster prune (tombstone TiFlash) ERROR

Minorli opened this issue 3 years ago · 1 comment

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?

Topology of the tidb-m cluster:

[tidb@container ~]$ tiup cluster display tidb-m
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster display tidb-m
Cluster type:       tidb
Cluster name:       tidb-m
Cluster version:    v6.2.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://172.16.0.62:2379/dashboard
Grafana URL:        http://172.16.0.150:3000
ID                 Role          Host          Ports                            OS/Arch       Status     Data Dir                           Deploy Dir
172.16.0.150:9093  alertmanager  172.16.0.150  9093/9094                        linux/x86_64  Up         /home/tidb/data/alertmanager-9093  /home/tidb/deploy/alertmanager-9093
172.16.0.150:3000  grafana       172.16.0.150  3000                             linux/x86_64  Up         -                                  /home/tidb/deploy/grafana-3000
172.16.0.61:2379   pd            172.16.0.61   2379/2380                        linux/x86_64  Up         /home/tidb/data/pd-2379            /home/tidb/deploy/pd-2379
172.16.0.62:2379   pd            172.16.0.62   2379/2380                        linux/x86_64  Up|UI      /home/tidb/data/pd-2379            /home/tidb/deploy/pd-2379
172.16.0.63:2379   pd            172.16.0.63   2379/2380                        linux/x86_64  Up|L       /home/tidb/data/pd-2379            /home/tidb/deploy/pd-2379
172.16.0.150:9090  prometheus    172.16.0.150  9090/12020                       linux/x86_64  Up         /data/prometheus-9090              /home/tidb/deploy/prometheus-9090
172.16.0.150:4000  tidb          172.16.0.150  4000/10080                       linux/x86_64  Up         -                                  /home/tidb/deploy/tidb-4000
172.16.0.150:4001  tidb          172.16.0.150  4001/10081                       linux/x86_64  Up         -                                  /home/tidb/deploy/tidb-4001
172.16.0.64:9000   tiflash       172.16.0.64   9000/8123/3930/20170/20292/8234  linux/x86_64  Up         /home/tidb/data/tiflash-9000       /home/tidb/deploy/tiflash-9000
172.16.0.65:9000   tiflash       172.16.0.65   9000/8123/3930/20170/20292/8234  linux/x86_64  Up         /home/tidb/data/tiflash-9000       /home/tidb/deploy/tiflash-9000
172.16.0.66:9000   tiflash       172.16.0.66   9000/8123/3930/20170/20292/8234  linux/x86_64  Up         /home/tidb/data/tiflash-9000       /home/tidb/deploy/tiflash-9000
172.16.0.71:20160  tikv          172.16.0.71   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.72:20160  tikv          172.16.0.72   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.73:20160  tikv          172.16.0.73   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.74:20160  tikv          172.16.0.74   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.75:20160  tikv          172.16.0.75   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.76:20160  tikv          172.16.0.76   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
Total nodes: 17

Then scale in one of the TiFlash nodes:

[tidb@container ~]$ tiup cluster scale-in tidb-m --node 172.16.0.66:9000
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster scale-in tidb-m --node 172.16.0.66:9000
This operation will delete the 172.16.0.66:9000 nodes in tidb-m and all their data.
Do you want to continue? [y/N]:(default=N) y
The component [tiflash] will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
Do you want to continue? [y/N]:(default=N) y
Scale-in nodes...

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.72
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.62
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.73
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.74
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.75
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.61
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.76
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.71
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.64
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.65
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.66
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.63
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [ Serial ] - ClusterOperate: operation=DestroyOperation, options={Roles:[] Nodes:[172.16.0.66:9000] Force:false SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:false NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] ShowUptime:false DisplayMode:default Operation:StartOperation}
The component tiflash will become tombstone, maybe exists in several minutes or hours, after that you can use the prune command to clean it
  • [ Serial ] - UpdateMeta: cluster=tidb-m, deleted=''
  • [ Serial ] - UpdateTopology: cluster=tidb-m
  • Refresh instance configs
    • Generate config pd -> 172.16.0.61:2379 ... Done
    • Generate config pd -> 172.16.0.62:2379 ... Done
    • Generate config pd -> 172.16.0.63:2379 ... Done
    • Generate config tikv -> 172.16.0.71:20160 ... Done
    • Generate config tikv -> 172.16.0.72:20160 ... Done
    • Generate config tikv -> 172.16.0.73:20160 ... Done
    • Generate config tikv -> 172.16.0.74:20160 ... Done
    • Generate config tikv -> 172.16.0.75:20160 ... Done
    • Generate config tikv -> 172.16.0.76:20160 ... Done
    • Generate config tidb -> 172.16.0.150:4000 ... Done
    • Generate config tidb -> 172.16.0.150:4001 ... Done
    • Generate config tiflash -> 172.16.0.64:9000 ... Done
    • Generate config tiflash -> 172.16.0.65:9000 ... Done
    • Generate config prometheus -> 172.16.0.150:9090 ... Done
    • Generate config grafana -> 172.16.0.150:3000 ... Done
    • Generate config alertmanager -> 172.16.0.150:9093 ... Done
  • Reload prometheus and grafana
    • Reload prometheus -> 172.16.0.150:9090 ... Done
    • Reload grafana -> 172.16.0.150:3000 ... Done
Scaled cluster tidb-m in successfully

This is an expected operation. The topology now shows:

[tidb@container ~]$ tiup cluster display tidb-m
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster display tidb-m
Cluster type:       tidb
Cluster name:       tidb-m
Cluster version:    v6.2.0
Deploy user:        tidb
SSH type:           builtin
Dashboard URL:      http://172.16.0.62:2379/dashboard
Grafana URL:        http://172.16.0.150:3000
ID                 Role          Host          Ports                            OS/Arch       Status     Data Dir                           Deploy Dir
172.16.0.150:9093  alertmanager  172.16.0.150  9093/9094                        linux/x86_64  Up         /home/tidb/data/alertmanager-9093  /home/tidb/deploy/alertmanager-9093
172.16.0.150:3000  grafana       172.16.0.150  3000                             linux/x86_64  Up         -                                  /home/tidb/deploy/grafana-3000
172.16.0.61:2379   pd            172.16.0.61   2379/2380                        linux/x86_64  Up         /home/tidb/data/pd-2379            /home/tidb/deploy/pd-2379
172.16.0.62:2379   pd            172.16.0.62   2379/2380                        linux/x86_64  Up|UI      /home/tidb/data/pd-2379            /home/tidb/deploy/pd-2379
172.16.0.63:2379   pd            172.16.0.63   2379/2380                        linux/x86_64  Up|L       /home/tidb/data/pd-2379            /home/tidb/deploy/pd-2379
172.16.0.150:9090  prometheus    172.16.0.150  9090/12020                       linux/x86_64  Up         /data/prometheus-9090              /home/tidb/deploy/prometheus-9090
172.16.0.150:4000  tidb          172.16.0.150  4000/10080                       linux/x86_64  Up         -                                  /home/tidb/deploy/tidb-4000
172.16.0.150:4001  tidb          172.16.0.150  4001/10081                       linux/x86_64  Up         -                                  /home/tidb/deploy/tidb-4001
172.16.0.64:9000   tiflash       172.16.0.64   9000/8123/3930/20170/20292/8234  linux/x86_64  Up         /home/tidb/data/tiflash-9000       /home/tidb/deploy/tiflash-9000
172.16.0.65:9000   tiflash       172.16.0.65   9000/8123/3930/20170/20292/8234  linux/x86_64  Up         /home/tidb/data/tiflash-9000       /home/tidb/deploy/tiflash-9000
172.16.0.66:9000   tiflash       172.16.0.66   9000/8123/3930/20170/20292/8234  linux/x86_64  Tombstone  /home/tidb/data/tiflash-9000       /home/tidb/deploy/tiflash-9000
172.16.0.71:20160  tikv          172.16.0.71   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.72:20160  tikv          172.16.0.72   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.73:20160  tikv          172.16.0.73   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.74:20160  tikv          172.16.0.74   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.75:20160  tikv          172.16.0.75   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
172.16.0.76:20160  tikv          172.16.0.76   20160/20180                      linux/x86_64  Up         /home/tidb/data/tikv-20160         /home/tidb/deploy/tikv-20160
Total nodes: 17
There are some nodes can be pruned:
  Nodes: [172.16.0.66:3930]
  You can destroy them with the command: tiup cluster prune tidb-m

We can see the Tombstone node.

Then run the prune:

[tidb@container ~]$ tiup cluster prune tidb-m
tiup is checking updates for component cluster ...
Starting component cluster: /home/tidb/.tiup/components/cluster/v1.10.3/tiup-cluster prune tidb-m

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/tidb-m/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.72
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.61
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.73
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.62
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.75
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.76
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.74
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.64
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.65
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.66
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.63
  • [Parallel] - UserSSH: user=tidb, host=172.16.0.71
  • [ Serial ] - FindTomestoneNodes
Will destroy these nodes: [172.16.0.66:3930]
Do you confirm this action? [y/N]:(default=N) y
Start destroy Tombstone nodes: [172.16.0.66:3930] ...
  • [ Serial ] - ClusterOperate: operation=ScaleInOperation, options={Roles:[] Nodes:[] Force:true SSHTimeout:5 OptTimeout:120 APITimeout:300 IgnoreConfigCheck:true NativeSSH:false SSHType: Concurrency:5 SSHProxyHost: SSHProxyPort:22 SSHProxyUser:tidb SSHProxyIdentity:/home/tidb/.ssh/id_rsa SSHProxyUsePassword:false SSHProxyTimeout:5 CleanupData:false CleanupLog:false CleanupAuditLog:false RetainDataRoles:[] RetainDataNodes:[] ShowUptime:false DisplayMode:default Operation:StartOperation}
Stopping component tiflash
Stopping instance 172.16.0.66
Stop tiflash 172.16.0.66:9000 success
Destroying component tiflash
Destroying instance 172.16.0.66
Destroy 172.16.0.66 success
  • Destroy tiflash paths: [/home/tidb/data/tiflash-9000 /home/tidb/deploy/tiflash-9000/log /home/tidb/deploy/tiflash-9000 /etc/systemd/system/tiflash-9000.service]
Stopping component node_exporter
Stopping instance 172.16.0.66
Stop 172.16.0.66 success
Stopping component blackbox_exporter
Stopping instance 172.16.0.66
Stop 172.16.0.66 success
Destroying monitored 172.16.0.66
Destroying instance 172.16.0.66
Destroy monitored on 172.16.0.66 success
Delete public key 172.16.0.66
Delete public key 172.16.0.66 success
  • [ Serial ] - UpdateMeta: cluster=tidb-m, deleted='172.16.0.66:3930'
  • [ Serial ] - UpdateTopology: cluster=tidb-m
  • Refresh instance configs
    • Generate config pd -> 172.16.0.61:2379 ... Done
    • Generate config pd -> 172.16.0.62:2379 ... Done
    • Generate config pd -> 172.16.0.63:2379 ... Done
    • Generate config tikv -> 172.16.0.71:20160 ... Done
    • Generate config tikv -> 172.16.0.72:20160 ... Done
    • Generate config tikv -> 172.16.0.73:20160 ... Done
    • Generate config tikv -> 172.16.0.74:20160 ... Done
    • Generate config tikv -> 172.16.0.75:20160 ... Done
    • Generate config tikv -> 172.16.0.76:20160 ... Done
    • Generate config tidb -> 172.16.0.150:4000 ... Done
    • Generate config tidb -> 172.16.0.150:4001 ... Done
    • Generate config tiflash -> 172.16.0.64:9000 ... Done
    • Generate config tiflash -> 172.16.0.65:9000 ... Done
    • Generate config tiflash -> 172.16.0.66:9000 ... Error
    • Generate config prometheus -> 172.16.0.150:9090 ... Done
    • Generate config grafana -> 172.16.0.150:3000 ... Done
    • Generate config alertmanager -> 172.16.0.150:9093 ... Done
  • Reload prometheus and grafana
    • Reload prometheus -> 172.16.0.150:9090 ... Done
    • Reload grafana -> 172.16.0.150:3000 ... Done
Destroy success

There was an error message:

  • Generate config tiflash -> 172.16.0.66:9000 ... Error
  2. What did you expect to see?

Expected: 172.16.0.66:9000 has already been scaled in, so it should not go through the "Generate config" step at all.

  3. What did you see instead?

This reproduces 100% of the time. I scaled in the node on port 9000, but tiup destroyed port 3930 ("Will destroy these nodes: [172.16.0.66:3930]") and then generated a config for the port-9000 instance.
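
For illustration only, here is a minimal Go sketch (the `instance` type and `removeByID` helper are invented for this example, not TiUP's real code) of how a deletion keyed on 172.16.0.66:3930 can fail to match an instance that the metadata stores as 172.16.0.66:9000, leaving it in the topology so the later "Refresh instance configs" step still targets the destroyed node:

```go
package main

import "fmt"

// instance is a stand-in for a topology entry; its ID uses the main TCP port
// (9000 for TiFlash), which is what `tiup cluster display` shows.
type instance struct {
	host string
	port int
}

func (i instance) ID() string { return fmt.Sprintf("%s:%d", i.host, i.port) }

// removeByID drops the instances whose ID appears in the deleted list.
func removeByID(instances []instance, deleted []string) []instance {
	del := make(map[string]bool, len(deleted))
	for _, id := range deleted {
		del[id] = true
	}
	var kept []instance
	for _, inst := range instances {
		if !del[inst.ID()] {
			kept = append(kept, inst)
		}
	}
	return kept
}

func main() {
	topo := []instance{
		{"172.16.0.64", 9000},
		{"172.16.0.65", 9000},
		{"172.16.0.66", 9000},
	}

	// prune reported: UpdateMeta ... deleted='172.16.0.66:3930' (service port).
	after := removeByID(topo, []string{"172.16.0.66:3930"})

	// 172.16.0.66:9000 never matches the deleted ID, so it survives and the
	// config refresh still tries to generate a config for it, which fails
	// because the deploy directory has already been destroyed.
	for _, inst := range after {
		fmt.Println("will generate config for", inst.ID())
	}
}
```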

  4. What version of TiUP are you using (tiup --version)?

1.10.2 tiup
Go Version: go1.18.3
Git Ref: v1.10.2
GitHash: 2de5b500c9fae6d418fa200ca150b8d5264d6b19

Minorli · Sep 08 '22 03:09

I found the same issue. Looking at the code at

  • https://github.com/pingcap/tiup/blob/master/pkg/cluster/operation/destroy.go#L588C38-L588C54

TiFlash is passing FlashServicePort as the ID, which is puzzling, yet the prune operation still succeeds.
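
For context, a minimal sketch of the mismatch described above (the `tiflashSpec` type and the two helper functions are illustrative, not TiUP's actual code): deriving the node ID from FlashServicePort in one code path and from the main TCP port in another produces two different IDs for the same instance, which matches the 9000 vs. 3930 discrepancy in the logs.

```go
package main

import "fmt"

// tiflashSpec mirrors the ports a TiFlash instance exposes in this cluster.
type tiflashSpec struct {
	Host             string
	TCPPort          int // 9000, the port shown by `tiup cluster display`
	FlashServicePort int // 3930, TiFlash's flash service port
}

// displayID builds the ID the way display/scale-in refer to the node.
func displayID(s tiflashSpec) string {
	return fmt.Sprintf("%s:%d", s.Host, s.TCPPort)
}

// pruneID mimics a path that builds the ID from FlashServicePort instead.
func pruneID(s tiflashSpec) string {
	return fmt.Sprintf("%s:%d", s.Host, s.FlashServicePort)
}

func main() {
	s := tiflashSpec{Host: "172.16.0.66", TCPPort: 9000, FlashServicePort: 3930}
	fmt.Println("display / scale-in ID:", displayID(s)) // 172.16.0.66:9000
	fmt.Println("prune / destroy ID:   ", pruneID(s))   // 172.16.0.66:3930
}
```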

BrahmaMantra · Aug 06 '25 02:08