
Failed to drop dnode: it always shows "dropping"

Open korimas opened this issue 4 years ago • 4 comments

I built a cluster and added two dnodes:

taos> show dnodes;
   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
======================================================================================================================================
      1 | c1:6030                        |      2 |      4 | ready      | any   | 2021-08-09 07:22:56.887 |                          |
      3 | c2:6030                        |      1 |      4 | ready      | any   | 2021-08-09 08:45:27.160 |                          |
      0 | arbitrator:6030                |      0 |      0 | offline    | arb   | 2021-08-09 10:01:20.084 | -                        |
Query OK, 3 row(s) in set (0.001770s)

Created a database with the replica count set to 2:

create database t1 replica 2;

I inserted some data, then stopped the taosd service on node c2:6030 (to simulate a server crash).

Trying to drop the dnode returned an error:

taos> drop dnode "c2:6030";

DB error: Out of DNodes (0.000533s)

I then tried changing the database replica count to 1:

taos> ALTER DATABASE syslogmd REPLICA 1;
Query OK, 0 of 0 row(s) in database (1.225641s)

Then tried dropping the dnode again:

taos> drop dnode "c2:6030";
Query OK, 0 of 0 row(s) in database (0.000672s)

No error this time.

Checking dnode status:

taos> show dnodes;
   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
======================================================================================================================================
      1 | c1:6030                        |      2 |      4 | ready      | any   | 2021-08-09 07:22:56.887 |                          |
      3 | c2:6030                        |      1 |      4 | dropping   | any   | 2021-08-09 08:45:27.160 | status not received      |
      0 | arbitrator:6030                |      0 |      0 | offline    | arb   | 2021-08-09 11:08:10.483 | -                        |
Query OK, 3 row(s) in set (0.001164s)

It then stayed in the dropping state indefinitely. Bringing c2:6030 back up did not help, and neither did restarting c1:6030.

Some error logs:

08/09 11:08:40.717213 00003393 SYN vgId:2, nodeId:0, TCP link is broken since Success, pfd:40 sfd:-1
08/09 11:08:40.717218 00003393 SYN vgId:2, nodeId:0, restart peer connection, last sstatus:init
08/09 11:08:40.717222 00003393 SYN vgId:2, nodeId:0, pfd:-1 sfd:-1 will be closed
08/09 11:08:40.717226 00003393 SYN vgId:2, nodeId:0, peer conn is restart and set sstatus:init
08/09 11:08:40.717229 00003393 SYN vgId:2, nodeId:0, check peer connection in 1000 ms
08/09 11:08:40.717235 00003393 SYN vgId:2, nodeId:0, peer role:unsynced change to offline
08/09 11:08:40.717239 00003393 SYN vgId:2, peer:vgId:2, nodeId:1 is master, index:0
08/09 11:08:40.717242 00003393 SYN vgId:2, nodeId:1, it is the master, replica:1 sver:787
08/09 11:08:40.717246 00003393 SYN vgId:2, roles changed, broadcast status, replica:1
08/09 11:08:40.717252 00003393 SYN 0x7f3450365720 fd:40 is removed from epoll thread, num:1
08/09 11:08:41.071643 00003374 MND vgId:3, replica:1 numOfVnodes:2, try remove one vnode
08/09 11:08:41.221637 00003374 MND vgId:3, replica:1 numOfVnodes:2, try remove one vnode
08/09 11:08:41.334033 00003394 SYN vgId:1, nodeId:3, status is received, self:master:init:35, peer:slave:35, ack:1 tranId:48947 type:broadcast pfd:31
08/09 11:08:41.334053 00003394 SYN vgId:1, nodeId:3, peer role:slave change to slave
08/09 11:08:41.334058 00003394 SYN vgId:1, peer:vgId:1, nodeId:1 is master, index:0
08/09 11:08:41.334063 00003394 SYN vgId:1, nodeId:1, it is the master, replica:2 sver:35
08/09 11:08:41.334096 00003394 SYN vgId:1, nodeId:3, status is sent, self:master:init:35, peer:slave:init:35, ack:0 tranId:48947 type:broadcast-rsp pfd:31
08/09 11:08:41.334307 00003393 SYN vgId:3, nodeId:3, status is received, self:master:init:4, peer:slave:4, ack:1 tranId:5605 type:broadcast pfd:34
08/09 11:08:41.334326 00003393 SYN vgId:3, nodeId:3, peer role:slave change to slave
08/09 11:08:41.334331 00003393 SYN vgId:3, peer:vgId:3, nodeId:1 is master, index:1
08/09 11:08:41.334336 00003393 SYN vgId:3, nodeId:1, it is the master, replica:2 sver:4
08/09 11:08:41.334408 00003393 SYN vgId:3, nodeId:3, status is sent, self:master:init:4, peer:slave:init:4, ack:0 tranId:5605 type:broadcast-rsp pfd:34

Our company's product is HA-based. I want to deploy two dnodes, one on the HA master and one on the HA slave; this is currently a technical feasibility study. Our HA layer can decommission one side even after the other node has already crashed, and I would like to know whether TDengine can support that.
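For reference, the node-replacement sequence I expect to script on the HA side is sketched below, using the same 2.x SQL as above. This is only my reading of the documented procedure, with `t1` standing in for each 2-replica database and `c2:6030` as the crashed dnode:

```sql
-- Sketch of the intended HA failover sequence (assumes the surviving
-- dnode is c1:6030 and the crashed one is c2:6030):
ALTER DATABASE t1 REPLICA 1;   -- shrink every 2-replica database first
DROP DNODE "c2:6030";          -- then remove the crashed dnode
SHOW DNODES;                   -- it should leave the list, not hang in "dropping"

-- Later, once the standby host is rebuilt:
CREATE DNODE "c2:6030";        -- re-add the dnode
ALTER DATABASE t1 REPLICA 2;   -- restore redundancy
```

In this report, the sequence stalls at the `DROP DNODE` step: the dnode never leaves the "dropping" state.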

korimas avatar Aug 09 '21 11:08 korimas

I also tried letting the offline timeout trigger automatic removal; the dnode still stays stuck in the dropping state:

taos> show dnodes;
   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
======================================================================================================================================
      1 | c1:6030                        |      2 |      4 | ready      | any   | 2021-08-10 03:36:53.400 |                          |
      2 | c2:6030                        |      1 |      4 | dropping   | any   | 2021-08-10 03:37:33.678 | status msg timeout       |
      0 | arbitrator:6042                |      0 |      0 | ready      | arb   | 2021-08-11 01:39:52.877 | -                        |
taos> select server_version();
 server_version() |
===================
 2.1.3.2          |
Query OK, 1 row(s) in set (0.000867s)
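The automatic removal above is governed by `offlineThreshold` in taos.cfg, the time a dnode may stay offline before the cluster starts dropping it. A sketch of the relevant entries (values are illustrative, not my actual configuration):

```
# taos.cfg (illustrative values, not the actual configuration)
# seconds a dnode may stay offline before automatic removal kicks in
offlineThreshold 864000
# arbitrator endpoint, needed for even-replica vgroups
arbitrator arbitrator:6042
```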

Logs after the automatic removal was triggered:

08/11 02:05:38.381525 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:38.991500 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:39.391522 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:39.996554 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:40.396565 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:41.006674 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:41.406666 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:42.011583 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:42.411529 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:43.021582 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:43.421528 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:44.026536 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:44.426507 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:45.036522 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:45.436472 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:46.041508 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:46.441511 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:47.046600 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:47.451452 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:48.051511 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:48.456437 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:49.056490 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:49.466640 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:50.061543 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:50.471493 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:51.066671 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:51.481552 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:52.076554 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:52.486519 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:53.081516 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)

korimas avatar Aug 11 '21 02:08 korimas

Continuing from my second comment above, I then restarted the taosd service on node c1:6030 (note: the node that was still ready). The restart succeeded and the service status looked normal.

[root@c1 log]# service taosd status
Redirecting to /bin/systemctl status taosd.service
● taosd.service - TDengine server service
   Loaded: loaded (/etc/systemd/system/taosd.service; enabled; vendor preset: disabled)
  Drop-In: /run/systemd/system/taosd.service.d
           └─zzz-lxc-service.conf
   Active: active (running) since Wed 2021-08-11 02:08:11 UTC; 11s ago
  Process: 5354 ExecStartPre=/usr/local/taos/bin/startPre.sh (code=exited, status=0/SUCCESS)
 Main PID: 5360 (taosd)
   CGroup: /system.slice/taosd.service
           └─5360 /usr/bin/taosd

Aug 11 02:08:11 c1 systemd[1]: Starting TDengine server service...
Aug 11 02:08:11 c1 systemd[1]: Started TDengine server service.
Aug 11 02:08:11 c1 TDengine:[5360]: Starting TDengine service...
Aug 11 02:08:11 c1 TDengine:[5360]: Started TDengine service successfully.

But at this point the result of `show dnodes;` changed to:

taos> show dnodes;
   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
======================================================================================================================================
      1 | c1:6030                        |      2 |      4 | offline    | any   | 2021-08-10 03:36:53.400 | offThreshold not match   |
      2 | c2:6030                        |      1 |      4 | dropping   | any   | 2021-08-10 03:37:33.678 | status not received      |
      0 | arbitrator:6042                |      0 |      0 | ready      | arb   | 2021-08-11 02:06:56.105 | -                        |
Query OK, 3 row(s) in set (0.000948s)

The ready node had become offline.
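The reason string `offThreshold not match` suggests the restarted dnode reported an `offlineThreshold` different from what the mnode had recorded. As I read the 2.x docs, a set of parameters (including `offlineThreshold`, `statusInterval`, and `arbitrator`) must be identical on every dnode, or the dnode is rejected on reconnect. One way to compare, if the client supports it, is to run this in the taos shell on each node and diff the output:

```sql
-- Dump this dnode's effective configuration; diff the output across
-- nodes. Any mismatch in the cluster-consistency parameters can leave
-- the dnode rejected as offline after a restart.
SHOW VARIABLES;
```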

The final state was:

taos> show dnodes;
   id   |           end_point            | vnodes | cores  |   status   | role  |       create_time       |      offline reason      |
======================================================================================================================================
      1 | c1:6030                        |      2 |      4 | dropping   | any   | 2021-08-10 03:36:53.400 | status not received      |
      2 | c2:6030                        |      1 |      4 | dropping   | any   | 2021-08-10 03:37:33.678 | status not received      |
      0 | arbitrator:6042                |      0 |      0 | ready      | arb   | 2021-08-11 02:08:11.246 | -                        |
Query OK, 3 row(s) in set (0.002030s)

There seems to be no way to recover. Is wiping the data directory and starting over the only option?

korimas avatar Aug 11 '21 02:08 korimas

Hello, regarding your test procedure: we need to understand your data distribution before we can draw a conclusion. Please add WeChat 15652223354 so we can go through your scenario in detail.

yu285 avatar Aug 15 '21 00:08 yu285

@yu285 @korimas How was this resolved in the end? One of my nodes is stuck the same way. It does not affect usage, but the node looks like it is in a zombie state.

hanchao131415 avatar Jan 30 '24 03:01 hanchao131415

The 2.0 series is no longer maintained. For specific issues, please add WeChat a15652223354.

yu285 avatar Apr 30 '24 06:04 yu285