Failed to drop node, status always shows dropping
I built a cluster and added two nodes:
taos> show dnodes;
id | end_point | vnodes | cores | status | role | create_time | offline reason |
======================================================================================================================================
1 | c1:6030 | 2 | 4 | ready | any | 2021-08-09 07:22:56.887 | |
3 | c2:6030 | 1 | 4 | ready | any | 2021-08-09 08:45:27.160 | |
0 | arbitrator:6030 | 0 | 0 | offline | arb | 2021-08-09 10:01:20.084 | - |
Query OK, 3 row(s) in set (0.001770s)
I created a database with the replica count set to 2:
create database t1 replica 2
I put some data into it. Then I stopped the service on node c2:6030 (to simulate a server crash).
Trying to drop the node returned an error:
taos> drop dnode "c2:6030";
DB error: Out of DNodes (0.000533s)
I then tried changing the database's replica count to 1:
taos> ALTER DATABASE syslogmd REPLICA 1;
Query OK, 0 of 0 row(s) in database (1.225641s)
Then I tried dropping the node again:
taos> drop dnode "c2:6030";
Query OK, 0 of 0 row(s) in database (0.000672s)
No error this time.
Checking the dnode status:
taos> show dnodes;
id | end_point | vnodes | cores | status | role | create_time | offline reason |
======================================================================================================================================
1 | c1:6030 | 2 | 4 | ready | any | 2021-08-09 07:22:56.887 | |
3 | c2:6030 | 1 | 4 | dropping | any | 2021-08-09 08:45:27.160 | status not received |
0 | arbitrator:6030 | 0 | 0 | offline | arb | 2021-08-09 11:08:10.483 | - |
Query OK, 3 row(s) in set (0.001164s)
After that the node stays in the dropping state indefinitely. Starting the c2:6030 node again does not help, and neither does restarting c1:6030.
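For anyone reproducing this, a minimal Python sketch for spotting nodes stuck in dropping from pasted `show dnodes;` output. It only assumes the pipe-separated layout shown above; `parse_dnodes` and `stuck_dropping` are hypothetical helper names, not part of any TDengine tool.

```python
SAMPLE = """\
taos> show dnodes;
id | end_point | vnodes | cores | status | role | create_time | offline reason |
======================================================================
1 | c1:6030 | 2 | 4 | ready | any | 2021-08-09 07:22:56.887 | |
3 | c2:6030 | 1 | 4 | dropping | any | 2021-08-09 08:45:27.160 | status not received |
0 | arbitrator:6030 | 0 | 0 | offline | arb | 2021-08-09 11:08:10.483 | - |
Query OK, 3 row(s) in set (0.001164s)
"""


def parse_dnodes(output):
    """Turn the pipe-separated rows of `show dnodes;` into a list of dicts."""
    header = None
    rows = []
    for line in output.splitlines():
        if "|" not in line:
            continue  # prompt, separator and "Query OK" lines carry no cells
        cells = [cell.strip() for cell in line.split("|")]
        if cells and cells[-1] == "":
            cells = cells[:-1]  # each row ends with '|', which leaves an empty tail
        if header is None:
            header = cells  # first pipe-separated line is the column header
            continue
        rows.append(dict(zip(header, cells)))
    return rows


def stuck_dropping(rows):
    """Return the end points whose status column reads 'dropping'."""
    return [row["end_point"] for row in rows if row["status"] == "dropping"]


rows = parse_dnodes(SAMPLE)
print(stuck_dropping(rows))
```

Running the check repeatedly over time (for example from a cron job that captures the shell output) would show whether the status ever leaves dropping.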
Some error logs:
08/09 11:08:40.717213 00003393 SYN vgId:2, nodeId:0, TCP link is broken since Success, pfd:40 sfd:-1
08/09 11:08:40.717218 00003393 SYN vgId:2, nodeId:0, restart peer connection, last sstatus:init
08/09 11:08:40.717222 00003393 SYN vgId:2, nodeId:0, pfd:-1 sfd:-1 will be closed
08/09 11:08:40.717226 00003393 SYN vgId:2, nodeId:0, peer conn is restart and set sstatus:init
08/09 11:08:40.717229 00003393 SYN vgId:2, nodeId:0, check peer connection in 1000 ms
08/09 11:08:40.717235 00003393 SYN vgId:2, nodeId:0, peer role:unsynced change to offline
08/09 11:08:40.717239 00003393 SYN vgId:2, peer:vgId:2, nodeId:1 is master, index:0
08/09 11:08:40.717242 00003393 SYN vgId:2, nodeId:1, it is the master, replica:1 sver:787
08/09 11:08:40.717246 00003393 SYN vgId:2, roles changed, broadcast status, replica:1
08/09 11:08:40.717252 00003393 SYN 0x7f3450365720 fd:40 is removed from epoll thread, num:1
08/09 11:08:41.071643 00003374 MND vgId:3, replica:1 numOfVnodes:2, try remove one vnode
08/09 11:08:41.221637 00003374 MND vgId:3, replica:1 numOfVnodes:2, try remove one vnode
08/09 11:08:41.334033 00003394 SYN vgId:1, nodeId:3, status is received, self:master:init:35, peer:slave:35, ack:1 tranId:48947 type:broadcast pfd:31
08/09 11:08:41.334053 00003394 SYN vgId:1, nodeId:3, peer role:slave change to slave
08/09 11:08:41.334058 00003394 SYN vgId:1, peer:vgId:1, nodeId:1 is master, index:0
08/09 11:08:41.334063 00003394 SYN vgId:1, nodeId:1, it is the master, replica:2 sver:35
08/09 11:08:41.334096 00003394 SYN vgId:1, nodeId:3, status is sent, self:master:init:35, peer:slave:init:35, ack:0 tranId:48947 type:broadcast-rsp pfd:31
08/09 11:08:41.334307 00003393 SYN vgId:3, nodeId:3, status is received, self:master:init:4, peer:slave:4, ack:1 tranId:5605 type:broadcast pfd:34
08/09 11:08:41.334326 00003393 SYN vgId:3, nodeId:3, peer role:slave change to slave
08/09 11:08:41.334331 00003393 SYN vgId:3, peer:vgId:3, nodeId:1 is master, index:1
08/09 11:08:41.334336 00003393 SYN vgId:3, nodeId:1, it is the master, replica:2 sver:4
08/09 11:08:41.334408 00003393 SYN vgId:3, nodeId:3, status is sent, self:master:init:4, peer:slave:init:4, ack:0 tranId:5605 type:broadcast-rsp pfd:34
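Reading the log above: the mnode reports vgId:3 with replica:1 but numOfVnodes:2 and keeps trying to remove a vnode, while the sync layer still reports replica:2 for vgId:1 and vgId:3. A minimal Python sketch that pulls those numbers out of a log excerpt is below. The regular expressions are written only against the lines shown here, so the format is an assumption, and the result is a reading of the log, not a confirmed diagnosis.

```python
import re

LOG = """\
08/09 11:08:40.717242 00003393 SYN vgId:2, nodeId:1, it is the master, replica:1 sver:787
08/09 11:08:41.071643 00003374 MND vgId:3, replica:1 numOfVnodes:2, try remove one vnode
08/09 11:08:41.334063 00003394 SYN vgId:1, nodeId:1, it is the master, replica:2 sver:35
08/09 11:08:41.334336 00003393 SYN vgId:3, nodeId:1, it is the master, replica:2 sver:4
"""

# Patterns derived from the excerpt above (assumed format, not an official spec).
SYNC_RE = re.compile(r"SYN vgId:(\d+), nodeId:\d+, it is the master, replica:(\d+)")
MND_RE = re.compile(r"MND vgId:(\d+), replica:(\d+) numOfVnodes:(\d+)")


def summarize(log_text):
    """Collect the replica count per vgroup as seen by the sync layer and the mnode."""
    sync_replica = {}  # vgId -> replica reported on the master's SYN line
    mnode_view = {}    # vgId -> (replica wanted, vnodes currently present)
    for line in log_text.splitlines():
        match = SYNC_RE.search(line)
        if match:
            sync_replica[int(match.group(1))] = int(match.group(2))
            continue
        match = MND_RE.search(line)
        if match:
            mnode_view[int(match.group(1))] = (int(match.group(2)), int(match.group(3)))
    return sync_replica, mnode_view


def pending_removals(mnode_view):
    """vgroups where the mnode still sees more vnodes than the target replica count."""
    return sorted(vg for vg, (want, have) in mnode_view.items() if have > want)


sync_replica, mnode_view = summarize(LOG)
print(sync_replica, mnode_view, pending_removals(mnode_view))
```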
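Our company's product is HA, and we want to create two nodes deployed on the HA master and slave respectively. We are currently doing a technical evaluation. In our HA setup the pairing can be torn down even when the other node has already crashed. I am not sure whether TDengine can do the same.
I also tried letting the offline timeout trigger automatic removal, and the node again stays in the dropping state: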
taos> show dnodes;
id | end_point | vnodes | cores | status | role | create_time | offline reason |
======================================================================================================================================
1 | c1:6030 | 2 | 4 | ready | any | 2021-08-10 03:36:53.400 | |
2 | c2:6030 | 1 | 4 | dropping | any | 2021-08-10 03:37:33.678 | status msg timeout |
0 | arbitrator:6042 | 0 | 0 | ready | arb | 2021-08-11 01:39:52.877 | - |
taos> select server_version();
server_version() |
===================
2.1.3.2 |
Query OK, 1 row(s) in set (0.000867s)
Logs after the automatic removal was triggered:
08/11 02:05:38.381525 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:38.991500 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:39.391522 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:39.996554 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:40.396565 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:41.006674 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:41.406666 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:42.011583 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:42.411529 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:43.021582 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:43.421528 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:44.026536 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:44.426507 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:45.036522 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:45.436472 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:46.041508 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:46.441511 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:47.046600 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:47.451452 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:48.051511 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:48.456437 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:49.056490 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:49.466640 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:50.061543 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:50.471493 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:51.066671 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:51.481552 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:52.076554 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:52.486519 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
08/11 02:05:53.081516 00005249 UTL ERROR failed to connect socket, ip:0x1b78570a, port:6040(connect host error)
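The `ip:0x1b78570a` field in these lines is easier to read as a dotted address. A minimal Python sketch for decoding it follows. It assumes the value is an IPv4 address in network byte order that was printed as an integer on a little-endian host, which is a guess based on the result looking like a private address; `decode_ip` is a hypothetical helper name.

```python
import re
import socket
import struct

LINE = ("08/11 02:05:38.381525 00005249 UTL ERROR failed to connect socket, "
        "ip:0x1b78570a, port:6040(connect host error)")


def decode_ip(hex_ip):
    """Decode a hex IPv4 value, assuming it was printed on a little-endian host."""
    value = int(hex_ip, 16)
    # Packing as little-endian restores the original network-order byte sequence.
    return socket.inet_ntoa(struct.pack("<I", value))


def target_of(line):
    """Extract (dotted address, port) from a 'failed to connect socket' line."""
    match = re.search(r"ip:(0x[0-9a-fA-F]+), port:(\d+)", line)
    if match is None:
        return None
    return decode_ip(match.group(1)), int(match.group(2))


print(target_of(LINE))
```

If the decoded address belongs to the stopped c2 node, the repeated errors would simply be c1 retrying a peer that is down.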
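Following on from the second post, I then restarted the taosd service on node c1:6030; note that this is the node that was ready. The restart went fine and the service status looks normal.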
[root@c1 log]# service taosd status
Redirecting to /bin/systemctl status taosd.service
● taosd.service - TDengine server service
Loaded: loaded (/etc/systemd/system/taosd.service; enabled; vendor preset: disabled)
Drop-In: /run/systemd/system/taosd.service.d
└─zzz-lxc-service.conf
Active: active (running) since Wed 2021-08-11 02:08:11 UTC; 11s ago
Process: 5354 ExecStartPre=/usr/local/taos/bin/startPre.sh (code=exited, status=0/SUCCESS)
Main PID: 5360 (taosd)
CGroup: /system.slice/taosd.service
└─5360 /usr/bin/taosd
Aug 11 02:08:11 c1 systemd[1]: Starting TDengine server service...
Aug 11 02:08:11 c1 systemd[1]: Started TDengine server service.
Aug 11 02:08:11 c1 TDengine:[5360]: Starting TDengine service...
Aug 11 02:08:11 c1 TDengine:[5360]: Started TDengine service successfully.
But at this point the output of show dnodes; changed to:
taos> show dnodes;
id | end_point | vnodes | cores | status | role | create_time | offline reason |
======================================================================================================================================
1 | c1:6030 | 2 | 4 | offline | any | 2021-08-10 03:36:53.400 | offThreshold not match |
2 | c2:6030 | 1 | 4 | dropping | any | 2021-08-10 03:37:33.678 | status not received |
0 | arbitrator:6042 | 0 | 0 | ready | arb | 2021-08-11 02:06:56.105 | - |
Query OK, 3 row(s) in set (0.000948s)
The node that was ready is now offline.
The final result is:
taos> show dnodes;
id | end_point | vnodes | cores | status | role | create_time | offline reason |
======================================================================================================================================
1 | c1:6030 | 2 | 4 | dropping | any | 2021-08-10 03:36:53.400 | status not received |
2 | c2:6030 | 1 | 4 | dropping | any | 2021-08-10 03:37:33.678 | status not received |
0 | arbitrator:6042 | 0 | 0 | ready | arb | 2021-08-11 02:08:11.246 | - |
Query OK, 3 row(s) in set (0.002030s)
Is there no way to recover? Is the only option to wipe the data directory and start over?
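Hello, for this test procedure we need to understand how your data is distributed before we can draw a conclusion. You can add WeChat 15652223354, and we can discuss further once we understand the scenario in detail.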
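@yu285 @korimas How was this issue resolved in the end? One of my nodes is also stuck like this. It does not affect usage, but the node looks as if it has hung.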
Version 2.0 is no longer maintained. If you have a specific problem, you can add WeChat a15652223354 to discuss it.