sofa-jraft

Nacos-Server jraft initialization fails, leaving instance counts inconsistent across cluster nodes; restarting the node does not recover it, and in the end we could only delete the data directory

Open guozongkang opened this issue 1 month ago • 6 comments

Cluster environment: 3 Aliyun ECS nodes (16C / 32G). Nacos-Server version: 2.1.2

Symptoms: The three Nacos-Server nodes had been running normally for about half a month when one of them had to be restarted because of a memory problem. We will call it node 1, and the other two nodes 2 and 3. We restarted node 1 by running the shutdown script under bin and then the startup script under bin, and that is when the problem appeared. In the Nacos console, node 1 showed 45 instances for a certain service, while nodes 2 and 3 showed 65 instances for the same service (we later confirmed that 65 is the correct number). In other words, node 1's data was wrong. Checking the logs, we found errors in alipay-jraft.log:

```
2024-06-19 00:16:35,087 WARN Node <naming_persistent_service/10.254.16.7:7848> RequestVote to 10.254.18.46:7848 error: Status[EINTERNAL<1004>: RPC exception:UNKNOWN].
2024-06-19 00:16:35,707 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:35,710 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:35,707 WARN Node <naming_persistent_service_v2/10.254.16.7:7848> RequestVote to 10.254.18.46:7848 error: Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception].
2024-06-19 00:16:35,710 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:16:38,277 WARN Fail to issue RPC to 10.254.18.46:7848, consecutiveErrorTimes=11, error=Status[ENOENT<1012>: Peer id not found: 10.254.18.46:7848, group: naming_service_metadata]
2024-06-19 00:18:21,216 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service]
2024-06-19 00:18:21,264 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_service_metadata]
2024-06-19 00:18:21,266 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service_v2]
2024-06-19 00:18:26,139 WARN Node <naming_instance_metadata/10.254.16.7:7848> RequestVote to 10.254.17.172:7848 error: Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception].
2024-06-19 00:18:26,326 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:26,328 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:26,336 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:28,668 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:31,188 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=11, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,360 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,385 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:31,388 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:33,710 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=21, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,225 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,400 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,424 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:36,449 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=31, error=Status[EINTERNAL<1004>: RPC exception:UNAVAILABLE: io exception]
2024-06-19 00:18:38,786 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[EINTERNAL<1004>: Check connection[10.254.17.172:7848] fail and try to create new one]
2024-06-19 00:18:41,462 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_service_metadata]
2024-06-19 00:18:41,477 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=41, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_persistent_service_v2]
2024-06-19 00:18:41,530 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=51, error=Status[ENOENT<1012>: Peer id not found: 10.254.17.172:7848, group: naming_instance_metadata]
2024-06-19 00:19:36,094 WARN ThreadId: Replicator [state=Destroyed, statInfo=<running=IDLE, firstLogIndex=171, lastLogIncluded=0, lastLogIndex=171, lastTermIncluded=0>, peerId=10.254.18.46:7848, waitId=2, type=Follower] already destroyed, ignore error code: 1001
2024-06-19 00:19:36,143 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499983956s. [remote_addr=10.254.17.172/10.254.17.172:7848]]
2024-06-19 00:19:36,272 WARN Fail to issue RPC to 10.254.17.172:7848, consecutiveErrorTimes=1, error=Status[EINTERNAL<1004>: RPC exception:DEADLINE_EXCEEDED: deadline exceeded after 2.499984812s. [remote_addr=10.254.17.172/10.254.17.172:7848]]
2024-06-19 00:19:36,303 WARN ThreadId: Replicator [state=Destroyed, statInfo=<running=IDLE, firstLogIndex=3446087, lastLogIncluded=0, lastLogIndex=3446087, lastTermIncluded=0>, peerId=10.254.18.46:7848, waitId=270, type=Follower] already destroyed, ignore error code: 1001
2024-06-19 00:19:36,501 WARN ThreadId: Replicator [state=Destroyed, statInfo=<running=IDLE, firstLogIndex=72, lastLogIncluded=0, lastLogIndex=72, lastTermIncluded=0>, peerId=10.254.18.46:7848, waitId=2, type=Follower] already destroyed, ignore error code: 1001
[admin@b01_nacos_service_test_hk logs]$ cat alipay-jraft.log|grep ERROR
2024-06-19 00:16:35,666 ERROR Fail to connect 10.254.18.46:7848, remoting exception: java.util.concurrent.TimeoutException.
2024-06-19 00:18:26,134 ERROR Fail to connect 10.254.17.172:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2024-06-19 00:18:26,165 ERROR Fail to connect 10.254.17.172:7848, remoting exception: java.util.concurrent.ExecutionException: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception.
2024-06-19 00:18:26,165 ERROR Fail to init sending channel to 10.254.17.172:7848.
2024-06-19 00:18:26,165 ERROR Fail to start replicator to peer=10.254.17.172:7848, replicatorType=Follower.
2024-06-19 00:18:26,165 ERROR Fail to add a replicator, peer=10.254.17.172:7848.
```

The errors in protocol-raft.log are:

```
2024-06-19 00:16:35,175 ERROR Fail to refresh route configuration for group : naming_service_metadata, status is : Status[UNKNOWN<-1>: io.grpc.StatusRuntimeException: UNKNOWN]
2024-06-19 00:18:21,467 ERROR Fail to refresh leader for group : naming_instance_metadata, status is : Status[UNKNOWN<-1>: Unknown leader, No nodes in group naming_instance_metadata, Unknown leader]
2024-06-19 00:18:21,469 ERROR Fail to refresh route configuration for group : naming_instance_metadata, status is : Status[ENOENT<1012>: Fail to find node 10.254.17.172:7848 in group naming_instance_metadata]
```
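Aside from the console, a discrepancy like the 45-versus-65 instance count above can be double-checked by querying the naming open API on each node and counting the hosts returned. This is only a sketch: the node IPs are the ones from this report, while the service name, port 8848, and the counting approach (one `"ip"` key per host object in the `/nacos/v1/ns/instance/list` response) are assumptions to adapt to your setup.

```shell
# Rough host count for an instance-list JSON payload, by counting "ip"
# fields. Assumes each host object carries exactly one "ip" key.
count_instances() {
    grep -o '"ip"' | wc -l | tr -d ' '
}

# Usage sketch (uncomment and adjust to run against a real cluster):
# SERVICE=my-service
# for NODE in 10.254.16.7 10.254.17.172 10.254.18.46; do
#     n=$(curl -s "http://$NODE:8848/nacos/v1/ns/instance/list?serviceName=$SERVICE" | count_instances)
#     echo "$NODE reports $n instances"
# done
```

If the per-node counts diverge, the node reporting fewer instances is serving stale Raft state, as node 1 did here.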

We shut node 1 down for 10 minutes and then restarted it, but the problem persisted. Searching the community issues, we found an earlier report very similar to ours, where the fix was to delete the data directory and restart. We did the same, and it did resolve the problem, but how can we avoid this situation in the first place?
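For reference, the workaround described above can be sketched as a small script. `NACOS_HOME` and the exact data path are assumptions (Nacos 2.x normally keeps the jraft groups' logs and snapshots under `data/protocol/raft`, so removing only that subtree is narrower than deleting the whole data directory; verify the layout for your install). Note that this deletes the node's local Raft state, so the node must be able to re-sync from healthy peers, and it should only ever be run on one unhealthy node at a time.

```shell
# Sketch of the recovery workaround: stop the node, wipe its local Raft
# state, then restart so it re-syncs from the remaining peers.
# CAUTION: destructive; never run on a majority of nodes at once.
recover_node() {
    nacos_home="$1"
    # Stop the node using the standard Nacos scripts.
    sh "$nacos_home/bin/shutdown.sh"
    # Assumed layout: Nacos 2.x keeps jraft logs/snapshots here.
    rm -rf "$nacos_home/data/protocol/raft"
    # Restart in cluster mode; the node rejoins and re-syncs.
    sh "$nacos_home/bin/startup.sh" -m cluster
}

# Example (adjust the path first):
# recover_node /opt/nacos
```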

guozongkang · Jun 26 '24 12:06