alluxio
New master joins the cluster and becomes the leader, but the Alluxio cluster stops working
Alluxio Version: ALL
Describe the bug Note: Ratis can automatically add new nodes to the group.
- When a master node is replaced and the replacement becomes the leader, followers and workers stop working because they cannot identify the new leader.
- The master addresses are statically set in the configuration. After Ratis elects a leader, `alluxio.master.embedded.journal.addresses` in the cluster configuration is not updated.
- Because followers and workers cannot identify the leader, the following exception is reported periodically:
```
2022-01-14 10:19:42,151 WARN  RetryHandlingMetaMasterMasterClient - GetId(address=xxxxx:19998) exits with exception [alluxio.exception.status.UnavailableException: Failed to determine address for MetaMasterMaster after 1 attempts] in 120001 ms (>=10000ms)
2022-01-14 10:19:42,151 ERROR MetaMasterSync - Failed to receive leader master heartbeat command.
alluxio.exception.status.UnavailableException: Failed to determine address for MetaMasterMaster after 1 attempts
    at alluxio.AbstractClient.connect(AbstractClient.java:264)
    at alluxio.AbstractClient.retryRPCInternal(AbstractClient.java:405)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:373)
    at alluxio.AbstractClient.retryRPC(AbstractClient.java:362)
    at alluxio.master.meta.RetryHandlingMetaMasterMasterClient.getId(RetryHandlingMetaMasterMasterClient.java:81)
    at alluxio.master.meta.MetaMasterSync.setIdAndRegister(MetaMasterSync.java:115)
    at alluxio.master.meta.MetaMasterSync.heartbeat(MetaMasterSync.java:71)
    at alluxio.heartbeat.HeartbeatThread.run(HeartbeatThread.java:119)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:979)
2022-01-14 10:19:42,152 WARN  SleepingTimer - Meta Master Sync last execution took 120002 ms. Longer than the interval 120000
```
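For reference, the static quorum definition in question is typically pinned in `alluxio-site.properties` along these lines (hostnames below are placeholders, not from the reporter's cluster):

```properties
# Static quorum definition, read once at startup; replacing a master
# does not update this list on already-running followers or workers.
alluxio.master.embedded.journal.addresses=master-1:19200,master-2:19200,master-3:19200
alluxio.master.rpc.addresses=master-1:19998,master-2:19998,master-3:19998
```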
To Reproduce
- An Alluxio master is replaced
- The new master becomes the leader
Expected behavior Workers and followers identify the new leader, and Alluxio continues to serve requests normally
Urgency NA
Are you planning to fix it YES
Additional context NA
@jenoudet @tcrain can you take a look?
ping @jenoudet
This is an exception I have seen happen many times, specifically around failovers (between masters of a cluster, or when a new master is added to a cluster). In my experience, it does not affect functionality and is harmless. Have you noticed functionality problems due to this error or simply that the error is written in the logs?
@jenoudet We have many machines that need to be replaced frequently.
- Every time a new master node is added, "alluxio.job.master.embedded.journal.addresses" is updated to the latest cluster configuration, so Ratis is able to participate in elections.
- The Alluxio client's request addresses are written into the configuration at startup. Therefore, the old master nodes cannot recognize the newly added master node (the new master's address is not configured in the old masters' conf).
- When the new master node is elected as the leader, Ratis works normally, but the Alluxio client cannot find the leader, so it cannot work.
The mismatch comes from the fact that we do not currently support dynamic configuration propagation for Alluxio master addresses. Ratis can and does take new masters into account, but this configuration change is not propagated to Alluxio. If you want this feature, you will have to implement dynamic configuration propagation.
@jenoudet OK, thanks~ I have a few more questions:
- Do you have plans to improve this scenario in the future, or any ideas about how to approach it?
- Our plan:
  - Have the master client update its addresses by monitoring changes to the conf hash (`RaftJournalSystem.updateGroup` updates `ServerConfiguration.sConf`, so the hash changes)
  - Add an RPC heartbeat that fetches the master node addresses from the leader and updates the client's address list
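The second idea above could be sketched roughly as follows. This is a hypothetical illustration, not Alluxio's actual API: the class, its method names, and the leader-query `Supplier` are all made up for the sake of the example, with the Ratis/RPC plumbing abstracted away.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/**
 * Hypothetical sketch: a heartbeat pulls the current master address list
 * from the leader and atomically replaces the client's cached view, so
 * a replaced master is picked up without restarting the client.
 */
public class MasterAddressRefresher {
  private final AtomicReference<List<InetSocketAddress>> mAddresses;
  private final Supplier<List<InetSocketAddress>> mLeaderQuery;

  public MasterAddressRefresher(List<InetSocketAddress> initial,
      Supplier<List<InetSocketAddress>> leaderQuery) {
    mAddresses = new AtomicReference<>(List.copyOf(initial));
    mLeaderQuery = leaderQuery;
  }

  /** One heartbeat tick: fetch the quorum view from the leader and swap it in. */
  public void heartbeat() {
    List<InetSocketAddress> latest = mLeaderQuery.get();
    if (latest != null && !latest.isEmpty()) {
      mAddresses.set(List.copyOf(latest));
    }
  }

  public List<InetSocketAddress> currentAddresses() {
    return mAddresses.get();
  }

  public static void main(String[] args) {
    List<InetSocketAddress> oldQuorum = List.of(
        InetSocketAddress.createUnresolved("master-1", 19998),
        InetSocketAddress.createUnresolved("master-2", 19998),
        InetSocketAddress.createUnresolved("master-3", 19998));
    // Pretend master-3 was replaced by master-4 and the leader now reports it.
    List<InetSocketAddress> newQuorum = List.of(
        InetSocketAddress.createUnresolved("master-1", 19998),
        InetSocketAddress.createUnresolved("master-2", 19998),
        InetSocketAddress.createUnresolved("master-4", 19998));
    MasterAddressRefresher refresher =
        new MasterAddressRefresher(oldQuorum, () -> newQuorum);
    refresher.heartbeat();
    System.out.println(refresher.currentAddresses().contains(
        InetSocketAddress.createUnresolved("master-4", 19998)));
  }
}
```

The `AtomicReference` swap keeps readers lock-free: a client thread in the middle of an RPC keeps its old snapshot, while the next attempt sees the refreshed list.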
My suggestion would be to look at the `MasterInquireClient`. Currently, an Embedded Journal deployment uses the `PollingMasterInquireClient` to poll masters to see if they are the leader. A new `RaftInquireClient` could be created using a `RaftClient` to poll the quorum for the leader information.
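The suggested polling shape might look something like the sketch below. To keep it self-contained, the Ratis-specific leader probe is abstracted behind a `Predicate`; a real `RaftInquireClient` would instead ask each peer through a Ratis `RaftClient`. The class and method names here are hypothetical, not Alluxio's actual interfaces.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/**
 * Hypothetical sketch of a RaftInquireClient: walk the quorum and
 * return the first peer that claims to be the leader. The probe is
 * injected so this example runs without a Ratis dependency.
 */
public class RaftInquireClientSketch {
  private final List<InetSocketAddress> mQuorum;
  private final Predicate<InetSocketAddress> mIsLeader;

  public RaftInquireClientSketch(List<InetSocketAddress> quorum,
      Predicate<InetSocketAddress> isLeader) {
    mQuorum = quorum;
    mIsLeader = isLeader;
  }

  /** Polls every peer in the quorum; empty if no peer claims leadership. */
  public Optional<InetSocketAddress> getPrimaryRpcAddress() {
    for (InetSocketAddress peer : mQuorum) {
      if (mIsLeader.test(peer)) {
        return Optional.of(peer);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    RaftInquireClientSketch client = new RaftInquireClientSketch(
        List.of(InetSocketAddress.createUnresolved("master-1", 19200),
            InetSocketAddress.createUnresolved("master-2", 19200),
            InetSocketAddress.createUnresolved("master-3", 19200)),
        peer -> peer.getHostName().equals("master-2")); // stand-in for a Ratis probe
    System.out.println(client.getPrimaryRpcAddress()
        .map(InetSocketAddress::getHostName).orElse("none"));
  }
}
```

The key difference from the existing `PollingMasterInquireClient` pattern is that the probe would go through the Raft quorum itself, so a master added via Ratis is discoverable even if it never appeared in the client's static configuration.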
@ccy00808 I would suggest putting the hostname in the configuration instead of a static IP address. In Kubernetes, although the IP of the master pod changes, the hostname doesn't. Other pods should be able to find the new pod via its hostname.
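Concretely, StatefulSet pods get stable DNS names of the form `<pod>.<headless-service>.<namespace>.svc.cluster.local`, which survive pod replacement even though the pod IP changes. A hedged example, assuming a StatefulSet named `alluxio-master` behind a headless service `alluxio-master-svc` in namespace `alluxio` (all placeholder names):

```properties
# Stable pod DNS names instead of IPs; these keep resolving to the
# replacement pod after a master is rescheduled.
alluxio.master.embedded.journal.addresses=\
alluxio-master-0.alluxio-master-svc.alluxio.svc.cluster.local:19200,\
alluxio-master-1.alluxio-master-svc.alluxio.svc.cluster.local:19200,\
alluxio-master-2.alluxio-master-svc.alluxio.svc.cluster.local:19200
```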
@ccy00808 Any updates on this issue? Are you still encountering the problem?
We synced offline. It's not a problem any more.