swarm Tracker becomes non-responsive

The Tracker gets into a non-responsive state in swarm version 3.0.5 under the following circumstances:

1. The Tracker has messages in its message queue which trigger a broadcast when handled.
2. One or more nodes go down.
3. The handler calls :rpc.sbcast, which tries to send a message to all nodes, including the down nodes, and therefore only returns after the sends to the dead nodes have timed out. This continues until the nodedown messages are handled.

Our setup is a Kubernetes cluster, where we have observed timeouts of 3-6 seconds before a node is detected as down. This makes the Tracker non-responsive until the nodedown message is handled, which can take a long time.

A hotfix for this, until it is resolved, could be to call :rpc.abcast instead, since the info about the bad nodes is never really used anyway. Are there any issues with this approach? I can't see that it makes any difference other than missing the warnings about the bad nodes.
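To make the difference concrete, here is a minimal sketch of the two calls side by side. The module and function names (BroadcastSketch, sync_broadcast, async_broadcast) are made up for illustration and are not Swarm's actual code; only :rpc.sbcast/3 and :rpc.abcast/3 are real.

```elixir
defmodule BroadcastSketch do
  # Hypothetical helper module, not part of Swarm.

  # Synchronous broadcast: blocks until every node has either acknowledged
  # the send or been found unreachable. Returns {good_nodes, bad_nodes}.
  def sync_broadcast(nodes, name, msg) do
    :rpc.sbcast(nodes, name, msg)
  end

  # Asynchronous broadcast: fire-and-forget, returns immediately.
  # There is no information about which nodes did not get the message.
  def async_broadcast(nodes, name, msg) do
    :abcast = :rpc.abcast(nodes, name, msg)
    :ok
  end
end
```

With the second variant the caller never waits on a dead node, at the cost of losing the bad_nodes list that sbcast returns.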

A fix for this could be to look for nodedown messages in the message queue when bad nodes are discovered, and then handle those nodedown messages accordingly. I'm not sure this is the way to do it, just a thought.
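Roughly what I have in mind, as a sketch only. MailboxSketch and its functions are hypothetical names, and I'm assuming the nodedown messages have the {:nodedown, node} or {:nodedown, node, info} shape that :net_kernel.monitor_nodes delivers. Note that a selective receive like this consumes the messages, so the caller would have to run the normal nodedown handling for the returned nodes itself.

```elixir
defmodule MailboxSketch do
  # Hypothetical helper module, not part of Swarm.

  # Pulls every nodedown message that is currently in the caller's mailbox,
  # without blocking, and returns the affected nodes in arrival order.
  # Other messages are left in the mailbox untouched.
  def drain_nodedowns(acc \\ []) do
    receive do
      {:nodedown, node} -> drain_nodedowns([node | acc])
      {:nodedown, node, _info} -> drain_nodedowns([node | acc])
    after
      0 -> Enum.reverse(acc)
    end
  end

  # Splits the broadcast targets into nodes still believed to be up and
  # nodes for which a nodedown message was already waiting.
  def live_targets(nodes) do
    down = drain_nodedowns()
    {nodes -- down, down}
  end
end
```

The broadcast would then go only to the first element of the returned tuple, and the second element would be fed into whatever the Tracker normally does on nodedown.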

Oct 02 '17 09:10 malmovich

Definitely an issue we'll want to address. One of the reasons we're using sbcast is because we want the guarantee that when it returns, we know that all of the servers have received the message - with abcast we don't get that guarantee. This may be fine to sacrifice though, since Swarm is intended to deal with lost broadcasts by synchronizing periodically, but it requires some review.

We do prioritize the reception of nodedown messages over others generally, but inspecting the mailbox isn't a bad idea here - that said, it is still a race condition, since we could look at the mailbox just before the message is received and end up in the same situation.

We probably need to evaluate how we can remain responsive to some events, while still blocking on things which require synchronization - there are some practical limits here though, as the state machine is already very complex, and I'm not sure whether it's feasible to break it down into smaller state machines.

@slashdotdash I think it's worth exploring the abcast route first, and if that has issues, we can try and work out a better alternative.

Oct 02 '17 17:10 bitwalker

Sounds good 👍 I look forward to any further progress :) We have forked swarm and will be trying out the abcast fix for now, since the blocking issue seems more severe to us than losing the guarantee that sbcast provides. We're using this in production with several thousand concurrent users, so if there are bugs, we will most likely discover them at some point. I will report back if we find any issues.

Oct 03 '17 08:10 malmovich

I believe we're being affected by the same issue. We've got an AWS autoscaling group (using libcluster + libcluster_ec2), and after a few cycles of machines entering/exiting the scaling group, Swarm.Tracker.whereis(name) hangs in the infinite-timeout call to GenStateMachine.

I'm likewise going to patch swarm's Tracker with abcast as @malmovich indicated and perform some additional testing.

Jan 17 '18 19:01 suddenrushofsushi

@malmovich @suddenrushofsushi

We are using k8s and we are also being affected by the same problem! How is the abcast patch working for you so far?

@bitwalker any plans or ideas for an official fix for this? This is quite a critical bug for us, since it can halt the Tracker from working!

Sep 04 '18 09:09 picaoao