
When slave host goes down (but no shutdown command is issued), slave should be marked as is_alive: false

Open tjlee0909 opened this issue 9 years ago • 5 comments

tjlee0909 avatar Jan 12 '16 22:01 tjlee0909

Slaves do get marked as is_alive: false, but not until the master tries to allocate the slave to a build, and even then it takes 15 minutes due to #317.

We should have the master occasionally check slave connectivity so that we know slaves' online/offline status before the cluster is under load.

josephharrington avatar Aug 24 '16 19:08 josephharrington

The fix for this may (or may not) be related to #214. Both involve the master and slaves checking, while idle, that the other side is still up and running.

josephharrington avatar Aug 25 '16 05:08 josephharrington

@nadeemahmad

wjdhollow avatar Aug 26 '16 17:08 wjdhollow

@josephharrington mentioned that, to reduce the number of messages the master needs to send, it could keep track of the last time it heard from each slave and use that to determine which nodes to ping.

wjdhollow avatar Aug 26 '16 19:08 wjdhollow

On the master side, we can keep track of the last time the master heard from each slave. Then we'd add a thread on the master that iterates through the list of slaves it hasn't heard from in a while and does an is_alive check on each one.

This approach would take advantage of the fact that we don't need to dumbly ping slaves that are actively reporting results. Additionally, addressing #214 would ensure that, even when all slaves are idle, the master only actively pings slaves that are misbehaving (haven't phoned home recently). In that case, the master's threshold for pinging a slave it thinks has died should be greater than the interval at which the slave is configured to phone home.
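
A rough sketch of the idea above, in case it helps the discussion. Everything here is hypothetical: `SlaveLivenessTracker`, `record_contact`, `ping_fn`, and `stale_after_secs` are made-up names for illustration, not existing ClusterRunner code, and `ping_fn` stands in for whatever connectivity check the master would actually use.

```python
import threading
import time


class SlaveLivenessTracker:
    """Hypothetical helper, not part of ClusterRunner's actual API.

    Records when the master last heard from each slave and pings only
    the slaves that have been silent longer than a threshold.
    """

    def __init__(self, stale_after_secs, ping_fn, clock=time.monotonic):
        # stale_after_secs: assumed threshold; it should exceed the
        # interval at which slaves phone home.
        # ping_fn(slave_id) -> bool: assumed callable that returns True
        # if the slave responded to a connectivity check.
        self._stale_after_secs = stale_after_secs
        self._ping_fn = ping_fn
        self._clock = clock
        self._last_heard = {}
        self._is_alive = {}
        self._lock = threading.Lock()

    def record_contact(self, slave_id):
        """Call whenever the master receives anything from a slave."""
        with self._lock:
            self._last_heard[slave_id] = self._clock()
            self._is_alive[slave_id] = True

    def stale_slaves(self):
        """Return slaves still marked alive that have been silent too long."""
        now = self._clock()
        with self._lock:
            return [
                slave_id
                for slave_id, heard_at in self._last_heard.items()
                if self._is_alive[slave_id]
                and now - heard_at > self._stale_after_secs
            ]

    def check_stale_slaves(self):
        """Ping each stale slave; mark it dead if it does not respond."""
        for slave_id in self.stale_slaves():
            # The ping happens outside the lock so a slow network call
            # does not block record_contact().
            if self._ping_fn(slave_id):
                self.record_contact(slave_id)
            else:
                with self._lock:
                    self._is_alive[slave_id] = False

    def is_alive(self, slave_id):
        with self._lock:
            return self._is_alive.get(slave_id, False)

    def run_forever(self, poll_interval_secs, stop_event):
        """Body for the background thread on the master."""
        while not stop_event.wait(poll_interval_secs):
            self.check_stale_slaves()
```

The master would call `record_contact` from any handler that receives a slave request (results, heartbeats, and so on) and start `run_forever` in a daemon thread with a `threading.Event` for shutdown. Slaves that are busy reporting results never become stale, so they are never pinged.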

josephharrington avatar Aug 26 '16 20:08 josephharrington