ClusterRunner
When a slave host goes down (but no shutdown command is issued), the slave should be marked as is_alive: false
Slaves do get marked as is_alive: false, but not until the master tries to allocate the slave to a build. And then it takes 15 minutes due to #317.
We should have the master occasionally check slave connectivity so that we know slaves' online/offline status before the cluster is under load.
The fix for this may (or may not) be related to #214. Both concern the master and slaves checking that the other side is still up and running while idle.
@nadeemahmad
@josephharrington mentioned that to reduce the number of messages the master needs to send, it could keep track of when it last heard from a slave to determine which nodes to ping.
On the master side, we can keep track of the last time it heard from each slave. Then we'd add a thread on the master that iterates through the list of slaves it hasn't heard from in a while and does an is_alive check on each one.
This approach would take advantage of the fact that we don't need to dumbly ping slaves that are actively reporting results. Additionally, addressing #214 would ensure that even when all slaves are idle, the master only actively pings slaves that are misbehaving (haven't phoned home recently). In that case, the master's threshold for pinging a slave it thinks has died should be greater than the interval at which the slave is configured to phone home.
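A rough sketch of the bookkeeping described above, in case it helps the discussion. Everything here is hypothetical and not existing ClusterRunner code: the `SlaveLivenessTracker` class, its method names, and the `ping` callable (which stands in for whatever is_alive check the master would actually perform) are assumptions made for illustration.

```python
import threading
import time
from typing import Callable, Dict, List


class SlaveLivenessTracker:
    """Hypothetical helper: remembers when the master last heard from each slave."""

    def __init__(self, ping_threshold_secs: float, slave_heartbeat_secs: float,
                 clock: Callable[[], float] = time.monotonic):
        # The ping threshold must exceed the slave's phone-home interval,
        # otherwise healthy idle slaves would be pinged needlessly.
        if ping_threshold_secs <= slave_heartbeat_secs:
            raise ValueError('ping threshold must exceed the slave heartbeat interval')
        self._threshold = ping_threshold_secs
        self._clock = clock
        self._lock = threading.Lock()
        self._last_heard: Dict[str, float] = {}
        self.is_alive: Dict[str, bool] = {}

    def record_contact(self, slave_id: str) -> None:
        """Call whenever any message (result, heartbeat) arrives from a slave."""
        with self._lock:
            self._last_heard[slave_id] = self._clock()
            self.is_alive[slave_id] = True

    def stale_slaves(self) -> List[str]:
        """Return the slaves that have been silent for longer than the threshold."""
        with self._lock:
            now = self._clock()
            return [slave_id for slave_id, heard_at in self._last_heard.items()
                    if now - heard_at > self._threshold]

    def check_stale_slaves(self, ping: Callable[[str], bool]) -> None:
        """Ping only the silent slaves and mark unreachable ones as not alive."""
        for slave_id in self.stale_slaves():
            if ping(slave_id):  # assumed to return True if the slave responds
                self.record_contact(slave_id)
            else:
                with self._lock:
                    self.is_alive[slave_id] = False
```

The master-side thread would then just call `check_stale_slaves` on a timer. Slaves that are busy reporting results keep refreshing their timestamp through `record_contact`, so they never get pinged.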