controller: Proactively disconnect node based on heartbeat
This adds a couple of new options to controller: HeartbeatInterval and NodeHeartbeatThreshold, which can be used to actively disconnect node based on the last sent heartbeat. The controller periodically checks the last seen timestamp of nodes, and if it was sent before NodeHeartbeatThreshold, the controller treats it as disconnected.
Fixes: #857
This is an early draft of #857, where I have a couple of questions to clarify before writing the integration tests:
- Do we want to send D-Bus signal from the controller as well?
- In the
Disconnectedhandler (node_disconnected), there is a conditional depending onnode->namewith additional logic tocontroller_remove_node. Can't this be consolidated withcontroller_remove_node? - The
last_seentimestamp was previously recorded as seconds; now we need to record it as at least milliseconds for the new options. Should we unify the representation tostruct timespec?
Sorry if these are too newbie questions, but I would appreciate any feedback.
Sorry for the late reply @ueno
Sorry if these are too newbie questions, but I would appreciate any feedback.
No worries about that. We are always happy to help :)
Do we want to send D-Bus signal from the controller as well?
When we force node disconnect, we'll emit signals on the controller and the agent that it disconnected. I think these are sufficient. We could add a reason for disconnecting, but I'd consider this out of scope for this task (maybe a nice future enhancement).
In the Disconnected handler (node_disconnected), there is a conditional depending on node->name with additional logic to controller_remove_node. Can't this be consolidated with controller_remove_node?
Maybe it could be consolidated, but that could become a quite tricky change. Since it is not necessary for the issue this PR tries to implement, I'd rather evaluate and tackle it in a different issue.
The last_seen timestamp was previously recorded as seconds; now we need to record it as at least milliseconds for the new options. Should we unify the representation to struct timespec?
Makes sense to me. @mkemel What do you think?
coverage: 84.218% (-0.6%) from 84.834% when pulling e6a0f1bfc20b49675363b9185749cc5178ed58bc on ueno:wip/controller-heartbeat into 81463c4ae1a2b12090430a16f1d1aac32b7e4a43 on eclipse-bluechi:main.
The RFE this PR targets has been implemented in #911. Closing.