bluechi icon indicating copy to clipboard operation
bluechi copied to clipboard

controller: Proactively disconnect node based on heartbeat

Open ueno opened this issue 1 year ago • 3 comments

This adds a couple of new options to controller: HeartbeatInterval and NodeHeartbeatThreshold, which can be used to actively disconnect node based on the last sent heartbeat. The controller periodically checks the last seen timestamp of nodes, and if it was sent before NodeHeartbeatThreshold, the controller treats it as disconnected.

Fixes: #857

ueno avatar Apr 08 '24 10:04 ueno

This is an early draft of #857, where I have a couple of questions to clarify before writing the integration tests:

  • Do we want to send D-Bus signal from the controller as well?
  • In the Disconnected handler (node_disconnected), there is a conditional depending on node->name with additional logic to controller_remove_node. Can't this be consolidated with controller_remove_node?
  • The last_seen timestamp was previously recorded as seconds; now we need to record it as at least milliseconds for the new options. Should we unify the representation to struct timespec?

Sorry if these are too newbie questions, but I would appreciate any feedback.

ueno avatar Apr 08 '24 10:04 ueno

Sorry for the late reply @ueno

Sorry if these are too newbie questions, but I would appreciate any feedback.

No worries about that. We are always happy to help :)

Do we want to send D-Bus signal from the controller as well?

When we force node disconnect, we'll emit signals on the controller and the agent that it disconnected. I think these are sufficient. We could add a reason for disconnecting, but I'd consider this out of scope for this task (maybe a nice future enhancement).

In the Disconnected handler (node_disconnected), there is a conditional depending on node->name with additional logic to controller_remove_node. Can't this be consolidated with controller_remove_node?

Maybe it could be consolidated, but that could become a quite tricky change. Since it is not necessary for the issue this PR tries to implement, I'd rather evaluate and tackle it in a different issue.

The last_seen timestamp was previously recorded as seconds; now we need to record it as at least milliseconds for the new options. Should we unify the representation to struct timespec?

Makes sense to me. @mkemel What do you think?

engelmi avatar Apr 11 '24 09:04 engelmi

Coverage Status

coverage: 84.218% (-0.6%) from 84.834% when pulling e6a0f1bfc20b49675363b9185749cc5178ed58bc on ueno:wip/controller-heartbeat into 81463c4ae1a2b12090430a16f1d1aac32b7e4a43 on eclipse-bluechi:main.

coveralls avatar Apr 11 '24 10:04 coveralls

The RFE this PR targets has been implemented in #911. Closing.

engelmi avatar Jun 26 '24 10:06 engelmi