scheduling
InfrastructureManager fields are not updated on Node reconnection
When a Node is reconnecting, the infrastructure internals are not updated.
Let's say that a Scheduler is deployed along with 2 local nodes. In that case, the LocalInfrastructureManager is used. This infrastructure uses a counter to track how many nodes are deployed and to decide when to terminate the Node process.
There are situations where a Node is killed (e.g. in TestForkedTaskWorkingDir, only the ProActive node is killed but not the whole process) and detected as down. When a node is detected as down, the fields in the associated infrastructure are updated (by calling LocalInfrastructure#removeNode). However, since the Node process is still up, it is still able to report its availability. This makes the Scheduler detect a reconnection. During this reconnection, the infrastructure counter is not updated. A few seconds later, while Nodes are pinged, the Scheduler detects the Node as down again. As a consequence, LocalInfrastructure#removeNode is called, which decrements the internal counter. Furthermore, since the counter is now equal to 0, the local Nodes process is killed even though another node is still alive :bangbang:
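The sequence above can be replayed with a small model. This is only a sketch under assumptions: `ReconnectionBugSketch` and `CountingInfrastructure` are hypothetical stand-ins and not the real ProActive classes. The only behaviour taken from the description is that removing a node decrements a counter, that the process is killed when the counter reaches 0, and that a reconnection does not touch the counter.

```java
// Simplified model of the scenario described above.
// All names are hypothetical; this is not the actual ProActive code.
class ReconnectionBugSketch {

    /** Stand-in for a local infrastructure that counts its deployed nodes. */
    static class CountingInfrastructure {
        private int aliveNodes;
        private boolean processKilled;

        CountingInfrastructure(int deployedNodes) {
            this.aliveNodes = deployedNodes;
        }

        /** Called each time a node is detected as down. */
        void removeNode() {
            aliveNodes--;
            if (aliveNodes == 0) {
                // The shared Node process is terminated once no node is counted.
                processKilled = true;
            }
        }

        int aliveNodes() {
            return aliveNodes;
        }

        boolean isProcessKilled() {
            return processKilled;
        }
    }

    /** Replays: 2 nodes deployed, node A down, node A reconnects, node A down again. */
    static CountingInfrastructure replayScenario() {
        CountingInfrastructure infra = new CountingInfrastructure(2);
        infra.removeNode(); // node A detected as down: counter goes from 2 to 1
        // node A reconnects: the infrastructure is not notified, counter stays at 1
        infra.removeNode(); // node A detected as down again: counter goes from 1 to 0
        return infra;
    }

    public static void main(String[] args) {
        CountingInfrastructure infra = replayScenario();
        System.out.println("counter = " + infra.aliveNodes());
        System.out.println("process killed = " + infra.isProcessKilled());
    }
}
```

In this model the counter ends at 0 and the process is flagged as killed, although node B was never reported as down.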
More generally, the problem is that a node reconnection is not reported to the infrastructure associated with the Node that reconnects. A solution is to add a new method to the InfrastructureManager class that is called on Node reconnection and overridden by the infrastructures that need to update their internals.
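A rough sketch of what such a hook could look like, again under assumptions: the method name `onNodeReconnection`, the base class `InfrastructureSketch` and the subclass `CountingInfrastructure` are all hypothetical and only illustrate the idea. The actual method name and signature in InfrastructureManager would be up to the implementation.

```java
// Sketch of the proposed fix. All names are hypothetical, not the ProActive API.
class ReconnectionHookSketch {

    /** Stand-in for the InfrastructureManager base class. */
    static abstract class InfrastructureSketch {
        abstract void removeNode();

        /** Hypothetical hook called on Node reconnection; does nothing by default. */
        void onNodeReconnection() {
        }
    }

    /** Stand-in for a local infrastructure that counts its deployed nodes. */
    static class CountingInfrastructure extends InfrastructureSketch {
        private int aliveNodes;
        private boolean processKilled;

        CountingInfrastructure(int deployedNodes) {
            this.aliveNodes = deployedNodes;
        }

        @Override
        void removeNode() {
            aliveNodes--;
            if (aliveNodes == 0) {
                processKilled = true;
            }
        }

        /** Restores the counter so that it matches the nodes actually alive. */
        @Override
        void onNodeReconnection() {
            aliveNodes++;
        }

        int aliveNodes() {
            return aliveNodes;
        }

        boolean isProcessKilled() {
            return processKilled;
        }
    }

    /** Replays the same scenario, this time notifying the infrastructure. */
    static CountingInfrastructure replayScenario() {
        CountingInfrastructure infra = new CountingInfrastructure(2);
        infra.removeNode();         // node A detected as down: 2 -> 1
        infra.onNodeReconnection(); // node A reconnects: 1 -> 2
        infra.removeNode();         // node A detected as down again: 2 -> 1
        return infra;
    }

    public static void main(String[] args) {
        CountingInfrastructure infra = replayScenario();
        System.out.println("counter = " + infra.aliveNodes());
        System.out.println("process killed = " + infra.isProcessKilled());
    }
}
```

With the hook in place the counter ends at 1 in this model, so the process hosting the remaining node is left running. Keeping the default implementation empty means infrastructures that do not track such state would not need to change.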