InfrastructureManager fields are not updated on Node reconnection

Open lpellegr opened this issue 7 years ago • 0 comments

When a Node is reconnecting, the infrastructure internals are not updated.

Let's say that a Scheduler is deployed along with 2 local nodes. In that case, the LocalInfrastructureManager is used. This infrastructure relies on a counter to track how many nodes are deployed and to decide when to terminate the Node process.

There are situations where a Node is killed (e.g. in TestForkedTaskWorkingDir, only the ProActive node is killed, not the whole process) and detected as down. When a node is detected as down, the fields in the associated infrastructure are updated (by calling LocalInfrastructure#removeNode). However, since the Node process is still up, it can still report its availability, which makes the Scheduler detect a reconnection. During this reconnection, the infrastructure counter is not updated. A few seconds later, while Nodes are pinged, the Scheduler detects the Node as down again. As a consequence, LocalInfrastructure#removeNode is called a second time, which decrements the internal counter once more. Furthermore, since the counter is now equal to 0, the local Node process is killed even though another node is still alive :bangbang:
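The sequence above can be replayed with a small, self-contained sketch. `CountingInfrastructureSketch` and its methods are hypothetical stand-ins written for this illustration; they are not the real LocalInfrastructure code, and they only assume the behaviour described in this issue (a counter decremented in `removeNode`, and the process killed when it reaches 0).

```java
// Hypothetical stand-in for the local infrastructure described in this issue.
// It is NOT the real ProActive class; it only models the node counter.
class CountingInfrastructureSketch {
    private int aliveNodes;
    private boolean processKilled = false;

    CountingInfrastructureSketch(int deployedNodes) {
        this.aliveNodes = deployedNodes;
    }

    // Called each time the Scheduler detects a node as down.
    void removeNode() {
        aliveNodes--;
        if (aliveNodes == 0) {
            // Assumption: the whole Node process is terminated at this point.
            processKilled = true;
        }
    }

    int getAliveNodes() {
        return aliveNodes;
    }

    boolean isProcessKilled() {
        return processKilled;
    }

    // Replays the scenario: 2 nodes, node A goes down, reconnects
    // (counter untouched), then is detected as down a second time.
    static CountingInfrastructureSketch replayScenario() {
        CountingInfrastructureSketch infra = new CountingInfrastructureSketch(2);
        infra.removeNode(); // node A detected down: counter goes 2 -> 1
        // node A reconnects here, but nothing tells the infrastructure
        infra.removeNode(); // node A detected down again: counter goes 1 -> 0
        return infra;
    }
}
```

After `replayScenario()`, the counter is 0 and the process is flagged as killed, although node B was never reported down.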

More generally, the problem is that a node reconnection is not reported to the infrastructure associated with the Node that reconnects. A solution is to add a new method to the InfrastructureManager class that is called on Node reconnection and overridden by the infrastructures that need to update their internals.
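A minimal sketch of what such a hook could look like is given below. All names (`InfrastructureManagerSketch`, `LocalInfrastructureSketch`, `onNodeReconnection`) are hypothetical and chosen for illustration; the actual method name and signature in InfrastructureManager remain to be decided.

```java
// Hypothetical base class standing in for InfrastructureManager.
abstract class InfrastructureManagerSketch {
    // Called when a node is detected as down.
    abstract void removeNode(String nodeUrl);

    // Proposed hook (hypothetical name): called on Node reconnection.
    // No-op by default, so infrastructures without internal state are unaffected.
    void onNodeReconnection(String nodeUrl) {
    }
}

// Hypothetical local infrastructure that keeps a counter of alive nodes.
class LocalInfrastructureSketch extends InfrastructureManagerSketch {
    private int aliveNodes;
    private boolean processKilled = false;

    LocalInfrastructureSketch(int deployedNodes) {
        this.aliveNodes = deployedNodes;
    }

    @Override
    void removeNode(String nodeUrl) {
        aliveNodes--;
        if (aliveNodes == 0) {
            // Assumption: the Node process is terminated once no node is left.
            processKilled = true;
        }
    }

    @Override
    void onNodeReconnection(String nodeUrl) {
        // Restore the counter so that a later removeNode stays balanced.
        aliveNodes++;
    }

    int getAliveNodes() {
        return aliveNodes;
    }

    boolean isProcessKilled() {
        return processKilled;
    }
}
```

With this hook, the down, reconnect, down sequence on one of two nodes leaves the counter at 1, so the process hosting the other node is not killed.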

lpellegr, Mar 08 '17 10:03