Task Runner: Intermittent: Aggregator process is killed after restart without any error/exception log
Describe the bug
This is an intermittent issue: after the aggregator process is restarted (by killing the process ID and starting it again), it gets killed on its own with no error/exception logs to indicate the reason.
The resiliency test failing because of this is part of the PR and PQ pipelines, which are otherwise quite stable.
To Reproduce
Steps to reproduce the behavior:
- Start the federation with torch/mnist, 2 collaborators, and 10+ rounds.
- Ensure that the rounds are increasing.
- Restart the aggregator.
- The aggregator silently dies while the collaborators keep running and trying to connect to it.
Example failures:
- When only the aggregator restarts - https://github.com/securefederatedai/openfl/actions/runs/14839141823/job/41657945065#step:4:205
- When the aggregator and all collaborators restart - https://github.com/securefederatedai/openfl/actions/runs/15014267823/job/42188592296#step:4:322
aggregator.log - where `Starting the Aggregator Service.` appears three times, indicating three starts/restarts, but there is no error, exception, etc.
Expected behavior
Regardless of how many times or at what stage a participant restarts, it should be able to come back up and rejoin the federation.
Thanks for spotting this and collecting all the relevant logs, @noopurintel! So the issue occurs when you restart all participants, rather than the aggregator individually, right?
Have you seen it happen when you restart the aggregator only?
Yes, the problem occurs on aggregator restart - with or without the collaborators restarting. I have updated the bug with both examples.
Thanks for clarifying. I think this is a more realistic scenario than all nodes suddenly restarting in a distributed (and even decentralized) setup.
While we figure out how to prioritize and assign this issue, could you try replacing the kill with a graceful process stop? Doing so may help isolate the issue, as it is possible that some data corruption occurs due to the sudden interruption...
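For context, a minimal sketch of the difference between the two restart styles. The child process below is only a stand-in for the aggregator (it is not OpenFL code); it shows that SIGKILL cannot be trapped, so any shutdown handler is skipped, while SIGTERM lets the handler flush state before exiting:

```python
import signal
import subprocess
import sys
import textwrap
import time

# Hypothetical stand-in for the aggregator: a child that installs a
# SIGTERM handler to flush state before exiting. Illustrative only.
CHILD = textwrap.dedent("""
    import signal, sys, time

    def on_term(signum, frame):
        print("flushing state before exit", flush=True)
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_term)
    print("aggregator up", flush=True)
    while True:
        time.sleep(0.1)
""")

def restart_style(sig):
    """Start the stub, send it the given signal, and collect its output."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    time.sleep(0.5)            # give the child time to start
    proc.send_signal(sig)
    out, _ = proc.communicate(timeout=5)
    return out, proc.returncode

# Hard kill (what the resiliency test does today): no cleanup runs.
out_kill, rc_kill = restart_style(signal.SIGKILL)
# Graceful stop (the suggested alternative): the handler runs.
out_term, rc_term = restart_style(signal.SIGTERM)

print(rc_kill, "flushing" in out_kill)   # -9 False
print(rc_term, "flushing" in out_term)   # 0 True
```

If the real aggregator writes checkpoints or round state on shutdown, a SIGKILL at the wrong moment could leave that state half-written, which is why the graceful-stop comparison may help isolate the failure.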
Update - I replaced the sudden interruption with a graceful stop but couldn't reproduce the issue so far. It only fails 2-3 times out of 10 anyway, so it is difficult to conclude anything.
I have shared the updated files with @payalcha and @noopurintel to capture the error log if the issue reproduces again. I will debug further once we have more logs.