Task Runner: Intermittent: Aggregator process is killed after restart without any error/exception log
Describe the bug
This is an intermittent issue: after the aggregator process is restarted (by killing the process ID and starting it again), it gets killed on its own with no error/exception logs to indicate the reason.
The resiliency test failing because of this is part of the PR and PQ pipelines, which are otherwise quite stable.
To Reproduce
Steps to reproduce the behavior:
- Start the federation with torch/mnist, 2 collaborators, and 10+ rounds.
- Ensure that the rounds are increasing.
- Restart the aggregator.
- The aggregator silently dies while the collaborators keep running and trying to connect to it.
Example failures:
- When only the aggregator restarts - https://github.com/securefederatedai/openfl/actions/runs/14839141823/job/41657945065#step:4:205
- When the aggregator and all collaborators restart - https://github.com/securefederatedai/openfl/actions/runs/15014267823/job/42188592296#step:4:322
aggregator.log - where `Starting the Aggregator Service.` appears three times, indicating three starts/restarts, but there is no error, exception, etc.
Expected behavior
Regardless of how many times or at what stage a participant restarts, it should be able to come back up and rejoin the federation.
Thanks for spotting this and collecting all the relevant logs, @noopurintel! So the issue occurs when you restart all participants, rather than the aggregator individually, right?
Have you seen it happen when you restart the aggregator only?
Yes, the problem occurs on aggregator restart - with or without the collaborators restarting. I have updated the bug with both examples.
Thanks for clarifying. I think this is a more realistic scenario than all nodes suddenly restarting in a distributed (and even decentralized) setup.
While we figure out how to prioritize and assign this issue, could you try replacing the kill with a graceful process stop? Doing so may help isolate the issue, as it is possible that some data corruption occurs due to the sudden interruption...
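For context, a minimal sketch of the difference between the two restart styles. The child process below is only a stand-in for the aggregator (it is not OpenFL code); it shows that SIGKILL cannot be trapped, so any shutdown handler is skipped, while SIGTERM lets the handler flush state before exiting:

```python
import signal
import subprocess
import sys
import textwrap
import time

# Hypothetical stand-in for the aggregator: a child that installs a
# SIGTERM handler to flush state before exiting. Illustrative only.
CHILD = textwrap.dedent("""
    import signal, sys, time

    def on_term(signum, frame):
        print("flushing state before exit", flush=True)
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_term)
    print("aggregator up", flush=True)
    while True:
        time.sleep(0.1)
""")

def restart_style(sig):
    """Start the stub, send it the given signal, and collect its output."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    time.sleep(0.5)            # give the child time to start
    proc.send_signal(sig)
    out, _ = proc.communicate(timeout=5)
    return out, proc.returncode

# Hard kill (what the resiliency test does today): no cleanup runs.
out_kill, rc_kill = restart_style(signal.SIGKILL)
# Graceful stop (the suggested alternative): the handler runs.
out_term, rc_term = restart_style(signal.SIGTERM)

print(rc_kill, "flushing" in out_kill)   # -9 False
print(rc_term, "flushing" in out_term)   # 0 True
```

If the real aggregator writes checkpoints or round state on shutdown, a SIGKILL at the wrong moment could leave that state half-written, which is why the graceful-stop comparison may help isolate the failure.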
Update - I replaced the sudden interruption with a graceful stop but couldn't reproduce the issue so far. It only fails 2-3 times out of 10 anyway, so it is difficult to conclude anything.
I have shared the updated files with @payalcha and @noopurintel to capture the error log if the issue reproduces again. I will debug further once we have more logs.