
FedCurv: ./start_director.sh: line 4: 10782 Killed fx director start --disable-tls -c director_config.yaml

Open · CasellaJr opened this issue 2 years ago · 2 comments

When I run my own experiment with the default FedAvg, I can run several rounds. I cannot use large networks because I run out of memory, but that is a separate problem. When I apply the FedCurv algorithm, my director node runs out of memory and outputs this error:

./start_director.sh: line 4: 10782 Killed fx director start --disable-tls -c director_config.yaml

Using htop on the director node and the envoy nodes, I can see that the envoys' RAM is not full, while the director's RAM grows round after round and never decreases. So the director crashes, and the envoys keep trying to connect to it without success. I have tried FedCurv both on my own examples and on your Histology tutorial notebook in interactive_api.

Moreover, when I inspect aggregation_function_obj.pkl, I see:

defaultdict(openfl.component.aggregation_functions.weighted_average.WeightedAverage, {'train': <openfl.component.aggregation_functions.fedcurv_weighted_average.FedCurvWeightedAverage at 0x7f0fa412e610>})

whereas the terminal logs still show openfl.component.aggregation_functions.weighted_average.WeightedAverage. This is a minor problem: I think FedCurv is applied (despite the error reported in this issue), but the terminal still reports the default aggregation function.
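The defaultdict repr above has two parts: a default_factory (WeightedAverage, the fallback for any task without an explicit entry) and a single explicit key, 'train', mapped to FedCurvWeightedAverage. A minimal sketch for checking which function each task actually gets, assuming the pickle holds a collections.defaultdict of that shape and that openfl is importable so the stored classes can be unpickled. The helper name describe_aggregation_functions is hypothetical, not an OpenFL API, and whether the log line reports the fallback rather than the per-task entry is an assumption, not something confirmed here.

```python
import pickle
from collections import defaultdict


def describe_aggregation_functions(path):
    """Summarize a pickled task -> aggregation-function mapping.

    Hypothetical helper. Assumes the file holds a collections.defaultdict
    whose default_factory is the fallback aggregation class and whose
    explicit keys are task names. Unpickling OpenFL objects requires
    openfl to be importable in the current environment.
    """
    with open(path, "rb") as f:
        obj = pickle.load(f)
    if not isinstance(obj, defaultdict):
        raise TypeError(f"expected a defaultdict, got {type(obj).__name__}")
    factory = obj.default_factory
    # Name of the fallback used for tasks that have no explicit entry.
    default_name = getattr(factory, "__qualname__", repr(factory))
    # Explicitly configured tasks: task name -> class name of the stored object.
    per_task = {task: type(fn).__name__ for task, fn in obj.items()}
    return default_name, per_task


# Usage (assumed file name, taken from the issue text):
# default_name, per_task = describe_aggregation_functions("aggregation_function_obj.pkl")
```

For the object shown above, this would be expected to report WeightedAverage as the default and FedCurvWeightedAverage for 'train'. Note that pickle.load can execute arbitrary code, so only load files you produced yourself.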

CasellaJr, Apr 07 '22 15:04

Hi @CasellaJr, since this issue has been addressed on the Slack channel, could you let us know whether it has been resolved?

mansishr, Apr 25 '22 13:04

Yes, sure. I solved it by giving my director 64 GB of RAM. It works now; however, I suggest improving OpenFL's memory usage, because needing 64 GB of RAM (or even more) to run a ResNet-18/50 seems a "little" bit strange to me.
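A back-of-envelope sketch of how director memory could reach that range, under a hypothetical model that is not confirmed OpenFL behavior. It assumes the director retains every collaborator's tensors for every round, which would match RAM that grows each round and never drops, and that values are float32 (4 bytes). It also assumes that FedCurv, as described in the FedCurv paper, sends the weights plus two per-parameter Fisher-derived tensors (u and v), so three tensors per parameter versus one for FedAvg. The collaborator count (3) and round count (50) are made-up inputs, and estimate_director_bytes is a hypothetical helper.

```python
def estimate_director_bytes(num_params, tensors_per_param, collaborators,
                            rounds_retained, bytes_per_value=4):
    """Hypothetical estimate: every collaborator's tensors kept for every round.

    num_params:        parameters in the model
    tensors_per_param: 1 for plain FedAvg, 3 for FedCurv (weights, u, v) - assumed
    bytes_per_value:   4 for float32
    """
    return (num_params * bytes_per_value * tensors_per_param
            * collaborators * rounds_retained)


# torchvision resnet18 with the default 1000-class head.
RESNET18_PARAMS = 11_689_512

# Made-up experiment size: 3 collaborators, 50 rounds retained.
fedavg = estimate_director_bytes(RESNET18_PARAMS, 1, 3, 50)
fedcurv = estimate_director_bytes(RESNET18_PARAMS, 3, 3, 50)
print(f"FedAvg:  {fedavg / 1e9:.1f} GB")   # FedAvg:  7.0 GB
print(f"FedCurv: {fedcurv / 1e9:.1f} GB")  # FedCurv: 21.0 GB
```

Under these assumptions, memory grows linearly with the number of retained rounds, and FedCurv triples the per-round cost relative to FedAvg. ResNet-50 has roughly 2.2 times as many parameters as ResNet-18, which would push the totals higher still.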

CasellaJr, Apr 25 '22 13:04