openfl
FedCurv: ./start_director.sh: line 4: 10782 Killed fx director start --disable-tls -c director_config.yaml
When I run my own experiment with the default FedAvg, I can complete several rounds, although I cannot use large networks because I run out of memory; that is a separate problem.
When I apply the FedCurv algorithm, my director node runs out of memory and outputs this error: ./start_director.sh: line 4: 10782 Killed fx director start --disable-tls -c director_config.yaml
Using htop on the director node and the envoy nodes, I can see that the RAM of the envoys is not full, while the RAM usage of the director node increases round after round and never decreases. So the director is killed, while the envoys keep trying to connect to it without success.
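The round-after-round growth seen in htop could also be logged rather than watched by eye. Below is a minimal sketch, assuming a Linux host where /proc/<pid>/status is available; the helper names (parse_vm_rss_kb, read_rss_kb, is_monotonic_growth) are hypothetical and not part of OpenFL.

```python
import re
from typing import List, Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the resident set size of a running process (Linux only).

    Returns None if the process is gone or /proc is not available.
    """
    try:
        with open(f"/proc/{pid}/status") as handle:
            return parse_vm_rss_kb(handle.read())
    except OSError:
        return None


def is_monotonic_growth(samples_kb: List[int]) -> bool:
    """True if no sample is smaller than the one before it."""
    return all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
```

Calling read_rss_kb with the director's PID once per round (or on a timer) and passing the collected values to is_monotonic_growth would give a record of whether memory is ever released between rounds.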
I have tried to apply FedCurv both on my own examples and on your Histology tutorial notebook in interactive_api.
Moreover, inspecting aggregation_function_obj.pkl, I can see defaultdict(openfl.component.aggregation_functions.weighted_average.WeightedAverage, {'train': <openfl.component.aggregation_functions.fedcurv_weighted_average.FedCurvWeightedAverage at 0x7f0fa412e610>}), while if I look at the logs in the terminal, I still see openfl.component.aggregation_functions.weighted_average.WeightedAverage. This is a minor problem: I think that FedCurv is applied (regardless of the error in this issue), but the terminal still prints that the aggregation function is the default one.
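For reading the repr above: a defaultdict returns the stored value for keys it contains and falls back to its factory for any other key. The sketch below illustrates only that lookup behaviour with stand-in classes; the classes and the 'validate' task name are hypothetical placeholders, not the OpenFL implementations, and it does not establish what the terminal log line actually reports.

```python
from collections import defaultdict


class WeightedAverage:
    """Stand-in for the default aggregation function (hypothetical)."""


class FedCurvWeightedAverage:
    """Stand-in for the FedCurv aggregation function (hypothetical)."""


# Same shape as the pickled object: the factory is the default class,
# and only the 'train' task has an explicit FedCurv instance.
aggregation_functions = defaultdict(
    WeightedAverage, {"train": FedCurvWeightedAverage()}
)


def describe(task_name: str) -> str:
    """Return the class name of the aggregation function used for a task."""
    return type(aggregation_functions[task_name]).__name__
```

With this structure, describe("train") resolves to the FedCurv stand-in, while any task without an explicit entry, such as the assumed "validate", resolves to the default class.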
Hi @CasellaJr, since this issue has been addressed on the Slack channel, could you let us know whether it got resolved?
Yes, sure. I solved it by giving my director 64 GB of RAM. It works now; however, I suggest improving the memory usage of OpenFL, because I think it is a "little" bit strange to need 64 GB of RAM (or even more) to run a ResNet18/50.