Scaling DP-SGD MNIST example

Open sikhapentyala opened this issue 3 years ago • 0 comments

What is your question?

Thank you for providing the framework!

I have a scenario with 100-1000 clients in which I need to train a logistic regression (LR) model with DP-SGD in a federated setup. I have tried running the DP-SGD MNIST example (https://github.com/adap/flower/tree/main/examples/dp-sgd-mnist) with 10, 20, 50, and 100 clients. I could run the example with up to 50 clients on an Azure F32sv2 server (32 vCPUs, 64 GB memory).

I get errors when I run more than 50 clients. The program hangs, and gRPC reports debug_error_string = "{"created":"@1649881623.129038715","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4142,"referenced_errors":[{"created":"@1649881623.129031215","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}".
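For reference, the nested status code in the string above can be extracted with a short script. This is only a minimal sketch using the Python standard library; `find_grpc_status` and `GRPC_STATUS_NAMES` are hypothetical names introduced here for illustration and are not part of Flower or gRPC. Status 14 is gRPC's UNAVAILABLE code, which may indicate that the clients could not reach the server rather than a failure in DP-SGD itself.

```python
import json

# Subset of the standard gRPC status codes (numeric value -> name).
GRPC_STATUS_NAMES = {
    0: "OK",
    4: "DEADLINE_EXCEEDED",
    8: "RESOURCE_EXHAUSTED",
    14: "UNAVAILABLE",
}


def find_grpc_status(node):
    """Return the first "grpc_status" found in a nested gRPC error dict, or None."""
    if "grpc_status" in node:
        return node["grpc_status"]
    # Errors are nested under "referenced_errors"; search them depth-first.
    for child in node.get("referenced_errors", []):
        status = find_grpc_status(child)
        if status is not None:
            return status
    return None


# The debug_error_string exactly as reported above.
debug_error_string = '{"created":"@1649881623.129038715","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4142,"referenced_errors":[{"created":"@1649881623.129031215","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}'

error = json.loads(debug_error_string)
status = find_grpc_status(error)
status_name = GRPC_STATUS_NAMES.get(status, "NOT_IN_THIS_TABLE")
print(status, status_name)
```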

Does the DP-SGD example scale to more clients or to a CNN model? What server configuration would you recommend for running this simulation with 100+ clients?

Thank you.

Apr 13 '22 20:04 sikhapentyala