FedML Cross-Silo Horizontal: Loading all datasets to all clients

From what I can see, in every cross-silo FL example, the datasets of all clients are loaded into memory. For example, here.

It would be nice to add a way for clients to load only their individual datasets, as that would probably be needed in production.

Oct 04 '22 18:10 rAlexandre00

@rAlexandre00 for production, you still use the same dataset structure, but set only the value for the specific client_index in each dictionary. For example, rank 1 loads train_data_dict[1] = rank_1_data_set; rank 2 loads train_data_dict[2] = rank_2_data_set. We prefer this design because it is compatible with both simulation in a GPU cluster and real distributed training on data silos.
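A minimal sketch of that idea, assuming each process knows its own rank and has already loaded its own silo. The helper name `build_local_dicts` and the toy data are hypothetical and not part of FedML; only the pattern of keying each dictionary by a single client index comes from the description above.

```python
# Hypothetical helper, not a FedML API: builds the per-client dictionaries
# so that they contain an entry only for the silo owned by this process.
def build_local_dicts(rank, train_data, test_data):
    train_data_local_dict = {rank: train_data}
    test_data_local_dict = {rank: test_data}
    # Sample count for this silo only; other clients' counts are unknown here.
    train_data_local_num_dict = {rank: len(train_data)}
    return train_data_local_num_dict, train_data_local_dict, test_data_local_dict


# Toy stand-ins for rank 1's local data (placeholders for real datasets).
rank_1_train = [("x0", "y0"), ("x1", "y1")]
rank_1_test = [("x2", "y2")]

num_dict, train_dict, test_dict = build_local_dicts(1, rank_1_train, rank_1_test)
print(num_dict)          # only key 1 is present
print(list(train_dict))  # only key 1 is present
```

With this layout, any lookup for a client index other than the process's own raises a KeyError, so it only works when the server never asks a silo for another client's data.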

Oct 06 '22 23:10 chaoyanghe

The FedMLRunner argument "dataset" contains fields such as "train_data_num", which can only be computed from the whole dataset. An example showing which parts of the dataset are actually needed would be helpful, or at least some documentation.

Oct 07 '22 20:10 rAlexandre00

It looks like the aggregator just assumes that every client holds every client's data. When I did what you suggested (setting only the specific client_index's value in each dictionary), the following error occurred on the client:

File "/home/x/FedML/python/fedml/cross_silo/client/fedml_trainer.py", line 49, in update_dataset
    self.train_local = self.train_data_local_dict[client_index]
KeyError: 993

This indicates that the server asked this client (which has rank 1) to use data from client 993. From what I can tell from the code, the server just assigns each silo random clients to train on.

Oct 08 '22 00:10 rAlexandre00

I am confused by your intention. If your goal is to simulate thousands of users, please set every key-value pair in the dataset dictionaries; the FedML engine will then switch/sample clients each round across a fixed number of workers. If your goal is to run in a real-world setting, each FL client/worker should hold only a single data silo (the question you asked).

The error means you have sampling enabled (e.g., sampling 10 clients per round from 1000 clients, with data for only 10 clients), but you did not provide the sampled clients' datasets in the "dataset" argument of FedMLRunner.

Oct 10 '22 04:10 chaoyanghe

Sorry, I probably did not explain my problem well enough. My goal is to run in a real-world setting where, as you said, each FL client/worker holds only a single data silo. I understand that my problem is that I have sampling enabled. I am currently following this example. Do you have any example where it is disabled?

Oct 10 '22 13:10 rAlexandre00

@rAlexandre00 Hi, for production, if you want to train on 10 clients with 10 isolated devices, you can set these two arguments in "fedml_config.yaml" to the same value:

client_num_in_total: 10
client_num_per_round: 10

When client_num_per_round < client_num_in_total, client sampling is enabled.
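The effect of these two settings can be illustrated with a small sketch. This is not FedML's sampling code; `sample_clients` is a hypothetical function, and 0-based client indices plus per-round seeding are assumptions made for the example. It only shows why equal values avoid lookups for silos a client does not hold.

```python
import random


# Illustrative sampler (an assumption, not FedML's implementation).
def sample_clients(round_idx, client_num_in_total, client_num_per_round):
    if client_num_per_round >= client_num_in_total:
        # No sampling: every client index is used in every round.
        return list(range(client_num_in_total))
    # Sampling: pick a subset of client indices, seeded per round.
    rng = random.Random(round_idx)
    return sorted(rng.sample(range(client_num_in_total), client_num_per_round))


# Equal values: all 10 clients participate, each training on its own silo.
full = sample_clients(0, 10, 10)

# Unequal values: indices are drawn from all 1000 clients, so a process
# holding only one silo will usually be asked for data it does not have.
sampled = sample_clients(0, 1000, 10)
local_dict = {1: "rank_1_data_set"}
missing = [i for i in sampled if i not in local_dict]
print(full)
print(missing)  # indices that would raise KeyError on this client
```

In the second case almost every sampled index is missing from the single-silo dictionary, which matches the KeyError reported earlier in the thread.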

Nov 07 '22 18:11 chaoyanghe