FedAvg accuracy stuck under 50%
I am training FedAvg to reproduce the benchmark accuracy with the given parameters, but the accuracy is stuck under 50%.
Here is all my code:
!git clone https://github.com/FedML-AI/FedML
cd /content/FedML/fedml_experiments/standalone/fedavg
!python main_fedavg.py --model mobilenet --dataset cifar10 --data_dir ./../../../data/cifar10 --partition_method hetero --comm_round 100 --epochs 20 --batch_size 64 --lr 0.001
I am supposed to get at least 80% accuracy, according to these benchmark results:
https://wandb.ai/automl/fedml/runs/390hdz0e
Same issue here! I have also tried ResNet56 on CIFAR-10 with the given hyper-parameters, but only got 41% test accuracy with the Adam optimizer and 20% with the SGD optimizer.
Our result is based on the distributed version, while you are running the standalone version. Let me check what the code difference is here.
Actually I was running the distributed version here. Here is my cmd:
sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" adam MPI grpc_ipconfig_test.csv 1
sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" sgd MPI grpc_ipconfig_test.csv 1
@AbdulMoqeet @hangxu0304 Hi, could you try again with a smaller number of local epochs, e.g. E=1? A large number of local epochs usually makes training harder to converge.
Yes, I have also tested epochs=1 for ResNet56 on CIFAR-10. A smaller number of local epochs (E=1) indeed gives better accuracy (57% after 100 rounds), but it is still far from the benchmark result (87% after 100 rounds). You can check the details in my wandb report.
I tried the code from an earlier commit, and the accuracy can be reproduced. I guess there might be some inconsistency between the earlier and the latest commit. @chaoyanghe
@hangxu0304 Could you please share the hyperparameters or a wandb report? The default hyperparameters (client numbers) differ between the two scripts, and there is an additional parameter (# of local points) in the earlier commit.
@hangxu0304 I see. Could you help to figure out the difference?
I was running the distributed version. First, you need to check out this commit. Then:
cd FedML/fedml_experiments/distributed/fedavg
sh run_fedavg_distributed_pytorch.sh 4 3 resnet56 hetero 2000 1 64 0.001 cifar10 ./../../../data/cifar10
/main_fedavg.py --gpu_server_num 4 --gpu_num_per_server 3 --model resnet56 --dataset cifar10 --data_dir ./../../../data/cifar10 --partition_method hetero --client_number 11 --comm_round 2000 --epochs 1 --batch_size 64 --lr 0.001
Other parameters (e.g. local points) are kept at their defaults.
The result is shown on the right side in this report.
@chaoyanghe I did some checking. The optimizer, model, and dataset are the same, so it might be something else.
Great! Thanks for sharing. The report is no longer available.
I've also got the CIFAR-10 results with epoch = 1.
Now I'm trying to reproduce CIFAR-100.
@hangxu0304 @AbdulMoqeet Hi, I found a possible reason for this bug. Please compare these two pieces of code.
Original version: https://github.com/FedML-AI/FedML/blob/50d8a45d27675343a7b05a9b31279f6764d3f2ad/fedml_api/standalone/fedavg/fedavg_trainer.py#L45
Current version: https://github.com/FedML-AI/FedML/blob/8ccc24cf2c01b868988f5d5bd65f1666cf5526bc/fedml_api/standalone/fedavg/fedavg_api.py#L64
In the original version, the global model is deepcopied before being loaded into the clients. In the current version, however, each local client just loads the global model directly (without a deepcopy), so local training on every client may update the shared global model.
I'm not completely sure the bug is caused by this. Could you please change the current code to make a deepcopy of the global model and see whether the result is correct?
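For reference, here is a minimal, self-contained sketch of the suggested change. The variable names, toy model, and uniform aggregation are illustrative assumptions, not taken from fedavg_api.py:

import copy
import torch
import torch.nn as nn

# Toy stand-ins for the real FedML components (illustrative only).
def local_train(model, data, target, lr=0.001, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(data), target).backward()
        opt.step()
    return model.state_dict()

global_model = nn.Linear(10, 2)  # stands in for ResNet56 / MobileNet
clients = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(3)]

# Buggy pattern: passing global_model itself means every client trains the
# same object in place, so later clients start from earlier clients' updates
# instead of the broadcast global weights.
#
# Fixed pattern: each client trains its own deep copy of the global model.
w_locals = []
for data, target in clients:
    local_model = copy.deepcopy(global_model)
    w_locals.append(local_train(local_model, data, target))

# FedAvg aggregation (uniform average here; FedML weights by sample counts).
w_avg = {k: sum(w[k] for w in w_locals) / len(w_locals) for k in w_locals[0]}
global_model.load_state_dict(w_avg)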
This might be true for the standalone version. But in the distributed version, each client only needs to update the global model and then upload it to the server. My previous results showed that the distributed version also has this low-accuracy issue.
Do you mean you cannot get accuracy similar to the benchmark results of the distributed version, even with the same hyper-parameters?
Right. Please check my previous comments.
@hangxu0304 For the distributed implementation, I found these differences:
https://github.com/FedML-AI/FedML/blob/8ccc24cf2c01b868988f5d5bd65f1666cf5526bc/fedml_api/standalone/fedavg/my_model_trainer_classification.py#L44
https://github.com/FedML-AI/FedML/blob/50d8a45d27675343a7b05a9b31279f6764d3f2ad/fedml_api/distributed/fedavg/FedAVGTrainer.py#L29
In the original version there is no gradient clipping, but in the current version the gradients are clipped. This could be another possible cause; everything else seems to be the same.
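For illustration, here is a minimal PyTorch training step showing the kind of difference described above; the toy model and the max_norm value are assumptions, not values taken from the repo:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # toy model standing in for ResNet56 / MobileNet
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = criterion(model(x), y)

optimizer.zero_grad()
loss.backward()
# The current trainer clips gradients before the optimizer step; the original
# trainer calls optimizer.step() directly without this line.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()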
I have run some new experiments, and the results verify that the missing deepcopy of the global model in the standalone version does indeed cause a bug. However, I cannot merge my code into this version right now because I'm waiting for other experiment results for some new papers. Maybe you can fix it yourselves first for current usage @chaoyanghe @AbdulMoqeet.
@AbdulMoqeet @hangxu0304 @wizard1203 Hi All, what's the final conclusion?
Hi @AbdulMoqeet, have you reproduced the result of CIFAR-100 with local epoch = 1?
Hi @hangxu0304, you got low accuracy because you set ci=1 (the last hyper-parameter in your script), which is only meant for the sanity check. When ci=1, we skip a lot of the repeated computation:
def test_on_server_for_all_clients(self, round_idx):
    if self.trainer.test_on_the_server(self.train_data_local_dict, self.test_data_local_dict, self.device, self.args):
        return

    if round_idx % self.args.frequency_of_the_test == 0 or round_idx == self.args.comm_round - 1:
        logging.info("################test_on_server_for_all_clients : {}".format(round_idx))
        train_num_samples = []
        train_tot_corrects = []
        train_losses = []
        for client_idx in range(self.args.client_num_in_total):
            # train data
            metrics = self.trainer.test(self.train_data_local_dict[client_idx], self.device, self.args)
            train_tot_correct, train_num_sample, train_loss = metrics['test_correct'], metrics['test_total'], metrics['test_loss']
            train_tot_corrects.append(copy.deepcopy(train_tot_correct))
            train_num_samples.append(copy.deepcopy(train_num_sample))
            train_losses.append(copy.deepcopy(train_loss))

            """
            Note: the CI environment is CPU-based computing.
            The training speed for RNN training is too slow in this setting, so we only test one client to make sure there is no programming error.
            """
            if self.args.ci == 1:
                break
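So for a full benchmark run, the last argument (ci) should be 0 rather than 1, e.g. (assuming the same argument order as in your earlier commands, shown here for the SGD run):
sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" sgd MPI grpc_ipconfig_test.csv 0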
I'm trying to check whether gradient clipping is the cause. The experiment is still running. Let's see.
@chaoyanghe Due to limited resources, I was running on Colab. I used the following command with 20 epochs:
./main_fedavg.py --gpu 0 --dataset cifar100 --data_dir ./../../../data/cifar100 --model mobilenet --partition_method hetero --client_number 10 --comm_round 200 --epochs 20 --batch-size 64 --lr 0.001
It achieved reasonable accuracy, considering that the run crashed partway.
Here is the report: https://wandb.ai/amuqeet/fedml/runs/htvgqh83
Have you changed CI to 0? @hangxu0304
Yes.
@chaoyanghe I think CI only affects the training accuracy, not the test accuracy, right?
Both. Let's wait for your result.
@chaoyanghe The left one is the latest commit without gradient clipping, and the right one is the old commit. Both use the same hyper-parameter settings as the FedML ResNet56 CIFAR-10 benchmark. You can check the details in my wandb.
I cannot access your report.
@chaoyanghe That's strange. I already shared this report, and I can view it without logging in. Anyway, you can check the results I posted above. I'm wondering whether you can obtain good accuracy by running the latest code on your cluster. Also, the result from running the old commit doesn't show the same convergence as your benchmark result.
The detailed report cannot be accessed because the project is set to private; only these two graphs (reports) are accessible.
@hangxu0304 I reproduced the same result with the latest code in both the standalone and the distributed version. I am not sure what the difference is. If you could make your reports public, it would be very helpful. Thanks in advance.