
FedAvg accuracy stuck under 50%

Open AbdulMoqeet opened this issue 3 years ago • 38 comments

I am training FedAvg to reproduce the benchmark accuracy with the given parameters, but the accuracy is stuck under 50%.

Here is all my code:

!git clone https://github.com/FedML-AI/FedML

cd /content/FedML/fedml_experiments/standalone/fedavg

!python main_fedavg.py --model mobilenet --dataset cifar10 --data_dir ./../../../data/cifar10 --partition_method hetero --comm_round 100 --epochs 20 --batch_size 64 --lr 0.001

I am supposed to get at least 80% accuracy according to these benchmark results.

https://wandb.ai/automl/fedml/runs/390hdz0e

AbdulMoqeet avatar Apr 22 '21 01:04 AbdulMoqeet

Same issue here! I have also tried ResNet56 on CIFAR-10 with the given hyper-parameters, but only got 41% test accuracy with the Adam optimizer and 20% test accuracy with the SGD optimizer.

hangxu0304 avatar Apr 22 '21 10:04 hangxu0304

Our result is based on the distributed version. You are running the standalone version. Let me check what the code difference is here.

chaoyanghe avatar Apr 23 '21 21:04 chaoyanghe

Same issue here! I have also tried ResNet56 on CIFAR-10 with the given hyper-parameters, but only got 41% test accuracy with the Adam optimizer and 20% test accuracy with the SGD optimizer.

Actually, I was running the distributed version. Here is my command:

sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" adam MPI grpc_ipconfig_test.csv 1
sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" sgd MPI grpc_ipconfig_test.csv 1

hangxu0304 avatar Apr 24 '21 07:04 hangxu0304

@AbdulMoqeet @hangxu0304 Hi, could you try again with a smaller number of local epochs, e.g. E=1? A large number of local epochs usually makes training harder to converge.
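
For example, the standalone command from the first post, rerun with only the epoch count lowered (all other flags unchanged):

!python main_fedavg.py --model mobilenet --dataset cifar10 --data_dir ./../../../data/cifar10 --partition_method hetero --comm_round 100 --epochs 1 --batch_size 64 --lr 0.001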

wizard1203 avatar Apr 25 '21 07:04 wizard1203

@AbdulMoqeet @hangxu0304 Hi, could you try again with a smaller number of local epochs, e.g. E=1? A large number of local epochs usually makes training harder to converge.

Yes, I have also tested epochs=1 for ResNet56 on CIFAR-10. A smaller number of local epochs (E=1) indeed gives better accuracy (57% after 100 rounds), but it is still far from the benchmark result (87% after 100 rounds). You can check the details in my wandb report.

hangxu0304 avatar Apr 25 '21 08:04 hangxu0304

I tried the code from an earlier commit, and the benchmark accuracy can be reproduced. I guess there might be some inconsistency between the earlier and latest commits. @chaoyanghe

hangxu0304 avatar Apr 25 '21 12:04 hangxu0304

@hangxu0304 Could you please share the hyperparameters or a wandb report? The default hyperparameters (client numbers) differ between the two scripts, and there is an additional parameter (# of local points) in the earlier commit.

AbdulMoqeet avatar Apr 26 '21 01:04 AbdulMoqeet

@hangxu0304 I see. Could you help figure out the difference?

chaoyanghe avatar Apr 26 '21 04:04 chaoyanghe

@hangxu0304 Could you please share the hyperparameters or a wandb report? The default hyperparameters (client numbers) differ between the two scripts, and there is an additional parameter (# of local points) in the earlier commit.

I was running the distributed version. First, you need to check out this commit. Then:

cd FedML/fedml_experiments/distributed/fedavg

sh run_fedavg_distributed_pytorch.sh 4 3 resnet56 hetero 2000 1 64 0.001 cifar10 ./../../../data/cifar10
/main_fedavg.py --gpu_server_num 4 --gpu_num_per_server 3 --model resnet56 --dataset cifar10 --data_dir ./../../../data/cifar10 --partition_method hetero --client_number 11 --comm_round 2000 --epochs 1 --batch_size 64 --lr 0.001

Other parameters (e.g., local points) are kept at their defaults.

The result is shown on the right side in this report.

@chaoyanghe I did some checking. The optimizer, model, and datasets are the same, so it might be something else.

hangxu0304 avatar Apr 26 '21 07:04 hangxu0304

Great! Thanks for sharing. However, the report is no longer available.

I've also got the CIFAR-10 results with epoch = 1.

Now I'm trying to reproduce CIFAR-100.

AbdulMoqeet avatar Apr 27 '21 03:04 AbdulMoqeet

@hangxu0304 @AbdulMoqeet Hi, I found a possible reason for this bug. Please compare these lines of code: Original version: https://github.com/FedML-AI/FedML/blob/50d8a45d27675343a7b05a9b31279f6764d3f2ad/fedml_api/standalone/fedavg/fedavg_trainer.py#L45

Current version: https://github.com/FedML-AI/FedML/blob/8ccc24cf2c01b868988f5d5bd65f1666cf5526bc/fedml_api/standalone/fedavg/fedavg_api.py#L64

In the original version, the global model is deep-copied and then loaded into the clients. However, in the current version, each local client just loads the global model (without a deepcopy), so local training in every client may update the global model in place.

I'm not completely sure the bug is caused by this. Could you please change the current code (make a deepcopy of the global model) and see if the result is correct?
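
A minimal sketch of the suspected aliasing issue and the suggested deepcopy fix (illustrative only, not the exact FedML code):

    # Illustrative only -- not the exact FedML code.
    import copy
    import torch

    model = torch.nn.Linear(4, 2)
    w_global = model.state_dict()                      # shares storage with the model's parameters

    before = w_global["weight"].clone()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = model(torch.randn(8, 4)).pow(2).mean()      # stand-in for one local training step
    loss.backward()
    opt.step()
    print(torch.equal(before, w_global["weight"]))     # False: the "global" weights were mutated in place

    # The suggested fix: give each client its own copy of the global weights.
    w_for_client = copy.deepcopy(w_global)

Because state_dict() returns tensors that share storage with the model's parameters, without the deepcopy the next client would start from the previous client's locally trained weights instead of the true global weights.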

wizard1203 avatar Apr 29 '21 08:04 wizard1203

This might be true for the standalone version, but in the distributed version each client only needs to update the global model and then upload it to the server. My previous results showed that the distributed version also has this low-accuracy issue.
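
A conceptual sketch of why the deepcopy issue should not matter in the distributed setting (not FedML's actual classes or message API; each worker deserializes the broadcast weights into its own replica, so in-place updates cannot touch the server's copy):

    # Conceptual sketch only -- not FedML's actual code.
    import torch

    def client_round(received_state_dict, local_model, train_fn):
        local_model.load_state_dict(received_state_dict)  # copy the broadcast weights into the local replica
        train_fn(local_model)                              # local training updates the replica in place
        return local_model.cpu().state_dict()              # send the updated weights back to the server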

hangxu0304 avatar Apr 29 '21 11:04 hangxu0304

This might be true for the standalone version, but in the distributed version each client only needs to update the global model and then upload it to the server. My previous results showed that the distributed version also has this low-accuracy issue.

Do you mean you cannot get accuracy similar to the benchmark results of the distributed version, even with the same hyper-parameters?

wizard1203 avatar Apr 29 '21 11:04 wizard1203

@AbdulMoqeet @hangxu0304 Hi, could you try again with a smaller number of local epochs, e.g. E=1? A large number of local epochs usually makes training harder to converge.

Yes, I have also tested epochs=1 for ResNet56 on CIFAR-10. A smaller number of local epochs (E=1) indeed gives better accuracy (57% after 100 rounds), but it is still far from the benchmark result (87% after 100 rounds). You can check the details in my wandb report.

Right. Please check my previous comments.

hangxu0304 avatar Apr 29 '21 11:04 hangxu0304

@hangxu0304 For the distributed implementation, I find these differences:

https://github.com/FedML-AI/FedML/blob/8ccc24cf2c01b868988f5d5bd65f1666cf5526bc/fedml_api/standalone/fedavg/my_model_trainer_classification.py#L44

https://github.com/FedML-AI/FedML/blob/50d8a45d27675343a7b05a9b31279f6764d3f2ad/fedml_api/distributed/fedavg/FedAVGTrainer.py#L29

In the original version, there is no gradient clipping. However, in the current version, there is. This could be one possible reason; everything else seems to be the same.
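
A generic sketch of where the clipping call sits in the local training step (the exact call and threshold in FedML may differ; max_norm=1.0 here is an assumption for illustration):

    # Sketch only -- not the exact FedML trainer code.
    import torch

    def local_step(model, optimizer, criterion, x, y, clip=True):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        if clip:  # present in the current commit, absent in the old one
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        return loss.item()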

wizard1203 avatar Apr 29 '21 11:04 wizard1203

I have run some new experiments, and the results verify that the lack of a deepcopy of the global model in the standalone version does indeed introduce a bug. But I cannot merge my code into this version now because I'm waiting for other experiment results for some new papers. Maybe you can fix it yourselves for current usage @chaoyanghe @AbdulMoqeet.

wizard1203 avatar Apr 29 '21 12:04 wizard1203

@AbdulMoqeet @hangxu0304 @wizard1203 Hi All, what's the final conclusion?

chaoyanghe avatar May 03 '21 16:05 chaoyanghe

Great! Thanks for sharing. However, the report is no longer available.

I've also got the CIFAR-10 results with epoch = 1.

Now I'm trying to reproduce CIFAR-100.

Hi @AbdulMoqeet, have you reproduced the result of CIFAR-100 with local epoch = 1?

chaoyanghe avatar May 03 '21 19:05 chaoyanghe

Same issue here! I have also tried ResNet56 on CIFAR-10 with the given hyper-parameters, but only got 41% test accuracy with the Adam optimizer and 20% test accuracy with the SGD optimizer.

Actually, I was running the distributed version. Here is my command:

sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" adam MPI grpc_ipconfig_test.csv 1
sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" sgd MPI grpc_ipconfig_test.csv 1

Hi @hangxu0304, you got low accuracy because you set ci=1 (the last hyper-parameter, as your script shows), which is used for the sanity check. When ci=1, we skip a lot of repeated computation.

    def test_on_server_for_all_clients(self, round_idx):
        if self.trainer.test_on_the_server(self.train_data_local_dict, self.test_data_local_dict, self.device, self.args):
            return

        if round_idx % self.args.frequency_of_the_test == 0 or round_idx == self.args.comm_round - 1:
            logging.info("################test_on_server_for_all_clients : {}".format(round_idx))
            train_num_samples = []
            train_tot_corrects = []
            train_losses = []
            for client_idx in range(self.args.client_num_in_total):
                # train data
                metrics = self.trainer.test(self.train_data_local_dict[client_idx], self.device, self.args)
                train_tot_correct, train_num_sample, train_loss = metrics['test_correct'], metrics['test_total'], metrics['test_loss']
                train_tot_corrects.append(copy.deepcopy(train_tot_correct))
                train_num_samples.append(copy.deepcopy(train_num_sample))
                train_losses.append(copy.deepcopy(train_loss))

                """
                Note: CI environment is CPU-based computing. 
                The training speed for RNN training is to slow in this setting, so we only test a client to make sure there is no programming error.
                """
                **if self.args.ci == 1:**
                    break
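
For example, rerunning the earlier distributed command with the last argument changed from 1 to 0 should run the full server-side evaluation (assuming that last positional argument maps to ci, as described above):

sh run_fedavg_distributed_pytorch.sh 10 10 resnet56 hetero 100 20 64 0.001 cifar10 "./../../../data/cifar10" adam MPI grpc_ipconfig_test.csv 0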

chaoyanghe avatar May 03 '21 19:05 chaoyanghe

@AbdulMoqeet @hangxu0304 @wizard1203 Hi All, what's the final conclusion?

I'm trying to check whether gradient clipping is the cause. The experiment is still running. Let's see.

hangxu0304 avatar May 04 '21 04:05 hangxu0304

Great! Thanks for sharing. However, the report is no longer available. I've also got the CIFAR-10 results with epoch = 1. Now I'm trying to reproduce CIFAR-100.

Hi @AbdulMoqeet, have you reproduced the result of CIFAR-100 with local epoch = 1?

@chaoyanghe Due to limited resources, I was running on Colab. I used the following command with 20 epochs.

./main_fedavg.py --gpu 0 --dataset cifar100 --data_dir ./../../../data/cifar100 --model mobilenet --partition_method hetero --client_number 10 --comm_round 200 --epochs 20 --batch-size 64 --lr 0.001

It achieves reasonable accuracy, considering that the run crashed partway through.

Here is the report: https://wandb.ai/amuqeet/fedml/runs/htvgqh83

AbdulMoqeet avatar May 04 '21 05:05 AbdulMoqeet

@AbdulMoqeet @hangxu0304 @wizard1203 Hi All, what's the final conclusion?

I'm trying to check whether gradient clipping is the cause. The experiment is still running. Let's see.

Have you changed CI to 0? @hangxu0304

chaoyanghe avatar May 04 '21 05:05 chaoyanghe

Yes.

hangxu0304 avatar May 04 '21 05:05 hangxu0304

@chaoyanghe I think CI only affects the training accuracy, not the test accuracy, right?

hangxu0304 avatar May 04 '21 05:05 hangxu0304

@chaoyanghe I think CI only affects the training accuracy, not the test accuracy, right?

Both. Let's wait for your result.

chaoyanghe avatar May 04 '21 05:05 chaoyanghe

[image] @chaoyanghe The plot on the left is from the latest commit without gradient clipping, and the one on the right is from the old commit. Both use the same hyper-parameter settings as the FedML ResNet56 CIFAR-10 benchmark. You can check the details in my wandb.

hangxu0304 avatar May 04 '21 21:05 hangxu0304

[image] @chaoyanghe The plot on the left is from the latest commit without gradient clipping, and the one on the right is from the old commit. Both use the same hyper-parameter settings as the FedML ResNet56 CIFAR-10 benchmark. You can check the details in my wandb.

I cannot access your report.

chaoyanghe avatar May 04 '21 22:05 chaoyanghe

@chaoyanghe That's strange. I already shared this report, and I can view it without logging in. Anyway, you can check the results I posted above. I'm wondering if you can obtain good accuracy by running the latest code on your cluster. Also, the result from running the old commit doesn't show the same convergence as your benchmark result.

hangxu0304 avatar May 05 '21 12:05 hangxu0304

The detailed report cannot be accessed since the project is set to private, but only these two graphs (reports) are accessible.

AbdulMoqeet avatar May 06 '21 00:05 AbdulMoqeet

@hangxu0304 I reproduced the same result with the latest code in both the standalone and distributed versions. I am not sure what the difference is. If you could make your reports public, it would be very helpful. Thanks in advance.

chaoyanghe avatar May 06 '21 04:05 chaoyanghe