
Failed to converge when changing num_users and frac

Open jkup64 opened this issue 2 years ago • 1 comment

Description

When I change num_users to 10 and frac to 0.3 with --iid, which means 3 clients are chosen each round, I find that the model first gets better and then gets worse.
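
(For context, this is roughly the sampling that produces the bracketed client IDs in the log below; the exact code in main_fed.py may differ.)

import numpy as np

m = max(int(args.frac * args.num_users), 1)   # 0.3 * 10 -> 3 clients per round
idxs_users = np.random.choice(range(args.num_users), m, replace=False)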

Reproduce

$ python main_fed.py --dataset mnist --model mlp --num_classes 10 --epochs 1000 --lr 0.05 --num_users 10 --shard_per_user 2 --frac 0.3 --local_ep 1 --local_bs 8 --results_save run1 --iid

Output

device: cuda:0
MLP(
  (layer_input): Linear(in_features=784, out_features=512, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.5, inplace=False)
  (layer_hidden1): Linear(in_features=512, out_features=256, bias=True)
  (layer_hidden2): Linear(in_features=256, out_features=256, bias=True)
  (layer_hidden3): Linear(in_features=256, out_features=128, bias=True)
  (layer_out): Linear(in_features=128, out_features=10, bias=True)
  (softmax): Softmax(dim=1)
)
Round 0, lr: 0.050000, [5 6 0]
Round   0, Average loss 2.038, Test loss 1.794, Test accuracy: 67.63
Round 1, lr: 0.050000, [6 4 5]
Round   1, Average loss 1.748, Test loss 1.611, Test accuracy: 85.05
Round 2, lr: 0.050000, [7 9 4]
Round   2, Average loss 1.761, Test loss 1.717, Test accuracy: 74.39
Round 3, lr: 0.050000, [7 4 9]
Round   3, Average loss 1.856, Test loss 1.843, Test accuracy: 61.74
Round 4, lr: 0.050000, [9 2 5]
Round   4, Average loss 1.948, Test loss 1.863, Test accuracy: 59.83
Round 5, lr: 0.050000, [2 6 7]
Round   5, Average loss 2.039, Test loss 1.990, Test accuracy: 47.11
Round 6, lr: 0.050000, [0 7 2]
Round   6, Average loss 2.025, Test loss 1.997, Test accuracy: 46.39
Round 7, lr: 0.050000, [4 3 2]
Round   7, Average loss 2.017, Test loss 2.104, Test accuracy: 35.68
Round 8, lr: 0.050000, [2 9 1]
Round   8, Average loss 2.128, Test loss 2.113, Test accuracy: 34.82
Round 9, lr: 0.050000, [2 7 5]
Round   9, Average loss 2.127, Test loss 2.190, Test accuracy: 27.09
Round 10, lr: 0.050000, [1 9 7]
Round  10, Average loss 2.194, Test loss 2.239, Test accuracy: 22.21
Round 11, lr: 0.050000, [0 2 3]
Round  11, Average loss 2.236, Test loss 2.186, Test accuracy: 27.53
Round 12, lr: 0.050000, [3 9 5]
Round  12, Average loss 2.188, Test loss 2.108, Test accuracy: 35.29
Round 13, lr: 0.050000, [3 6 5]
Round  13, Average loss 2.172, Test loss 2.237, Test accuracy: 22.45
Round 14, lr: 0.050000, [9 8 4]
Round  14, Average loss 2.258, Test loss 2.175, Test accuracy: 28.61
Round 15, lr: 0.050000, [2 7 1]
Round  15, Average loss 2.178, Test loss 2.161, Test accuracy: 29.99
Round 16, lr: 0.050000, [9 6 4]
Round  16, Average loss 2.192, Test loss 2.280, Test accuracy: 18.10
Round 17, lr: 0.050000, [2 4 0]
Round  17, Average loss 2.284, Test loss 2.125, Test accuracy: 33.60
Round 18, lr: 0.050000, [4 1 0]
Round  18, Average loss 2.226, Test loss 2.352, Test accuracy: 10.94
Round 19, lr: 0.050000, [6 0 7]
Round  19, Average loss 2.355, Test loss 2.352, Test accuracy: 10.94
Round 20, lr: 0.050000, [1 8 6]
Round  20, Average loss 2.351, Test loss 2.339, Test accuracy: 12.24
Round 21, lr: 0.050000, [1 2 3]
Round  21, Average loss 2.338, Test loss 2.339, Test accuracy: 12.24
Round 22, lr: 0.050000, [9 3 1]
Round  22, Average loss 2.340, Test loss 2.339, Test accuracy: 12.24
Round 23, lr: 0.050000, [4 2 0]
Round  23, Average loss 2.337, Test loss 2.339, Test accuracy: 12.24
Round 24, lr: 0.050000, [8 1 5]

jkup64 commented on Apr 21 '22 at 07:04

You can solve this problem by setting lr_decay = 0.95 and replacing

w_local, loss = local.train(net=net_local.to(args.device))

with

w_local, loss = local.train(net=net_local.to(args.device), lr=lr)

so that the decayed learning rate is actually passed to each local update. Alternatively, choose a more robust optimizer than plain SGD.
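
For reference, a minimal sketch of how the server loop in main_fed.py could apply this fix. It assumes LocalUpdate.train accepts an lr argument as in the replacement line above; names such as net_glob, dataset_train, and dict_users are illustrative and may differ from the actual code.

import copy
import numpy as np

lr = args.lr
for rnd in range(args.epochs):
    # sample frac * num_users clients per round (0.3 * 10 -> 3 here)
    m = max(int(args.frac * args.num_users), 1)
    idxs_users = np.random.choice(range(args.num_users), m, replace=False)
    for idx in idxs_users:
        local = LocalUpdate(args=args, dataset=dataset_train, idxs=dict_users[idx])
        net_local = copy.deepcopy(net_glob)
        # pass the decayed lr so each round actually trains with the smaller step size
        w_local, loss = local.train(net=net_local.to(args.device), lr=lr)
    # FedAvg aggregation of the returned local weights would happen here
    lr *= 0.95  # lr_decay = 0.95: shrink the step size after every round

For the optimizer swap, inside LocalUpdate.train the SGD construction can be replaced with something like torch.optim.Adam(net.parameters(), lr=1e-3), which tends to be less sensitive to the learning-rate schedule.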

jkup64 commented on Apr 21 '22 at 09:04