
About training the model on CIFAR for the first 200 rounds

HazardFY opened this issue 3 years ago • 4 comments

I ran experiments on the CIFAR-10 dataset and found that if the pretrained model is not loaded and the model is instead randomly initialized and trained with the default parameters, it barely converges. How did the authors actually set the parameters during training? Could this be because the global learning rate is too small when the global model is updated? I see that the paper sets it to 0.1 and keeps it fixed.
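
For reference, here is a minimal sketch of the global update I have in mind (a FedAvg-style round; `global_step`, `global_lr`, and the rest are illustrative names, not this repo's actual API):

```python
import torch

def global_step(global_model, local_models, global_lr=0.1):
    # One FedAvg-style round: the global model moves by global_lr times
    # the averaged client update, so a small, fixed global_lr directly
    # limits how fast the global model can converge.
    global_state = global_model.state_dict()
    for name, g_param in global_state.items():
        if not torch.is_floating_point(g_param):
            continue  # skip integer buffers such as num_batches_tracked
        avg_update = torch.stack(
            [m.state_dict()[name] - g_param for m in local_models]
        ).mean(dim=0)
        g_param.add_(global_lr * avg_update)  # updates the model in place
```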

HazardFY avatar Aug 05 '21 11:08 HazardFY

I am having a similar issue despite using the default parameters. The global model only converges when there is 1 participant (i.e. no FL). If I start from a pretrained model, the model mysteriously unlearns everything after a few FL iterations. I would very much appreciate some help with this.

ehsan886 avatar Jun 07 '22 20:06 ehsan886

Thanks @HazardFY and @ehsan886 for reporting the issue. The pretrained model is meant to provide a good initialization; feel free to try different learning rates to obtain a good pretrained model.
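
For example, a minimal centralized pretraining sketch along those lines (the hyperparameters here are assumptions to experiment with, not our exact recipe, and it uses torchvision's ResNet-18 rather than the repo's model for brevity):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

device = "cuda" if torch.cuda.is_available() else "cpu"
transform = T.Compose([
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    "./data", train=True, download=True, transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                      weight_decay=5e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(50):  # vary the lr / epoch count until it trains well
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        criterion(model(x), y).backward()
        opt.step()

torch.save(model.state_dict(), "pretrained_cifar10.pt")
```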

@ehsan886 What does it mean by "model mysteriously unlearns everything"?

AlphaPav avatar Jun 07 '22 21:06 AlphaPav

I have noticed that the global model's test accuracy gradually drops instead of increasing. That might be my fault; I need to do some more testing to be 100% sure. I am using pytorch==1.10.0, and I have been facing this issue when running the default code:

DBA/helper.py", line 256, in average_shrink_models
    data.add_(update_per_layer)
RuntimeError: result type Float can't be cast to the desired output type Long

I did some forced type casting to avert this error, which may itself have caused the problem. Could you please tell me which pytorch and torchvision versions are stable for this code?

[Just for the record, the mentioned error occurs for the following layers: bn*.num_batches_tracked and layer*.bn*.num_batches_tracked.]
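
The cast I applied looks roughly like this (a sketch of the idea around the failing `data.add_(update_per_layer)` line in `average_shrink_models`, not my exact patch):

```python
import torch

# data: an entry from the global model's state_dict
# update_per_layer: the aggregated (float) update for that entry
if torch.is_floating_point(data):
    data.add_(update_per_layer)
else:
    # bn*.num_batches_tracked buffers are Long tensors; cast the float
    # update to the buffer's dtype before the in-place add (or simply
    # skip these counters altogether).
    data.add_(update_per_layer.to(data.dtype))
```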

UPDATE: I solved the problem. Type casting fixes the error for both FedAvg and RFA. For foolsgold, instead of using 'client_grad' to update the global model, I used 'weighted_average_oracle', just like the other aggregation methods. In any case, this was a very unusual problem to deal with.

I believe you can reproduce my problem with pytorch==1.10.0 and aggregation_methods=foolsgold: started without a pretrained model, the model doesn't converge, and started with one, its accuracy gradually falls. My hypothesis is that updating only the model parameters that have gradients is not enough for ResNet-18 in pytorch==1.10.0; it is necessary to subtract the global model from the local model to get the whole update.
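
A rough sketch of what I mean by taking the whole update (illustrative names, not DBA's actual functions):

```python
# Subtracting the global weights from the trained local weights yields
# the full client update, buffers included, instead of only the
# parameters that received gradients (as client_grad does).
def full_client_update(local_model, global_model):
    global_state = global_model.state_dict()
    return {name: local_tensor - global_state[name]
            for name, local_tensor in local_model.state_dict().items()}
```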

ehsan886 avatar Jun 08 '22 00:06 ehsan886

Hello, I have run into a problem similar to yours: on the CIFAR dataset with a ResNet-18 network, if the network is pretrained on the full dataset in advance, its accuracy drops once it is placed in the federated learning setting; if the model is not pretrained, it does not converge. My current workaround is to switch the model to AlexNet. May I ask how exactly you solved this problem? I hope you can find time to answer.

pythonloveing avatar Aug 15 '22 08:08 pythonloveing