CommEfficient
Question about the final test_acc in the CIFAR10 experiment.
Hi, I tried to reproduce the experimental results in the paper. I am using the following command:

```bash
python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0 --local_batch_size -1 --share_ps_gpu
```
The accuracy does not seem correct; could you help me figure out the problem? The logs are:
```
MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  total_time
1      0.0800  25.2243     2.3028      0.1038     2.3012     0.1405    30.4418
2      0.1600  23.8377     2.3025      0.1017     2.2936     0.1460    57.5426
3      0.2400  24.0562     2.2886      0.1157     2.2449     0.1507    84.8176
4      0.3200  23.0985     2.2479      0.1461     2.1887     0.1535    111.1938
5      0.4000  22.0071     2.2901      0.0944     2.2941     0.0930    136.4487
6      0.3789  21.9321     2.3150      0.1301     3.3015     0.0997    161.6546
7      0.3579  21.9460     2.3782      0.1078     2.2771     0.1324    186.8818
8      0.3368  21.8156     2.2793      0.1264     2.2281     0.1360    211.9292
9      0.3158  21.6892     2.2410      0.1775     2.2307     0.1417    236.9210
10     0.2947  21.9432     2.2989      0.1024     2.2831     0.1175    262.0983
11     0.2737  21.9095     2.2511      0.1332     2.1657     0.1901    287.2876
12     0.2526  27.3621     2.1729      0.1771     2.1231     0.1734    321.2075
13     0.2316  37.6449     2.1274      0.1580     2.1067     0.2008    365.1934
14     0.2105  32.6825     2.3116      0.1308     2.0721     0.2026    401.1535
15     0.1895  22.3018     2.1435      0.1707     2.0014     0.2332    426.7760
16     0.1684  30.7159     2.0729      0.1982     2.1173     0.2312    460.7642
17     0.1474  22.4368     2.1110      0.2006     2.0027     0.2580    489.7420
18     0.1263  39.1600     2.0538      0.1897     2.0412     0.2377    535.3520
19     0.1053  38.9138     2.0614      0.2156     2.0193     0.2655    580.7346
20     0.0842  21.9821     1.9763      0.2441     2.0301     0.2679    605.9769
21     0.0632  32.8850     1.9892      0.2655     2.0524     0.2711    645.6084
22     0.0421  38.2427     1.9478      0.2627     1.8612     0.3094    690.2626
23     0.0211  38.3396     1.9010      0.2778     1.8869     0.2993    735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24     0.0000  33.1543     1.9016      0.2929     1.8394     0.3032    771.5566
done training
```
Hi, I'm not sure exactly which experimental results you're trying to reproduce. I don't think we had any results in the paper with mode=fedavg, num_clients=200, num_workers=10, local_batch_size=-1. Could you tell me which results you're trying to reproduce so I can tell you which hparams to use? Thanks.
Thank you very much for your response. I am trying to reproduce the results:
Could you please tell me what hparams to use? I genuinely appreciate your assistance.
```bash
python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 10000 --num_workers 100 --num_rows 1 --num_cols 50000 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 10.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu
```
Thanks very much! I will try it now.
I used the hparams you gave, but I got these results:

```
MY PID: 9007
Namespace(do_test=False, mode='sketch', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_checkpoint=False, checkpoint_path='./checkpoint', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1, error_type='virtual', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=10000, num_workers=100, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=10.0, personality_permutations=1, eval_before_start=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0)
50000 13
Using BatchNorm: False
Finished initializing in 1.40 seconds
HACK STEP
WARNING: LR is 0
/home/user/CommEfficient-master/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1656352645774/work/torch/csrc/utils/python_arg_parser.cpp:1174.)
  grad_vec.add_(args.weight_decay / args.num_workers, weights)
WARNING: LR is 0
epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  down (MiB)  up (MiB)  total_time
1      0.0800  130.7500    2.2992      0.1194     2.2792     0.1104    64929       1907      134.3366
2      0.1600  128.0034    2.1600      0.1880     1.9858     0.2506    117816      1907      264.4537
3      0.2400  119.8669    2.0835      0.2073     2.4260     0.1237    125085      1907      386.4207
4      0.3200  117.9984    2.3528      0.1367     2.3253     0.1243    131643      1907      506.5437
5      0.4000  117.7414    2.3391      0.1059     2.3031     0.1077    144526      1907      626.3972
```

The accuracy still does not seem correct.
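As an aside, the `UserWarning` about `add_` in that log is harmless but easy to silence: PyTorch deprecated the positional `add_(Number alpha, Tensor other)` overload in favor of the keyword form. A minimal sketch of the change (placeholder values below, not the repo's actual ones):

```python
import torch

# Placeholder stand-ins for args.weight_decay / args.num_workers and the tensors.
weight_decay, num_workers = 0.0005, 100
grad_vec = torch.zeros(4)
weights = torch.ones(4)

# Deprecated positional form (what utils.py:258 does):
#   grad_vec.add_(weight_decay / num_workers, weights)
# Equivalent keyword form accepted by current PyTorch:
grad_vec.add_(weights, alpha=weight_decay / num_workers)
```

The two forms compute the same update; only the argument order and keyword change.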
Can you try changing `num_cols` to 500000?
I will try, thanks!
The result shows some improvement:

```
epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  down (MiB)  up (MiB)  total_time
1      0.0800  127.2544    2.2991      0.1230     2.2790     0.1240    58051       19073     130.7855
2      0.1600  127.1086    2.1490      0.1968     1.9640     0.2827    102534      19073     260.0100
3      0.2400  118.3132    1.9173      0.2945     1.7883     0.3346    118299      19073     380.4323
4      0.3200  116.6946    1.7728      0.3523     1.7535     0.3253    118763      19073     499.2499
5      0.4000  116.9935    1.7242      0.3734     1.6723     0.4013    120322      19073     618.3396
6      0.3789  117.6542    1.6713      0.4001     1.4073     0.4968    121114      19073     738.1218
7      0.3579  119.8809    1.4616      0.4794     1.2784     0.5414    121111      19073     860.1666
8      0.3368  120.2340    1.2608      0.5548     1.1843     0.6088    117029      19073     982.5226
9      0.3158  119.8844    1.1325      0.6084     0.9695     0.6726    115286      19073     1104.5195
10     0.2947  118.9704    1.0010      0.6570     0.9692     0.6685    116108      19073     1225.6146
11     0.2737  117.5381    0.9638      0.6698     0.8361     0.7195    116691      19073     1345.2715
12     0.2526  119.0444    0.8628      0.7088     0.8417     0.7188    117465      19073     1466.4860
13     0.2316  121.0632    0.7987      0.7293     0.7648     0.7514    117877      19073     1589.6690
14     0.2105  118.3240    0.7337      0.7524     0.7069     0.7697    119160      19073     1710.1443
15     0.1895  114.6481    0.7083      0.7597     0.7099     0.7623    118396      19073     1826.9046
```
If I want further improvement, should I try `num_rows` -> 5? Thank you very much!
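For background on why `num_cols` and `num_rows` matter: sketch mode compresses each gradient into a `num_rows` x `num_cols` Count Sketch, so more columns mean fewer hash collisions per row, and more rows make the median-of-rows estimate more robust to the collisions that remain. Here is a minimal NumPy illustration of the idea (a sketch only, not the repo's actual implementation; all names here are mine):

```python
import numpy as np

def count_sketch(vec, num_rows, num_cols, seed=0):
    """Compress a dense vector into a num_rows x num_cols Count Sketch."""
    rng = np.random.default_rng(seed)
    d = len(vec)
    buckets = rng.integers(0, num_cols, size=(num_rows, d))  # per-row hash: coord -> column
    signs = rng.choice([-1.0, 1.0], size=(num_rows, d))      # per-row random sign
    table = np.zeros((num_rows, num_cols))
    for r in range(num_rows):
        # Accumulate signed values; colliding coordinates share a counter.
        np.add.at(table[r], buckets[r], signs[r] * vec)
    return table, buckets, signs

def estimate(table, buckets, signs, i):
    """Recover coordinate i as the median of its counters across rows."""
    return np.median([signs[r, i] * table[r, buckets[r, i]]
                      for r in range(table.shape[0])])

# A gradient-like vector: one heavy coordinate among a few small ones.
vec = np.zeros(1000)
vec[7] = 5.0
vec[100:110] = 1.0

table, buckets, signs = count_sketch(vec, num_rows=5, num_cols=200)
est = estimate(table, buckets, signs, 7)
```

With more columns, the chance that coordinate 7 shares a bucket with another nonzero shrinks; with more rows, an occasional collision in one row is voted down by the median.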
Try this:

```bash
bash submit_cifar.sh CIFAR10 ResNet9 fedavg 1000 10 -1 none 24 5 0.2 0 0.9 1 50 0 0 50026 21 1 1 1 0 A 0 0 0 0 1 -1 worker --malicious --iid
```
Hello, sorry to bother you. I also used the parameters you gave, but I got the following error. How can I solve it?
It seems the error occurs because the label we are passing in is just an integer denoting the class, and for some reason the CUDA kernel doesn't work with ints? That's pretty weird. What are your torch and CUDA versions? Can you print out the types of the inputs in the backward pass? Can you try just casting the label to a torch data type?
My CUDA version is 12.0 and my torch version is 2.1.1
Adding a line of code solved the problem: