CommEfficient
Question about the final test_acc in the CIFAR10 experiment.
Hi, I tried to reproduce the experimental results in the paper. I am using the following command:

```bash
python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode fedavg --num_clients 200 --num_workers 10 --num_rows 1 --num_cols 50000 --error_type none --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 1.0 --num_devices=1 --lr_scale 0 --local_batch_size -1 --share_ps_gpu
```
The accuracy does not seem correct; could you help me figure out the problem? The logs are:
```
MY PID: 3424
Namespace(do_test=False, mode='fedavg', robustagg='none', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_dp_finetune=False, do_checkpoint=False, checkpoint_path='/data/nvme/ashwinee/CommEfficient/CommEfficient/checkpoints/', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1.0, error_type='none', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=200, num_workers=10, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=1.0, personality_permutations=1, eval_before_start=False, checkpoint_epoch=-1, finetune_epoch=12, do_malicious=False, mal_targets=1, mal_boost=1.0, mal_epoch=0, mal_type=None, do_mal_forecast=False, do_pgd=False, do_data_ownership=False, mal_num_clients=-1, layer_freeze_idx=0, mal_layer_freeze_idx=0, mal_num_epochs=1, backdoor=-1, do_perfect_knowledge=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0, client_lr=0.1)
50000 125
Using BatchNorm: False
grad size 6568640
Finished initializing in 1.91 seconds
epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  total_time
1      0.0800  25.2243     2.3028      0.1038     2.3012     0.1405    30.4418
2      0.1600  23.8377     2.3025      0.1017     2.2936     0.1460    57.5426
3      0.2400  24.0562     2.2886      0.1157     2.2449     0.1507    84.8176
4      0.3200  23.0985     2.2479      0.1461     2.1887     0.1535    111.1938
5      0.4000  22.0071     2.2901      0.0944     2.2941     0.0930    136.4487
6      0.3789  21.9321     2.3150      0.1301     3.3015     0.0997    161.6546
7      0.3579  21.9460     2.3782      0.1078     2.2771     0.1324    186.8818
8      0.3368  21.8156     2.2793      0.1264     2.2281     0.1360    211.9292
9      0.3158  21.6892     2.2410      0.1775     2.2307     0.1417    236.9210
10     0.2947  21.9432     2.2989      0.1024     2.2831     0.1175    262.0983
11     0.2737  21.9095     2.2511      0.1332     2.1657     0.1901    287.2876
12     0.2526  27.3621     2.1729      0.1771     2.1231     0.1734    321.2075
13     0.2316  37.6449     2.1274      0.1580     2.1067     0.2008    365.1934
14     0.2105  32.6825     2.3116      0.1308     2.0721     0.2026    401.1535
15     0.1895  22.3018     2.1435      0.1707     2.0014     0.2332    426.7760
16     0.1684  30.7159     2.0729      0.1982     2.1173     0.2312    460.7642
17     0.1474  22.4368     2.1110      0.2006     2.0027     0.2580    489.7420
18     0.1263  39.1600     2.0538      0.1897     2.0412     0.2377    535.3520
19     0.1053  38.9138     2.0614      0.2156     2.0193     0.2655    580.7346
20     0.0842  21.9821     1.9763      0.2441     2.0301     0.2679    605.9769
21     0.0632  32.8850     1.9892      0.2655     2.0524     0.2711    645.6084
22     0.0421  38.2427     1.9478      0.2627     1.8612     0.3094    690.2626
23     0.0211  38.3396     1.9010      0.2778     1.8869     0.2993    735.1110
HACK STEP
WARNING: LR is 0
WARNING: LR is 0
24     0.0000  33.1543     1.9016      0.2929     1.8394     0.3032    771.5566
done training
```
Hi, I'm not sure exactly which experimental results you're trying to reproduce. I don't think we had any results in the paper with mode=fedavg, num_clients=200, num_workers=10, local_batch_size=-1. Could you tell me which results you're trying to reproduce so I can tell you which hparams to use? Thanks.
Thank you very much for your response. I am trying to reproduce the results:
Could you please tell me what hparams to use? I genuinely appreciate your assistance.
```bash
python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 10000 --num_workers 100 --num_rows 1 --num_cols 50000 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 10.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu
```
Thanks very much! I will try it now.
I used the hparams you gave, but I got these results:

```
MY PID: 9007
Namespace(do_test=False, mode='sketch', use_tensorboard=False, seed=21, model='ResNet9', do_finetune=False, do_checkpoint=False, checkpoint_path='./checkpoint', finetune_path='./finetune', finetuned_from=None, num_results_train=2, num_results_val=2, dataset_name='CIFAR10', dataset_dir='./dataset', do_batchnorm=False, nan_threshold=999, k=50000, num_cols=50000, num_rows=1, num_blocks=20, do_topk_down=False, local_momentum=0.0, virtual_momentum=0.9, weight_decay=0.0005, num_epochs=24, num_fedavg_epochs=1, fedavg_batch_size=-1, fedavg_lr_decay=1, error_type='virtual', lr_scale=0.4, pivot_epoch=5, port=5315, num_clients=10000, num_workers=100, device='cuda', num_devices=1, share_ps_gpu=True, do_iid=False, train_dataloader_workers=0, val_dataloader_workers=0, model_checkpoint='gpt2', num_candidates=2, max_history=2, local_batch_size=-1, valid_batch_size=8, microbatch_size=-1, lm_coef=1.0, mc_coef=1.0, max_grad_norm=10.0, personality_permutations=1, eval_before_start=False, do_dp=False, dp_mode='worker', l2_norm_clip=1.0, noise_multiplier=0.0)
50000 13
Using BatchNorm: False
Finished initializing in 1.40 seconds
HACK STEP
WARNING: LR is 0
/home/user/CommEfficient-master/CommEfficient/utils.py:258: UserWarning: This overload of add_ is deprecated:
    add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
    add_(Tensor other, *, Number alpha) (Triggered internally at /opt/conda/conda-bld/pytorch_1656352645774/work/torch/csrc/utils/python_arg_parser.cpp:1174.)
  grad_vec.add_(args.weight_decay / args.num_workers, weights)
WARNING: LR is 0
epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  down (MiB)  up (MiB)  total_time
1      0.0800  130.7500    2.2992      0.1194     2.2792     0.1104    64929       1907      134.3366
2      0.1600  128.0034    2.1600      0.1880     1.9858     0.2506    117816      1907      264.4537
3      0.2400  119.8669    2.0835      0.2073     2.4260     0.1237    125085      1907      386.4207
4      0.3200  117.9984    2.3528      0.1367     2.3253     0.1243    131643      1907      506.5437
5      0.4000  117.7414    2.3391      0.1059     2.3031     0.1077    144526      1907      626.3972
```

The accuracy still does not seem correct.
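As an aside, the `UserWarning` about `add_` in that log is harmless but easy to silence: PyTorch deprecated the positional `add_(Number alpha, Tensor other)` overload in favor of the keyword form. A minimal sketch of the change (placeholder values below, not the repo's actual ones):

```python
import torch

# Placeholder stand-ins for args.weight_decay / args.num_workers and the tensors.
weight_decay, num_workers = 0.0005, 100
grad_vec = torch.zeros(4)
weights = torch.ones(4)

# Deprecated positional form (what utils.py:258 does):
#   grad_vec.add_(weight_decay / num_workers, weights)
# Equivalent keyword form accepted by current PyTorch:
grad_vec.add_(weights, alpha=weight_decay / num_workers)
```

The two forms compute the same update; only the argument order and keyword change.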
Can you try changing `num_cols` to 500000?
I will try, thanks!
The result shows some improvement:

```
epoch  lr      train_time  train_loss  train_acc  test_loss  test_acc  down (MiB)  up (MiB)  total_time
1      0.0800  127.2544    2.2991      0.1230     2.2790     0.1240    58051       19073     130.7855
2      0.1600  127.1086    2.1490      0.1968     1.9640     0.2827    102534      19073     260.0100
3      0.2400  118.3132    1.9173      0.2945     1.7883     0.3346    118299      19073     380.4323
4      0.3200  116.6946    1.7728      0.3523     1.7535     0.3253    118763      19073     499.2499
5      0.4000  116.9935    1.7242      0.3734     1.6723     0.4013    120322      19073     618.3396
6      0.3789  117.6542    1.6713      0.4001     1.4073     0.4968    121114      19073     738.1218
7      0.3579  119.8809    1.4616      0.4794     1.2784     0.5414    121111      19073     860.1666
8      0.3368  120.2340    1.2608      0.5548     1.1843     0.6088    117029      19073     982.5226
9      0.3158  119.8844    1.1325      0.6084     0.9695     0.6726    115286      19073     1104.5195
10     0.2947  118.9704    1.0010      0.6570     0.9692     0.6685    116108      19073     1225.6146
11     0.2737  117.5381    0.9638      0.6698     0.8361     0.7195    116691      19073     1345.2715
12     0.2526  119.0444    0.8628      0.7088     0.8417     0.7188    117465      19073     1466.4860
13     0.2316  121.0632    0.7987      0.7293     0.7648     0.7514    117877      19073     1589.6690
14     0.2105  118.3240    0.7337      0.7524     0.7069     0.7697    119160      19073     1710.1443
15     0.1895  114.6481    0.7083      0.7597     0.7099     0.7623    118396      19073     1826.9046
```
If I want further improvement, should I try `num_rows` -> 5? Thank you very much!
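For background on why `num_cols` and `num_rows` matter: sketch mode compresses each gradient into a `num_rows` x `num_cols` Count Sketch, so more columns mean fewer hash collisions per row, and more rows make the median-of-rows estimate more robust to the collisions that remain. Here is a minimal NumPy illustration of the idea (a sketch only, not the repo's actual implementation; all names here are mine):

```python
import numpy as np

def count_sketch(vec, num_rows, num_cols, seed=0):
    """Compress a dense vector into a num_rows x num_cols Count Sketch."""
    rng = np.random.default_rng(seed)
    d = len(vec)
    buckets = rng.integers(0, num_cols, size=(num_rows, d))  # per-row hash: coord -> column
    signs = rng.choice([-1.0, 1.0], size=(num_rows, d))      # per-row random sign
    table = np.zeros((num_rows, num_cols))
    for r in range(num_rows):
        # Accumulate signed values; colliding coordinates share a counter.
        np.add.at(table[r], buckets[r], signs[r] * vec)
    return table, buckets, signs

def estimate(table, buckets, signs, i):
    """Recover coordinate i as the median of its counters across rows."""
    return np.median([signs[r, i] * table[r, buckets[r, i]]
                      for r in range(table.shape[0])])

# A gradient-like vector: one heavy coordinate among a few small ones.
vec = np.zeros(1000)
vec[7] = 5.0
vec[100:110] = 1.0

table, buckets, signs = count_sketch(vec, num_rows=5, num_cols=200)
est = estimate(table, buckets, signs, 7)
```

With more columns, the chance that coordinate 7 shares a bucket with another nonzero shrinks; with more rows, an occasional collision in one row is voted down by the median.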
Try this:

```bash
bash submit_cifar.sh CIFAR10 ResNet9 fedavg 1000 10 -1 none 24 5 0.2 0 0.9 1 50 0 0 50026 21 1 1 1 0 A 0 0 0 0 1 -1 worker --malicious --iid
```
Hello, sorry to bother you. I also used the parameters you gave, but I got the following error. How can I solve it?
It seems the error occurs because the label we are passing in is just an integer denoting the class, and for some reason the CUDA kernel doesn't work with ints? That's pretty weird. What are your torch and CUDA versions? Can you print out the types of the inputs in the backward pass? Can you try just casting the label to a torch data type?
My CUDA version is 12.0 and my torch version is 2.1.1
Adding a line of code solved the problem: