Ashwinee Panda
> Hi @kiddyboots216 - the warning is thrown from [here](https://github.com/microsoft/DeepSpeed/blob/78c3b148a8a8b6e60ab77a5c75849961f52b143d/op_builder/builder.py#L341), which is in the op_builder. Can you try, when you install DeepSpeed, running `DS_BUILD_CPU_ADAM=1 pip install deepspeed` so the ops...
Here is an even more minimal example; this works for '1.3b' but will kill the subprocess for '2.7b' (but I can actually train that 2.7b param model, so I would...
OK, after messing around with some other things and reinstalling (there were some other errors with fusedADAM), I ended up just allocating more CPU memory. It seems that the CPUAdamOps...
> Interesting, thanks for the info @kiddyboots216 - could you share how much memory you needed?

I ended up needing 300G of CPU RAM! Which seemed quite high to me....
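
As a rough sanity check on a figure like this, the CPU memory held by an offloaded Adam optimizer can be estimated with simple arithmetic. The sketch below assumes 16 bytes per parameter (fp32 master weights, fp32 gradients, and two fp32 Adam moments); that byte count and the function name `adam_cpu_state_gib` are assumptions for illustration, not numbers taken from DeepSpeed. The real footprint also includes buffers, per-process copies, and model loading, so treat the result as a lower bound on the optimizer state alone.

```python
def adam_cpu_state_gib(num_params: float, bytes_per_param: int = 16) -> float:
    """Estimate CPU memory (GiB) for Adam state kept on the host.

    bytes_per_param=16 is an assumption: 4 (fp32 weights) + 4 (fp32 grads)
    + 4 (first moment) + 4 (second moment).
    """
    return num_params * bytes_per_param / 2**30


if __name__ == "__main__":
    # Nominal parameter counts for the two model sizes mentioned above.
    for name, n in [("1.3b", 1.3e9), ("2.7b", 2.7e9)]:
        print(f"{name}: ~{adam_cpu_state_gib(n):.1f} GiB of optimizer state")
```

If measured usage is far above this estimate, the extra memory is probably going somewhere other than the optimizer state itself.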
Hello. Could you try using the settings that we use in the paper? So don't add the --iid flag and use the number of workers and number of clients that...
Hi @Antonio-demo I think you can create a new issue and provide some more details, e.g. the command that you are running.
Hi, I'm not sure exactly what experimental results you're trying to reproduce. I don't know whether we had any results in the paper with mode=fedavg, num_clients=200, num_workers=10, local_batch_size=1. Could...
python cv_train.py --dataset_name CIFAR10 --model ResNet9 --mode sketch --num_clients 10000 --num_workers 100 --num_rows 1 --num_cols 50000 --error_type virtual --local_momentum 0.0 --virtual_momentum 0.9 --max_grad_norm 10.0 --num_devices=1 --lr_scale 0.4 --local_batch_size -1 --share_ps_gpu
Can you change `--num_cols` from 50000 to 500000?
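
To see what this change means for the sketch size, here is a minimal sketch, assuming the sketch is a table with `num_rows` × `num_cols` entries. The parameter count below is a hypothetical placeholder, not the actual size of ResNet9 in this repository, and the helper names are made up for illustration.

```python
def sketch_entries(num_rows: int, num_cols: int) -> int:
    """Number of entries in a num_rows x num_cols sketch table."""
    return num_rows * num_cols


def compression_ratio(num_params: int, num_rows: int, num_cols: int) -> float:
    """How many gradient coordinates share each sketch entry, on average."""
    return num_params / sketch_entries(num_rows, num_cols)


if __name__ == "__main__":
    HYPOTHETICAL_NUM_PARAMS = 5_000_000  # placeholder, not measured
    for cols in (50_000, 500_000):
        ratio = compression_ratio(HYPOTHETICAL_NUM_PARAMS, num_rows=1, num_cols=cols)
        print(f"num_rows=1, num_cols={cols}: ~{ratio:.0f}x compression")
```

With a single row, a tenfold increase in `num_cols` makes the sketch ten times larger and reduces the compression ratio by the same factor.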