Stas Bekman

Results: 664 comments by Stas Bekman

Very interesting. I have never seen such behavior before. I wasn't part of the DeepSpeed integration in Accelerate, so you probably need to ask there. The HF Trainer integration works...

> > 2. the other question is why, when the model was allocated via `zero.Init` w/o offload, it consumes 10GB of CPU memory and not close to 1GB? I bracketed...
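A minimal sketch of such bracketing, assuming `psutil` for the RSS readings, a bare-bones ZeRO-3 config without offload, and a placeholder model; the exact config and model name are illustrative, not a reproduction of the original measurement:

```python
# Hypothetical sketch: bracket the allocation under zero.Init with CPU RSS
# readings. Run under the `deepspeed` launcher so distributed is initialized.
import psutil
import deepspeed
from transformers import AutoModelForSeq2SeqLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},  # ZeRO-3, no CPU offload configured
}

def cpu_rss_gib():
    # resident set size of this process, in GiB
    return psutil.Process().memory_info().rss / 2**30

before = cpu_rss_gib()
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-3b")
after = cpu_rss_gib()
print(f"CPU RSS delta around zero.Init: {after - before:.2f} GiB")
```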

> This is due to unnecessary ZeRO stage 3 memory allocation that is exposed because of the shared code base. To address this, I have embarked on the **trivial** task...

Hi Adam, indeed, we have finished training the 176B model, so hopefully this version will accept your work. In the case of JeanZay, from my many experiments, IO seems to be the...

There is one more dimension to this design discussion: whether the additional accumulator is sharded or not. E.g., currently the bf16 optimizer allocates a local accumulator on each...
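To make the sharded-vs-replicated trade-off concrete, here is illustrative back-of-envelope arithmetic; the parameter count and world size are assumptions, not figures from this thread:

```python
# Illustrative arithmetic only: per-rank memory for an additional fp32
# accumulator, replicated on every rank vs. sharded across ranks.
params = 176e9        # assumed parameter count
world_size = 384      # assumed number of ranks
bytes_per_fp32 = 4

replicated_gib = params * bytes_per_fp32 / 2**30  # ~656 GiB per rank
sharded_gib = replicated_gib / world_size         # ~1.7 GiB per rank
print(f"replicated: {replicated_gib:,.0f} GiB per rank")
print(f"sharded:    {sharded_gib:,.2f} GiB per rank")
```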

BF16Optimizer is effectively ZeRO stage 1, but currently it's a bit of a hack and thus uses stage=0; it's just implemented differently, so it can't be used as a normal stage 1 - this...
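For context, a sketch of the config shape this implies, under the assumption that BF16Optimizer is engaged via the `bf16` section while the ZeRO stage is left at 0:

```python
# Assumed config sketch: bf16 enabled, stage reported as 0, even though
# BF16Optimizer itself behaves like a differently implemented stage 1.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 0},
}
```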

> We already have examples for running some transformer networks. For this argument, I think you might just add **local_rank** to your parser arguments, the same as [here](https://github.com/microsoft/DeepSpeedExamples/blob/20ea07a2a069696abec212e25476a9bf76aced70/bing_bert/utils.py#L51-L54). This...
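The linked snippet amounts to an argparse argument along these lines (a sketch mirroring the linked DeepSpeedExamples code; the parser setup around it is illustrative):

```python
# Minimal sketch of adding local_rank to a parser; default -1 conventionally
# means "not launched by a distributed launcher".
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank passed in by the distributed launcher")
args = parser.parse_args()
```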

And the error is right there in your report: https://github.com/microsoft/DeepSpeed/issues/889#issuecomment-806526657

```
c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-10.2/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/TH -isystem /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages/torch/include/THC...
```

This sounds like a permission issue. Try setting `TMPDIR` to another dir that is writable by you, e.g.:

```
mkdir ~/tmp
export TMPDIR=~/tmp
... do the build here ...
```

Yes, once the bf16/z0 PR is merged we can look at fp16/z0 next. The other approach is to:
1. start with random optim states
2. run for some steps with...