Stas Bekman

Results 664 comments of Stas Bekman

As this problem is recurrent for HF Transformers users, in the meantime I shared a hack to stagger checkpoint loading for those who need it here: https://github.com/huggingface/transformers/issues/17534#issuecomment-1151693075 If you're not using HF...
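For context, a minimal sketch of the staggering idea, assuming `torch.distributed` is already initialized; the `staggered_load` helper, the `load_model` callable and the `group_size` argument are hypothetical placeholders, not the code from the linked comment:

```python
import torch.distributed as dist

def staggered_load(load_model, group_size=2):
    """Load the checkpoint a few ranks at a time so that not every rank
    hits the disk and fills host RAM at the same moment."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    model = None
    for start in range(0, world_size, group_size):
        if start <= rank < start + group_size:
            model = load_model()  # e.g. AutoModel.from_pretrained(...)
        dist.barrier()  # the next group starts only once this one has finished
    return model
```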

To make it generic we then need some sort of a black-list of keys not to shard, configurable by the user. An exact match would probably be safer. Additional info: This...
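A hedged illustration of what such a user-configurable black-list could look like; the parameter names and the `should_shard` helper are made up for the example:

```python
# Exact-match set of parameter names the user wants to keep unsharded.
no_shard_keys = {"model.shared.weight", "lm_head.weight"}  # illustrative names

def should_shard(param_name: str) -> bool:
    # Exact match is safer than substring matching, which could silently
    # exclude unrelated params whose names merely contain the pattern.
    return param_name not in no_shard_keys
```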

> However the downside of doing this is then ZeRO does not support loading the checkpoint with a different data parallel (DP) size than what was used to save the...

I have an idea. Instead of adding code complexity to make the checkpoint loading elastic, why not make the checkpoint itself elastic instead? That is, optimize/simplify the code to work with a...
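A rough sketch of that idea, purely illustrative (the real ZeRO checkpoint layout is more involved than a single flat tensor per rank): consolidate the per-rank shards into one DP-agnostic representation, then re-split it for whatever DP size is used next.

```python
import torch

def consolidate(shard_paths):
    """Merge per-rank 1D slices of the flattened fp32 state into one tensor."""
    shards = [torch.load(p, map_location="cpu") for p in shard_paths]
    return torch.cat(shards)  # no longer tied to the original DP size

def reshard(flat_state, new_dp_size):
    """Split the consolidated state for a new data-parallel degree
    (padding/alignment handling omitted for brevity)."""
    return list(torch.chunk(flat_state, new_dp_size))
```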

> Stas told me that a blocklist already exists for weights. It's not an explicit listing of names, but by param size: https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
> stage3_param_persistence_threshold: integer
> Do not partition...
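For reference, here is how that size-based threshold appears in a ZeRO-3 config, written as a Python dict (which the HF Trainer also accepts); the value shown is illustrative, not a recommendation:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # params with fewer elements than this stay whole on every GPU instead
        # of being partitioned, which acts as a size-based "blocklist"
        "stage3_param_persistence_threshold": 1e5,
    },
}
```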

Does this code also remove the old param that is no longer in the submodule? In the particular case of https://github.com/huggingface/transformers/blob/fbe278c76c56d97df98b5884e6856c168cd2a396/src/transformers/models/m2m_100/modeling_m2m_100.py#L133-L134 where a new param is added during `forward` it's...
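To make the question concrete, here is a boiled-down version of the pattern at those lines (a sketch, not the actual HF code): the module assigns a brand-new `nn.Parameter` to `self.weights` inside `forward`, replacing the old one.

```python
import torch
from torch import nn

class SinusoidalPositionalEmbeddingSketch(nn.Module):
    def __init__(self, num_positions, dim):
        super().__init__()
        self.dim = dim
        self.weights = nn.Parameter(torch.randn(num_positions, dim), requires_grad=False)

    def forward(self, seq_len):
        if seq_len > self.weights.shape[0]:
            # a *new* Parameter object replaces the old one during forward;
            # the question is whether the old one is also dropped from
            # ZeRO-3's bookkeeping, or lingers there
            self.weights = nn.Parameter(torch.randn(seq_len, self.dim), requires_grad=False)
        return self.weights[:seq_len]
```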

new failure with this branch:
```
E   File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/translation/run_translation.py", line 620, in
E   File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/translation/run_translation.py", line 620, in
E     main()main()
E
E   File "/mnt/nvme1/code/huggingface/transformers-ds-model-zoo-2/examples/pytorch/translation/run_translation.py", line 537, in main
E   File...
```

To reproduce:
```
git clone https://github.com/stas00/transformers/
cd transformers
git checkout ds-model-zoo-2
RUN_SLOW=1 pytest tests/deepspeed/test_model_zoo.py -k test_zero_to_fp32_zero3_trans_m2m_100 -sv
```
with master it fails with:
```
File "/home/stas/anaconda3/envs/py38-pt110/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1168, in __getattr__...
```

Additionally, I wonder if someone may have a case where they don't replace the pre-existing param but remove it and then add a new param under a different name. This is...
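A hypothetical module exercising that case, i.e. the old param is deleted and a differently named one is registered during `forward` (the class and attribute names are invented for the example):

```python
import torch
from torch import nn

class SwapParamSketch(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.old_weight = nn.Parameter(torch.randn(dim))

    def forward(self, x):
        if not hasattr(self, "new_weight"):
            del self.old_weight  # removes the param from the module's _parameters
            self.new_weight = nn.Parameter(torch.randn(self.dim))  # registered under a new name
        return x * self.new_weight
```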