Stas Bekman

664 comments by Stas Bekman

That sounds good. Thank you for proposing it, Sylvain. So no warning is needed, right? This logic is really about dynamic default setting, and it will be documented as such.

Thank you for suggesting a more elegant solution than my initial one, Sylvain.

Unfortunately I don't have experience with FSDP to contribute to this discussion.

With ZeRO-3, outside of the fwd/bwd logic where this is done automatically, you need to manually gather whichever sharded weights you want to access. Please see: https://huggingface.co/docs/transformers/main/main_classes/deepspeed#gathering-parameters And you will find...
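A minimal sketch of the gathering context manager referred to above; `engine` and `lm_head` are assumptions here, the exact usage is in the linked docs:

```python
import deepspeed

# Sketch only: `engine` is assumed to be an already-initialized DeepSpeed
# ZeRO stage-3 engine and `lm_head` an assumed submodule of the wrapped model.
# Outside of fwd/bwd the parameters are sharded; the context manager below
# gathers the full weights and re-shards them again on exit.
param = engine.module.lm_head.weight
with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
    full_weight = param.detach().cpu().clone()  # full, un-sharded tensor
```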

DeepSpeed saves the optimizer states as well as the fp32 master weights, so of course the checkpoint folder is larger. Look at the contents of the saved checkpoint folder. I'm not...

No, they are saved in their own files under `global_step*`. You might want to inspect the contents of the folder. Please feel free to report the full listing and their sizes...
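A quick way to produce such a listing (the checkpoint path is a hypothetical example, adjust to your run):

```python
import os

# Walk a checkpoint folder and print the size of every file, so the
# optimizer-state shards under global_step*/ can be compared against the
# model weights file.
ckpt_dir = "output_dir/checkpoint-500"  # assumption: path to your checkpoint
for root, _dirs, files in os.walk(ckpt_dir):
    for name in sorted(files):
        path = os.path.join(root, name)
        size_gb = os.path.getsize(path) / 2**30
        print(f"{size_gb:6.2f}G  {path}")
```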

Oh, thank you! Now that you're showing the actual file sizes, it's much easier to see what you're talking about. Indeed this looks wrong. I have seen this happening in...

Wonderful. It was fixed in the PP saving code in DeepSpeed at https://github.com/microsoft/DeepSpeed/pull/1324 when I first saw this problem in Megatron-Deepspeed a year ago. So we probably need to do the same...

Actually, it will require a bit of efficiency work. PP already had a small `state_dict`, so it wasn't a problem to clone tensors in small groups. But here...
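For illustration only (this is not the actual DeepSpeed fix): saving a tensor that is a view of a larger storage serializes the whole storage, which is the kind of checkpoint bloat the cloning works around:

```python
import io
import torch

# A small view into a big tensor still references the big storage,
# and torch.save serializes the entire storage, not just the view.
big = torch.zeros(10_000_000)  # ~40MB of float32
view = big[:10]

def saved_size(obj):
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.tell()

print(saved_size(view))          # roughly the full ~40MB storage
print(saved_size(view.clone()))  # only a few hundred bytes
```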

Hmm, but deepspeed doesn't do checkpoint sharding; those shards come from `transformers`:
```
 32K test2/pytorch_model.bin.index.json
9.2G test2/pytorch_model-00001-of-00003.bin
9.3G test2/pytorch_model-00002-of-00003.bin
6.7G test2/pytorch_model-00003-of-00003.bin
```
So I am actually not sure that the...
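For reference, that shard layout is what `transformers` produces in `save_pretrained()` once the weights exceed `max_shard_size`; a sketch with an assumed model name and shard size:

```python
from transformers import AutoModelForCausalLM

# Sketch: the pytorch_model-XXXXX-of-XXXXX.bin shards and the
# pytorch_model.bin.index.json are written by transformers itself;
# DeepSpeed plays no part in this step. Model and shard size are
# illustrative assumptions chosen so sharding actually triggers.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.save_pretrained("test2", max_shard_size="200MB")
# -> test2/pytorch_model-00001-of-0000N.bin ... + pytorch_model.bin.index.json
```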