Stas Bekman

664 comments by Stas Bekman

That sounds good. Thank you for proposing it, Sylvain. So no warning is needed, right? This logic is really about dynamic default setting, and it will be documented as such.

Thank you for suggesting a more elegant solution than my initial one, Sylvain.

Unfortunately I don't have experience with FSDP to contribute to this discussion.

With ZeRO-3, outside of the fwd/bwd logic where this is done automatically, you need to manually gather whichever sharded weights you want to access. Please see: https://huggingface.co/docs/transformers/main/main_classes/deepspeed#gathering-parameters And you will find...
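A minimal sketch of the gathering context manager referred to above; `engine` and `lm_head` are assumptions here, the exact usage is in the linked docs:

```python
import deepspeed

# Sketch only: `engine` is assumed to be an already-initialized DeepSpeed
# ZeRO stage-3 engine and `lm_head` an assumed submodule of the wrapped model.
# Outside of fwd/bwd the parameters are sharded; the context manager below
# gathers the full weights and re-shards them again on exit.
param = engine.module.lm_head.weight
with deepspeed.zero.GatheredParameters(param, modifier_rank=None):
    full_weight = param.detach().cpu().clone()  # full, un-sharded tensor
```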

DeepSpeed saves the optimizer states as well as the fp32 master weights, so of course the checkpoint folder is larger. Look at the contents of the saved checkpoint folder. I'm not...

No, they are saved in their own files under `global_step*`. You might want to inspect the contents of the folder. Please feel free to report the full listing and their sizes...
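A quick way to produce such a listing (the checkpoint path is a hypothetical example, adjust to your run):

```python
import os

# Walk a checkpoint folder and print the size of every file, so the
# optimizer-state shards under global_step*/ can be compared against the
# model weights file.
ckpt_dir = "output_dir/checkpoint-500"  # assumption: path to your checkpoint
for root, _dirs, files in os.walk(ckpt_dir):
    for name in sorted(files):
        path = os.path.join(root, name)
        size_gb = os.path.getsize(path) / 2**30
        print(f"{size_gb:6.2f}G  {path}")
```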

Oh, thank you! Now that you're showing the actual file sizes, it's much easier to see what you're talking about. Indeed this looks wrong. I have seen this happening in...

Wonderful. It was fixed in the PP saving code in DeepSpeed at https://github.com/microsoft/DeepSpeed/pull/1324 when I first saw this problem in Megatron-Deepspeed a year ago. So we probably need to do the same...

Actually, it will require a bit of efficiency work. PP already had a small `state_dict`, so it wasn't a problem to clone tensors in small groups. But here...
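For illustration only (this is not the actual DeepSpeed fix): saving a tensor that is a view of a larger storage serializes the whole storage, which is the kind of checkpoint bloat the cloning works around:

```python
import io
import torch

# A small view into a big tensor still references the big storage,
# and torch.save serializes the entire storage, not just the view.
big = torch.zeros(10_000_000)  # ~40MB of float32
view = big[:10]

def saved_size(obj):
    buf = io.BytesIO()
    torch.save(obj, buf)
    return buf.tell()

print(saved_size(view))          # roughly the full ~40MB storage
print(saved_size(view.clone()))  # only a few hundred bytes
```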

Hmm, but deepspeed doesn't do checkpoint sharding; those shards come from `transformers`:
```
 32K test2/pytorch_model.bin.index.json
9.2G test2/pytorch_model-00001-of-00003.bin
9.3G test2/pytorch_model-00002-of-00003.bin
6.7G test2/pytorch_model-00003-of-00003.bin
```
So I am actually not sure that the...
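For reference, that shard layout is what `transformers` produces in `save_pretrained()` once the weights exceed `max_shard_size`; a sketch with an assumed model name and shard size:

```python
from transformers import AutoModelForCausalLM

# Sketch: the pytorch_model-XXXXX-of-XXXXX.bin shards and the
# pytorch_model.bin.index.json are written by transformers itself;
# DeepSpeed plays no part in this step. Model and shard size are
# illustrative assumptions chosen so sharding actually triggers.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.save_pretrained("test2", max_shard_size="200MB")
# -> test2/pytorch_model-00001-of-0000N.bin ... + pytorch_model.bin.index.json
```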