Sylvain Gugger

631 comments by Sylvain Gugger

You can use DDP if your model is only on one device like this.

Then you cannot use DDP + `device_map="auto"`. You need to use DeepSpeed or FSDP.
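To see why the two do not combine, it can help to check where the weights actually ended up. Below is a minimal sketch, assuming a mapping shaped like the `hf_device_map` attribute that Transformers sets on a model loaded with `device_map="auto"` (module name to device). The helper name and the example mapping are made up for illustration.

```python
def spans_multiple_devices(device_map):
    """Return True if the modules in `device_map` sit on more than one device."""
    return len(set(device_map.values())) > 1


# Hypothetical mapping, in the shape of `model.hf_device_map`.
example_map = {"embed_tokens": 0, "layers.0": 0, "layers.1": 1, "lm_head": 1}

if spans_multiple_devices(example_map):
    print("Model is split across devices: use DeepSpeed or FSDP, not DDP.")
else:
    print("Model is on a single device: DDP is fine.")
```

DDP expects each process to hold a full copy of the model on its own device, which is not the case once the mapping contains more than one device.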

I feel like you are not listening. You cannot use `DDP + device_map="auto"`, and thus not `DDP + device_map="auto" + DeepSpeed` either. You need to just use DeepSpeed ZeRO-3...

Yes, as long as you properly configure DeepSpeed ZeRO-3, you won't need to use `device_map="auto"`, and the model will be loaded on several GPUs (each weight will be split).
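For reference, here is a rough sketch of what a minimal ZeRO-3 configuration could look like, written as a Python dict and dumped to JSON. Only `zero_optimization.stage = 3` is what turns on parameter sharding. The function name and file name are made up for illustration, and the `"auto"` values assume the config is consumed through the Transformers Trainer integration, which fills them in from the training arguments.

```python
import json


def build_zero3_config():
    """Return a minimal DeepSpeed config dict with ZeRO stage 3 enabled."""
    return {
        "zero_optimization": {
            # Stage 3 shards parameters, gradients and optimizer states.
            "stage": 3,
        },
        # Assumption: "auto" is resolved by the Transformers integration.
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
    }


if __name__ == "__main__":
    # Hypothetical file name; point your launcher or training arguments at it.
    with open("ds_zero3.json", "w") as f:
        json.dump(build_zero3_config(), f, indent=2)
```

With a config like this the model is loaded without `device_map="auto"`, and DeepSpeed takes care of splitting the weights across the GPUs.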

You can use the `dispatch_batches=True` option and only load your dataset in process 0 (loading something with the same length but no real samples in it in the other...

The machine you are using lacks the necessary amount of RAM to load the model in the 4 processes at the same time (you need 4 times the amount of...

Yes, for DDP with 4 GPUs you need 4x the size of the model in CPU RAM. Note that training with Adam usually requires 4x the size of...
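The CPU RAM arithmetic can be sketched as below. This is a back-of-the-envelope estimate, assuming every process loads its own full copy of the weights at the same time. The function name and the 7-billion-parameter, 2-bytes-per-parameter model are made-up example values, not figures from this thread.

```python
def ddp_cpu_ram_bytes(num_params, bytes_per_param, num_processes):
    """Estimate CPU RAM needed when each DDP process loads a full model copy."""
    return num_params * bytes_per_param * num_processes


# Hypothetical example: 7e9 parameters stored in 16-bit precision, 4 processes.
one_copy = ddp_cpu_ram_bytes(7_000_000_000, 2, 1)
four_copies = ddp_cpu_ram_bytes(7_000_000_000, 2, 4)

print(f"one copy:    {one_copy / 1e9:.0f} GB")
print(f"four copies: {four_copies / 1e9:.0f} GB")
```

This only counts the weights while loading; it does not include optimizer state or activations.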

The sharding strategy is optimizer only though, from what I see in your training arguments. You need to shard the model as well (cc @pacman100 who will know more on...