Sylvain Gugger comments

Results: 631 comments by Sylvain Gugger

lr_scheduler step once in code, but in every process lr_scheduler step 4 times (when using 4 gpus) why?

That's because with 4 GPUs your effective batch size is 4 times bigger, so the total number of training steps is 4 times smaller.
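To make the arithmetic concrete, here is a minimal sketch in plain Python. The dataset size and per-GPU batch size are hypothetical numbers picked for illustration, and `steps_per_epoch` is a helper defined here, not an Accelerate function. It assumes each process draws its own batch, so the effective batch size is the per-GPU batch size times the number of processes.

```python
import math


def steps_per_epoch(num_samples: int, per_device_batch_size: int, num_processes: int) -> int:
    """Optimizer steps per epoch when every process consumes its own batch."""
    effective_batch_size = per_device_batch_size * num_processes
    return math.ceil(num_samples / effective_batch_size)


# Hypothetical values, chosen only for illustration.
num_samples = 6400
per_device_batch_size = 16

single_gpu_steps = steps_per_epoch(num_samples, per_device_batch_size, num_processes=1)
four_gpu_steps = steps_per_epoch(num_samples, per_device_batch_size, num_processes=4)

# How many times a schedule planned for the single-GPU step count
# has to advance per optimizer step to finish on 4 GPUs.
scheduler_ticks_per_step = single_gpu_steps // four_gpu_steps

print(single_gpu_steps, four_gpu_steps, scheduler_ticks_per_step)
```

With these numbers the run goes from 400 steps on one GPU to 100 steps on four, so a schedule planned for 400 steps only completes if it advances 4 times per optimizer step.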

lr_scheduler step once in code, but in every process lr_scheduler step 4 times (when using 4 gpus) why?

If you account for everything yourself, then you don't need to use Accelerate :-)

Error with transformers 4.28.1

That error usually comes from a borked install of PyTorch. You should try to re-install it.

Problem with Webdataset

cc @muellerzr

Encountering raise ValueError("Integer parameters are unsupported") when using FSDP and load_in_8bit=True

cc @younesbelkada

Setting NVMe in DeepSpeed ZeRO with accelerate config, but nowhere to specify NVMe path

cc @pacman100

Multi-GPU setup overloads a single GPU instead of distributing the load

It's hard to know without knowing the script you run, but it's very likely that you do not have enough RAM to load the model on the 2 processes: each...

Feature request - SLURM support

Yes, I'm sure many at FAIR use it since it's a facebookincubator project. It remains that the last commit is 6 months old. I see an issue opened 6 months...

FP8 training causes OOM

Indeed, the linear layer needs to be created with the same dtype as the original one. Would you like to suggest a PR with a fix?

FP8 training causes OOM

Yes, you can definitely open a PR with this fix.