Sourab Mangrulkar

236 comments by Sourab Mangrulkar

For other launchers, could you try a hostfile with the content below? Also, are you able to ssh from one machine to another (e.g., successfully running `ssh genca1002` from the genca1001 node)?...
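The exact hostfile referenced above is truncated; for reference, a DeepSpeed-style hostfile lists one hostname per line with a GPU slot count. The `slots=8` value below is an assumption — set it to the number of GPUs on each node:

```
genca1001 slots=8
genca1002 slots=8
```

The hostnames match the ones mentioned in the ssh check; with the standard DeepSpeed launcher this file is passed via `--hostfile`.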

> I have waited for nearly 30 minutes but there has been no change in the output or the GPU utilization. Is there GPU utilization along with GPU memory usage?...

Strange that the standard launcher is throwing NCCL errors while the other DeepSpeed launchers work fine 😅

> > I have waited for nearly 30 minutes but there has been no change in the output or the GPU utilization.
>
> Is there GPU utilization along with...

Yes, I also observed that no progress bar is shown in the multi-node setup, so I resorted to printing the loss every `n` steps. When I killed the process on the second...
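The "print the loss every `n` steps" workaround can be sketched as below. This is a toy loop with illustrative names (the decaying `loss` stands in for a real optimizer step); in an actual multi-node run you would guard the print with a rank check such as `accelerator.is_main_process`:

```python
def train(num_steps=100, log_every=20):
    """Toy training loop: instead of relying on a tqdm progress bar
    (which may not render in multi-node logs), print the loss every
    `log_every` steps so there is visible evidence of progress."""
    logged = []
    loss = 2.0
    for step in range(1, num_steps + 1):
        loss *= 0.99  # stand-in for a real forward/backward/optimizer step
        if step % log_every == 0:
            # In a real job: if accelerator.is_main_process: print(...)
            print(f"step {step}: loss {loss:.4f}")
            logged.append(loss)
    return logged
```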

> No, I just ran the code with the command `NCCL_IB_GID_INDEX=3 NCCL_DEBUG=INFO accelerate launch script.py`
>
> This is the output on the first machine:
>
> ```
> Genc-A100-VM:36759:36970 [0]...
> ```

Hello @Aaryan369, in wandb the GPU utilisation is 100%, which should mean training is happening. Did you add print statements to debug, as I did? I feel wandb metrics...

Hello, this is a known issue, similar to the `Fairscale` integration of Trainer, wherein `predict_with_generate` isn't supported by FSDP. This is already planned to be mentioned among the known caveats in the...

Hello, I spent a major part of today diving deep into this. I'm observing very weird behaviour, but got a small script to work. 1. The code (dist_gen.py) is below: ```python import...

Hello @Dahoas, shared embedding layers should belong to the same FSDP unit, and `size_based_wrap` puts them in different units, leading to an error. Hence, for transformers, `TRANSFORMER_BASED_WRAP` should be used....
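In an `accelerate` config file, the transformer-based wrap policy can be selected with a fragment like the one below. The layer class name (`GPT2Block`) is an assumption for illustration — substitute the transformer block class of your own model:

```yaml
# Hypothetical fragment of an accelerate config; GPT2Block is an assumption.
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: GPT2Block
```

Wrapping at the transformer-block level keeps tied/shared embedding layers inside the same FSDP unit, which is what a purely size-based policy fails to guarantee.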