Sourab Mangrulkar

236 comments by Sourab Mangrulkar

For other launchers, could you try a hostfile with the content below? Also, are you able to ssh from one machine to another (e.g., successfully running `ssh genca1002` from the genca1001 node)?...
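The exact hostfile referenced above is truncated; for reference, a DeepSpeed-style hostfile lists one hostname per line with a GPU slot count. The `slots=8` value below is an assumption — set it to the number of GPUs on each node:

```
genca1001 slots=8
genca1002 slots=8
```

The hostnames match the ones mentioned in the ssh check; with the standard DeepSpeed launcher this file is passed via `--hostfile`.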

> I have waited for nearly 30 minutes but there has been no change in the output or the GPU utilization. Is there GPU utilization along with GPU memory usage?...

Strange that the standard launcher is throwing NCCL errors while the other DeepSpeed launchers work fine 😅

> > I have waited for nearly 30 minutes but there has been no change in the output or the GPU utilization.
>
> Is there GPU utilization along with...

Yes, I also observed that no progress bar is shown in the multi-node setup, so I resorted to printing the loss every `n` steps. When I killed the process on the second...
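The "print the loss every `n` steps" workaround can be sketched as below. This is a toy loop with illustrative names (the decaying `loss` stands in for a real optimizer step); in an actual multi-node run you would guard the print with a rank check such as `accelerator.is_main_process`:

```python
def train(num_steps=100, log_every=20):
    """Toy training loop: instead of relying on a tqdm progress bar
    (which may not render in multi-node logs), print the loss every
    `log_every` steps so there is visible evidence of progress."""
    logged = []
    loss = 2.0
    for step in range(1, num_steps + 1):
        loss *= 0.99  # stand-in for a real forward/backward/optimizer step
        if step % log_every == 0:
            # In a real job: if accelerator.is_main_process: print(...)
            print(f"step {step}: loss {loss:.4f}")
            logged.append(loss)
    return logged
```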

> No, I just ran the code with the command `NCCL_IB_GID_INDEX=3 NCCL_DEBUG=INFO accelerate launch script.py`
>
> This is the output on the first machine:
>
> ```
> Genc-A100-VM:36759:36970 [0]...
> ```

Hello @Aaryan369, in wandb the GPU utilisation is 100%, which should mean training is happening. Did you add print statements to debug, as I did? I feel wandb metrics...

Hello, this is a known issue, similar to the `Fairscale` integration of Trainer, wherein `predict_with_generate` isn't supported by FSDP. This is already planned to be mentioned among the known caveats in the...

Hello, I spent a major part of today diving deep into this. I'm observing very weird behaviour, but got a small script to work. 1. The code (dist_gen.py) is below: ```python import...

Hello @Dahoas, shared embedding layers should belong to the same FSDP unit, and `size_based_wrap` puts them in different units, leading to an error. Hence, for transformers, `TRANSFORMER_BASED_WRAP` should be used....
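In an `accelerate` config file, the transformer-based wrap policy can be selected with a fragment like the one below. The layer class name (`GPT2Block`) is an assumption for illustration — substitute the transformer block class of your own model:

```yaml
# Hypothetical fragment of an accelerate config; GPT2Block is an assumption.
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: GPT2Block
```

Wrapping at the transformer-block level keeps tied/shared embedding layers inside the same FSDP unit, which is what a purely size-based policy fails to guarantee.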