Jiatong (Julius) Han
This example does not support TP yet. Have you tried the `colossalai_gemini` strategy with placement set to `cuda`?
May I know when the OOM happened? Was it after model init or at the start of first-epoch training?
With the same strategy, how about setting placement to `cpu`? Some users reported that it worked.
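For reference, a minimal sketch of switching between the two placements, assuming the Coati-style `ColossalAIStrategy` API from the ColossalAI Chat examples (class and parameter names may differ across versions):

```python
# Hedged sketch: assumes the Coati-style strategy API; names vary by version.
from coati.trainer.strategies import ColossalAIStrategy

# Gemini (ZeRO stage 3) keeping parameters on GPU:
strategy = ColossalAIStrategy(stage=3, placement_policy='cuda')

# Same strategy with CPU offload, if GPU memory is tight:
strategy = ColossalAIStrategy(stage=3, placement_policy='cpu')
```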
PP is not really applicable here and has not been tested in this scenario yet. Have you tried the ddp strategy, without ColossalAI? A plain-PyTorch sanity check is sketched below.
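As a baseline, here is a minimal plain-PyTorch DDP check (no ColossalAI); the model and batch are hypothetical placeholders for your own:

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_check.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])

x = torch.randn(8, 1024, device=f"cuda:{local_rank}")  # placeholder batch
loss = model(x).sum()
loss.backward()  # gradients are all-reduced across ranks here
dist.destroy_process_group()
```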
Also, could you use the torch profiler to track your memory usage, so we can tell which step caused the OOM?
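Something along these lines should work; the model and batch are placeholders for your own:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder for your model
batch = torch.randn(8, 1024, device="cuda")  # placeholder for your batch

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # record tensor allocation/free events
    record_shapes=True,
) as prof:
    loss = model(batch).sum()
    loss.backward()

# Ops holding the most CUDA memory, worst first:
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))
```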
Sorry @loveJasmine, currently we only support GPUs with compute capability >= 7.0, as noted [here](https://github.com/hpcaitech/ColossalAI#installation).
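You can check your GPU's compute capability like this:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")  # needs to be >= 7.0
```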
I think it is due to a mismatch between your NVIDIA runtime (`ubuntu`) and your OS environment (`debian`). You might want to change one of them so they match.
The reason is that `/usr/bin/supervisord` is taken as the name of the variable instead of the value to export. It happens in the line `cd /root && export ="/usr/bin/supervisord"`: there is a redundant empty space where the variable name should be, just before the `=`.
Can you try the `torchrun` command directly and see if the error persists?
There was no error, but it hung? Or did it run normally?