Jiatong (Julius) Han
This example does not support TP yet. Have you tried the `colossalai_gemini` strategy with placement set to `cuda`?
May I know when the OOM happened? Was it after model init or at the start of first-epoch training?
With the same strategy, how about setting placement to `cpu`? Some users reported that it worked.
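For reference, a minimal sketch of switching between the two placements, assuming the Coati-style `ColossalAIStrategy` API from the ColossalAI Chat examples (class and parameter names may differ across versions):

```python
# Hedged sketch: assumes the Coati-style strategy API; names vary by version.
from coati.trainer.strategies import ColossalAIStrategy

# Gemini (ZeRO stage 3) keeping parameters on GPU:
strategy = ColossalAIStrategy(stage=3, placement_policy='cuda')

# Same strategy with CPU offload, if GPU memory is tight:
strategy = ColossalAIStrategy(stage=3, placement_policy='cpu')
```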
PP is not really applicable here and has not been tested in this scenario yet. Have you tried the ddp strategy, without ColossalAI? A plain-PyTorch sanity check is sketched below.
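As a baseline, here is a minimal plain-PyTorch DDP check (no ColossalAI); the model and batch are hypothetical placeholders for your own:

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_check.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])

x = torch.randn(8, 1024, device=f"cuda:{local_rank}")  # placeholder batch
loss = model(x).sum()
loss.backward()  # gradients are all-reduced across ranks here
dist.destroy_process_group()
```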
Also, could you use the torch profiler to track your memory usage, so we can tell which step caused the OOM?
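Something along these lines should work; the model and batch are placeholders for your own:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()   # placeholder for your model
batch = torch.randn(8, 1024, device="cuda")  # placeholder for your batch

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,   # record tensor allocation/free events
    record_shapes=True,
) as prof:
    loss = model(batch).sum()
    loss.backward()

# Ops holding the most CUDA memory, worst first:
print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10))
```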
Sorry @loveJasmine, currently we only support GPUs with compute capability >= 7.0, as noted [here](https://github.com/hpcaitech/ColossalAI#installation).
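You can check your GPU's compute capability like this:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")  # needs to be >= 7.0
```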
I think it is due to a mismatch between your NVIDIA runtime (`ubuntu`) and your OS environment (`debian`). You might want to change one of them so they match.
The reason is that `/usr/bin/supervisord` is taken as the name of the variable instead of the value to export. It happens in the line `cd /root && export ="/usr/bin/supervisord"`: there is a redundant empty space where the variable name should be, just before the `=`.
Can you try the `torchrun` command directly and see if the error persists?
There was no error, but it hung? Or did it run normally?