ColossalAI
[BUG]: Titan example is too slow to start
🐛 Describe the bug
When running the Titan demo on 8x RTX 3090 GPUs, training takes about 20 minutes to start; GPU usage stays at 100% during startup.
pipeline=4; tp=2;
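For reference, that parallel setting would appear in a ColossalAI config file roughly like the sketch below (assumed layout; the actual contents of gpt2_small_zero3_pp1d.py may differ):

# Sketch of the parallel section of a ColossalAI config (assumed layout)
parallel = dict(
    pipeline=4,                      # 4 pipeline stages (PP=4)
    tensor=dict(size=2, mode='1d'),  # 2-way 1D tensor parallelism (TP=2)
)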
Environment
CUDA=11.3, Python=3.8, PyTorch=1.12.1
Can you please share the command you ran?
python -m torch.distributed.launch --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.19.102.26 --master_port 29500 train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch $DUMMY_DATA
Please set --nproc_per_node to 4.
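For example, keeping everything else from the original command and changing only --nproc_per_node (on the second node, --node_rank would be 1):

python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank 0 --master_addr 10.19.102.26 --master_port 29500 train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch $DUMMY_DATA

With pipeline=4 and tp=2, one model replica needs 4 x 2 = 8 GPUs, and 2 nodes x 4 processes per node gives exactly that world size; launching 8 processes per node would start 16 ranks, twice as many as the parallel configuration expects.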
Could you please explain the relationship between TP, PP, nproc_per_node, global_batch_size, and micro_batch? Thanks.
They refer to quite different concepts. What exactly are you confused about? Can you give me some concrete failure samples? Otherwise, I'd suggest our official guide here.
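As a rough sketch of how these quantities typically relate in hybrid parallel training (the variable names and batch values below are illustrative, not ColossalAI API):

# Illustrative arithmetic only; names and example values are hypothetical.
nnodes = 2
nproc_per_node = 4
world_size = nnodes * nproc_per_node   # total processes/GPUs = 8

tp = 2                                 # tensor-parallel degree
pp = 4                                 # pipeline-parallel degree
dp = world_size // (tp * pp)           # data-parallel degree = 1
assert world_size == tp * pp * dp      # all GPUs must be accounted for

micro_batch = 4                        # samples per pipeline micro-batch (example value)
num_micro_batches = 8                  # micro-batches accumulated per step (example value)
# Effective batch per optimizer step across all data-parallel replicas:
global_batch_size = micro_batch * num_micro_batches * dp  # = 32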
There is no error message, but startup is too slow: it takes about 20 minutes before training begins.