
[BUG]: start titan example too slow

Open joan126 opened this issue 1 year ago • 6 comments

🐛 Describe the bug

When running the Titan demo on 8x 3090 GPUs, startup is very slow, taking about 20 minutes, with GPU usage at 100%.

pipeline=4; tp=2;

Environment

CUDA=11.3, python=3.8, pytorch=1.12.1

joan126 avatar Mar 07 '23 10:03 joan126

Can you please share the command you ran?

JThh avatar Mar 07 '23 12:03 JThh

Can you please share the command you ran?

python -m torch.distributed.launch --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.19.102.26 --master_port 29500 train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch $DUMMY_DATA

superhg avatar Mar 08 '23 05:03 superhg

Please set --nproc_per_node to be 4.

JThh avatar Mar 08 '23 09:03 JThh
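For reference, the corrected launch line would look like the following (a sketch assuming the advice above: with --nnodes 2 and TP=2, PP=4, each node should contribute 4 processes so the total world size is 8; the address, port, config path, and $DUMMY_DATA are taken verbatim from the command earlier in this thread):

```shell
# Corrected launch: 2 nodes x 4 processes/node = 8 ranks = TP(2) * PP(4).
python -m torch.distributed.launch \
    --nproc_per_node 4 --nnodes 2 --node_rank 0 \
    --master_addr 10.19.102.26 --master_port 29500 \
    train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py \
    --from_torch $DUMMY_DATA
```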

Please set --nproc_per_node to be 4.

Could you please explain the relationship between TP, PP, nproc_per_node, global_batch_size, and micro_batch? Thanks.

joan126 avatar Mar 08 '23 09:03 joan126

They refer to very different concepts lol. What are you confused about? Can you give me some concrete failure examples? Otherwise, I'd suggest our official guide here.

JThh avatar Mar 08 '23 10:03 JThh
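The relationship asked about above can be sketched as simple arithmetic (a hedged sketch; the names `tp`, `pp`, `dp`, and `micro_batch` are illustrative, not ColossalAI's actual config keys): the launcher spawns nnodes * nproc_per_node processes, tensor and pipeline parallelism consume tp * pp of that product, the remainder is the data-parallel degree, and the global batch size is the micro-batch times the micro-batches per pipeline pass times the data-parallel degree.

```python
def parallel_layout(nnodes, nproc_per_node, tp, pp):
    """Return (world_size, dp): total ranks and the implied data-parallel degree."""
    world_size = nnodes * nproc_per_node   # one process per GPU
    assert world_size % (tp * pp) == 0, "TP * PP must divide the world size"
    dp = world_size // (tp * pp)           # leftover GPUs form data-parallel replicas
    return world_size, dp

def global_batch(micro_batch, num_micro_batches, dp):
    """Global batch = micro-batch size * micro-batches per step * DP replicas."""
    return micro_batch * num_micro_batches * dp

# Example matching this thread: 2 nodes x 4 procs/node, TP=2, PP=4
world, dp = parallel_layout(nnodes=2, nproc_per_node=4, tp=2, pp=4)
# world == 8, dp == 1
```

This also shows why nproc_per_node=8 with nnodes=2 is inconsistent here: it would create 16 ranks, double what TP=2 * PP=4 can cleanly absorb without an intended data-parallel dimension.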

They refer to very different concepts lol. What are you confused about? Can you give me some concrete failure examples? Otherwise, I'd suggest our official guide here.

No error message, but training startup is too slow: it takes 20 minutes before training begins.

joan126 avatar Mar 08 '23 10:03 joan126