
[BUG]: start titan example too slow

Open joan126 opened this issue 1 year ago • 6 comments

🐛 Describe the bug

When running the Titan demo on 8x 3090 GPUs, startup is very slow, taking about 20 minutes, with GPU usage at 100%.

pipeline=4; tp=2;

Environment

CUDA=11.3, python=3.8, pytorch=1.12.1

joan126 avatar Mar 07 '23 10:03 joan126

Can you please share the command you ran?

JThh avatar Mar 07 '23 12:03 JThh

Can you please share the command you ran?

python -m torch.distributed.launch --nproc_per_node 8 --nnodes 2 --node_rank 0 --master_addr 10.19.102.26 --master_port 29500 train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py --from_torch $DUMMY_DATA

superhg avatar Mar 08 '23 05:03 superhg

Please set --nproc_per_node to be 4.

JThh avatar Mar 08 '23 09:03 JThh
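For reference, the corrected launch line would look like the following (a sketch assuming the advice above: with --nnodes 2 and TP=2, PP=4, each node should contribute 4 processes so the total world size is 8; the address, port, config path, and $DUMMY_DATA are taken verbatim from the command earlier in this thread):

```shell
# Corrected launch: 2 nodes x 4 processes/node = 8 ranks = TP(2) * PP(4).
python -m torch.distributed.launch \
    --nproc_per_node 4 --nnodes 2 --node_rank 0 \
    --master_addr 10.19.102.26 --master_port 29500 \
    train_gpt.py --config ./configs/gpt2_small_zero3_pp1d.py \
    --from_torch $DUMMY_DATA
```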

Please set --nproc_per_node to be 4.

Could you please explain the relationship between TP, PP, nproc_per_node, global_batch_size, and micro_batch? Thanks.

joan126 avatar Mar 08 '23 09:03 joan126

They refer to very different concepts lol. What are you confused about? Can you give me some concrete failure examples? Otherwise, I'd suggest our official guide here.

JThh avatar Mar 08 '23 10:03 JThh
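The relationship asked about above can be sketched as simple arithmetic (a hedged sketch; the names `tp`, `pp`, `dp`, and `micro_batch` are illustrative, not ColossalAI's actual config keys): the launcher spawns nnodes * nproc_per_node processes, tensor and pipeline parallelism consume tp * pp of that product, the remainder is the data-parallel degree, and the global batch size is the micro-batch times the micro-batches per pipeline pass times the data-parallel degree.

```python
def parallel_layout(nnodes, nproc_per_node, tp, pp):
    """Return (world_size, dp): total ranks and the implied data-parallel degree."""
    world_size = nnodes * nproc_per_node   # one process per GPU
    assert world_size % (tp * pp) == 0, "TP * PP must divide the world size"
    dp = world_size // (tp * pp)           # leftover GPUs form data-parallel replicas
    return world_size, dp

def global_batch(micro_batch, num_micro_batches, dp):
    """Global batch = micro-batch size * micro-batches per step * DP replicas."""
    return micro_batch * num_micro_batches * dp

# Example matching this thread: 2 nodes x 4 procs/node, TP=2, PP=4
world, dp = parallel_layout(nnodes=2, nproc_per_node=4, tp=2, pp=4)
# world == 8, dp == 1
```

This also shows why nproc_per_node=8 with nnodes=2 is inconsistent here: it would create 16 ranks, double what TP=2 * PP=4 can cleanly absorb without an intended data-parallel dimension.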

They refer to very different concepts lol. What are you confused about? Can you give me some concrete failure examples? Otherwise, I'd suggest our official guide here.

No error message, but training startup is too slow: it takes 20 minutes before training begins.

joan126 avatar Mar 08 '23 10:03 joan126