
Why is FSDP much slower than normal Data Parallel?

Open SefaZeng opened this issue 3 years ago • 5 comments

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I changed the ddp-backend from no_c10d to fully_sharded, as it is said to be more efficient, but training is much slower than before: 100 steps take about 5 minutes with the original ddp-backend and about 18 minutes with FSDP.

Code

With ddp-backend=no_c10d:

2022-06-28 16:11:33 | INFO | train_inner | epoch 001:   5639 / 42117 loss=5.111, nll_loss=3.228, ppl=9.37, wps=156839, ups=0.35, wpb=448862, bsz=20586, num_updates=5600, lr=0.000422577, gnorm=0.188, loss_scale=8, train_wall=276, gb_free=11.6, wall=17835
2022-06-28 16:16:20 | INFO | train_inner | epoch 001:   5739 / 42117 loss=5.105, nll_loss=3.221, ppl=9.32, wps=156115, ups=0.35, wpb=449261, bsz=20598.5, num_updates=5700, lr=0.000418854, gnorm=0.191, loss_scale=8, train_wall=278, gb_free=12.1, wall=18123
2022-06-28 16:21:05 | INFO | train_inner | epoch 001:   5839 / 42117 loss=5.082, nll_loss=3.196, ppl=9.17, wps=157603, ups=0.35, wpb=448412, bsz=20614.2, num_updates=5800, lr=0.000415227, gnorm=0.188, loss_scale=16, train_wall=276, gb_free=12.6, wall=18407

While with ddp-backend=fully_sharded:

2022-07-01 05:53:19 | INFO | train_inner | epoch 001:  41351 / 42117 loss=4.451, nll_loss=2.52, ppl=5.74, wps=41767.9, ups=0.09, wpb=451067, bsz=20620.8, num_updates=1100, lr=0.000137573, gnorm=0.092, loss_scale=16, train_wall=1070, gb_free=21, wall=0
2022-07-01 06:11:27 | INFO | train_inner | epoch 001:  41451 / 42117 loss=4.458, nll_loss=2.528, ppl=5.77, wps=41119.9, ups=0.09, wpb=447612, bsz=20585.2, num_updates=1200, lr=0.00015007, gnorm=0.094, loss_scale=16, train_wall=1079, gb_free=21.2, wall=0
2022-07-01 06:29:30 | INFO | train_inner | epoch 001:  41551 / 42117 loss=4.453, nll_loss=2.523, ppl=5.75, wps=41404.2, ups=0.09, wpb=448513, bsz=20568.7, num_updates=1300, lr=0.000162568, gnorm=0.098, loss_scale=32, train_wall=1074, gb_free=22, wall=0
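Reading the two excerpts side by side, the wps (words per second) field already quantifies the gap. A small sketch, with the log lines abbreviated from the excerpts above (the get_wps helper is hypothetical, not part of fairseq):

```shell
#!/bin/sh
# Sketch: compare throughput using the wps field from fairseq's
# train_inner log lines.

get_wps() {
  # Split the comma-separated key=value pairs and keep the wps value.
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/.*wps=//p'
}

# Abbreviated log lines from the excerpts above:
ddp_line='2022-06-28 16:11:33 | INFO | train_inner | epoch 001: loss=5.111, wps=156839, ups=0.35'
fsdp_line='2022-07-01 05:53:19 | INFO | train_inner | epoch 001: loss=4.451, wps=41767.9, ups=0.09'

awk -v a="$(get_wps "$ddp_line")" -v b="$(get_wps "$fsdp_line")" \
    'BEGIN { printf "FSDP is %.1fx slower by wps\n", a/b }'
```

By this measure FSDP is running at roughly a quarter of the no_c10d throughput, which matches the 5-minute vs. 18-minute wall-clock observation.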

What have you tried?

What's your environment?

  • fairseq Version (e.g., 1.0 or main): main
  • PyTorch Version (e.g., 1.0): 1.10
  • OS (e.g., Linux):
  • How you installed fairseq (pip, source):
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: 11.3
  • GPU models and configuration: A100
  • Any other relevant information:

SefaZeng avatar Jul 01 '22 08:07 SefaZeng

How do you launch the parallel execution?

I am not so sure, but it seems that qsub results in slower execution, while qrsh makes fsdp and no_c10d run at the same speed.

gmryu avatar Jul 01 '22 13:07 gmryu

> How do you launch the parallel execution?
>
> I am not so sure, but it seems that qsub results in slower execution, while qrsh makes fsdp and no_c10d run at the same speed.

Thanks for your reply. This is my script:

OMP_NUM_THREADS=20 \
TORCH_EXTENSIONS_DIR=$work_dir/torch_extension_a100 \
/opt/anaconda3/bin/python -m torch.distributed.launch --nproc_per_node=8 \
 --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=$MASTER_ADDR \
 --master_port=$MASTER_PORT \
  $fairseq_dir/train.py \
  $data \
 --task translation_multi_simple_epoch \
 --sampling-method "temperature" \
 --sampling-temperature 1.5 \
 --encoder-langtok "tgt" \
 --langs "$lang_list" \
 --lang-pairs "$lang_pairs" \
 --save-dir $output_dir \
 --arch transformer \
 --attention-dropout 0.1 \
 --activation-dropout 0.1 \
 --dropout 0.1 \
 --encoder-layers 12 \
 --decoder-layers 12 \
 --encoder-embed-dim 1024 \
 --decoder-embed-dim 1024 \
 --encoder-attention-heads 16 \
 --decoder-attention-heads 16 \
 --encoder-ffn-embed-dim 4096 \
 --decoder-ffn-embed-dim 4096 \
 --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-8 --clip-norm 0.0 \
 --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
 --warmup-init-lr 1e-07 \
 --weight-decay 0.0001 \
 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
 --max-tokens 4096 \
 --update-freq 8 \
 --log-interval 100 \
 --save-interval-updates 10000 \
 --skip-invalid-size-inputs-valid-test \
 --save-interval 100000000000 \
 --num-workers 1 \
 --seed 1017 \
 --fp16  \
 --ddp-backend=no_c10d  # change to fully_sharded if use FSDP
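As an aside, fairseq's FSDP integration documents a few companion flags that are worth trying when fully_sharded is slow. The fragment below is only a sketch, not a drop-in replacement for the command above; the flag names follow the fairseq FSDP documentation, and the values are illustrative:

```sh
# Sketch: FSDP-related tail of the launch command (illustrative values).
# --no-reshard-after-forward keeps full parameters resident between the
#   forward and backward passes, trading memory for fewer all-gathers.
# --min-params-to-wrap wraps only large submodules in FSDP, reducing the
#   number of communication calls per step.
 --ddp-backend fully_sharded \
 --no-reshard-after-forward \
 --min-params-to-wrap 100000000 \
 --fp16
# Optional: offload sharded parameters and optimizer state to CPU
# (fairseq pairs --cpu-offload with --optimizer cpu_adam):
#  --cpu-offload --optimizer cpu_adam
```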

SefaZeng avatar Jul 04 '22 01:07 SefaZeng

No, I do not mean this script. I mean the step before this execution.

Since --nnodes is in there, are you using some cloud service for computation? How do you submit your job, and how do you link the nodes so they work together? It feels like launching through Open MPI slows down fsdp, while direct ssh communication keeps fsdp at the same speed as no_c10d.

gmryu avatar Jul 04 '22 10:07 gmryu

> No, I do not mean this script. I mean the step before this execution.
>
> Since --nnodes is in there, are you using some cloud service for computation? How do you submit your job, and how do you link the nodes so they work together? It feels like launching through Open MPI slows down fsdp, while direct ssh communication keeps fsdp at the same speed as no_c10d.

Yes, I am using an in-house platform to run deep-learning training jobs. I specify how many GPUs and how much memory I need and provide a start script like the one above; parameters like WORLD_SIZE, RANK, and MASTER_ADDR are set by the platform.
About Open MPI: do you mean OMP_NUM_THREADS=20 will slow down FSDP?

SefaZeng avatar Jul 05 '22 02:07 SefaZeng

It is not OMP. OpenMP is used inside one node; MPI is what is used for communication between nodes.

How do you submit jobs to your in-house platform? Do you connect to each node yourself and run torch.distributed.launch, or do you use a command like mpirun to have the nodes execute torch.distributed.launch?

What I experienced is: when I used mpirun to have the nodes execute torch.distributed.launch, it was very slow. When I instead ssh'ed into each node directly and ran torchrun, the speed was acceptable (about 80% as fast as no_c10d; it also depends on your optimizer, and cpu_adam was faster in my case).
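One way to see whether the launcher or the interconnect is at fault is NCCL's own logging. These are standard NCCL environment variables; the command below is only a sketch:

```sh
# NCCL_DEBUG=INFO logs which transport (InfiniBand vs. plain sockets) and
# which network interface each rank picks at startup.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
  python -m torch.distributed.launch ... train.py ...
# If NCCL falls back to a slow interface, pin the right one explicitly:
#   export NCCL_SOCKET_IFNAME=eth0   # interface name is an example
```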

A different point: use torchrun --nproc_per_node=$NPROC_PER_NODE --nnodes=$NNODES --master_addr=$MASTER_HOST instead. They are basically the same, but torchrun is newer.
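For reference, the torch.distributed.launch invocation from the script above maps onto torchrun almost one-to-one. A sketch, where the $-variables are the same ones the platform sets:

```sh
# torchrun replaces `python -m torch.distributed.launch`; the rendezvous
# flags keep the same names as in the earlier launch script.
torchrun --nproc_per_node=8 \
  --nnodes=${WORLD_SIZE} --node_rank=${RANK} \
  --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
  $fairseq_dir/train.py $data  # ...remaining fairseq args unchanged
```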

--

If what I wrote above does not ring a bell, I am sorry. I would recommend emailing or @-mentioning whoever implemented FSDP and asking which platform they use to test the code.

gmryu avatar Jul 07 '22 10:07 gmryu