Slow training on Mixtral-8x22B when DP size > 1

sunilitggu opened this issue 10 months ago · 5 comments

Describe the bug

We are fine-tuning Mixtral-8x22B on an instruction fine-tuning dataset. The model is partitioned with PP=8 and TP=4. Our experiments are conducted on DGX nodes, each equipped with 8 H100 GPUs, interconnected via 3.2 Tb/s InfiniBand.

We tested the model with various DP sizes: 1, 2, 4, and 8. Throughout all experiments we kept the micro batch size at 1 and the global batch size at 128; the number of gradient accumulation steps is therefore determined by the DP size (global batch size / (micro batch size × DP)).

Here are the average global batch processing times (GBPT):

#nodes=4, DP=1: GBPT = 2 sec
#nodes=8, DP=2: GBPT = 10 sec
#nodes=16, DP=4: GBPT = 32 sec
#nodes=32, DP=8: GBPT = 36 sec

Adding more nodes drastically increases the time per global batch instead of reducing it.
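
For context, these settings imply the usual relation global_batch_size = micro_batch_size × DP × accumulation_steps. A minimal sketch of that arithmetic (illustrative Python, not NeMo code; the function name is just for this example):

# Sketch of the batching arithmetic assumed above (not NeMo code).
def accumulation_steps(global_batch_size: int, micro_batch_size: int, dp_size: int) -> int:
    per_step = micro_batch_size * dp_size
    assert global_batch_size % per_step == 0, "GBS must be divisible by MBS * DP"
    return global_batch_size // per_step

for dp in (1, 2, 4, 8):
    steps = accumulation_steps(128, 1, dp)   # GBS=128, MBS=1
    print(f"DP={dp}: {steps} accumulation steps per global batch")

With GBS=128 and MBS=1, DP=8 leaves only 16 accumulation steps per global batch.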

Steps/Code to reproduce bug

TRAIN="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]" VALID="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]" TEST="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]" MODEL="/checkpoints/mixtral-8x22b-v0.1/nemo-checkpoints/"

VALID_NAMES="v10p1" CONCAT_SAMPLING_PROBS="[1]"

read -r -d '' cmd <<EOF
echo "STARTING*" \
&& python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.precision=bf16 \
    trainer.devices=8 \
    trainer.num_nodes=16 \
    trainer.val_check_interval=1000 \
    trainer.max_steps=10000 \
    trainer.log_every_n_steps=1 \
    trainer.use_distributed_sampler=False \
    model.restore_from_path=${MODEL} \
    model.micro_batch_size=1 \
    model.global_batch_size=128 \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=8 \
    model.sequence_parallel=True \
    +model.expert_model_parallel_size=2 \
    +model.data.train_ds.pad_to_max_length=True \
    +model.data.test_ds.pad_to_max_length=True \
    +model.data.validation_ds.pad_to_max_length=True \
    model.optim.name=fused_adam \
    model.megatron_amp_O2=True \
    model.optim.lr=5e-6 \
    model.answer_only_loss=True \
    model.peft.peft_scheme=none \
    model.data.train_ds.file_names=${TRAIN} \
    model.data.validation_ds.file_names=${VALID} \
    model.data.test_ds.file_names=${TEST} \
    model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
    model.data.train_ds.max_seq_length=2048 \
    model.data.train_ds.num_workers=4 \
    model.data.validation_ds.num_workers=4 \
    model.data.test_ds.num_workers=4 \
    model.data.validation_ds.metric.name=loss \
    model.data.test_ds.metric.name=loss \
    exp_manager.create_wandb_logger=False \
    exp_manager.explicit_log_dir=./result-2/ \
    exp_manager.resume_if_exists=True \
    exp_manager.resume_ignore_no_checkpoint=True \
    exp_manager.create_checkpoint_callback=True \
    exp_manager.name=exp-2 \
    exp_manager.checkpoint_callback_params.monitor=validation_loss \
    exp_manager.checkpoint_callback_params.save_best_model=False \
    exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
    exp_manager.checkpoint_callback_params.mode=min
EOF

srun -N 16 -o logs/log-moe-16nodes-%j.txt \
    --gpus-per-node=8 --ntasks-per-node=8 --cpus-per-task=8 --mem=2000G \
    --partition=nlp \
    --container-mounts="/vast/core42-nlp/shared/model_checkpoints/:/checkpoints/" \
    --job-name=nemo-train \
    --container-workdir=$PWD \
    --container-image="/vast/core42-nlp/users/sunil.sahu/nemo_24_03_py3.sqsh" \
    bash -c "${cmd}"

Expected behaviour

Increasing the number of nodes increases the DP size and reduces the number of gradient accumulation steps per global batch, which should speed up training.
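
As a back-of-the-envelope check (assuming each micro-batch step costs the same regardless of node count; purely illustrative, not a measurement):

# Back-of-the-envelope comparison of ideal vs. reported global batch times,
# assuming perfect data-parallel scaling from the DP=1 baseline (illustrative only).
reported_gbpt = {1: 2.0, 2: 10.0, 4: 32.0, 8: 36.0}   # DP size -> seconds per global batch
baseline = reported_gbpt[1]

for dp, observed in reported_gbpt.items():
    ideal = baseline / dp          # fewer accumulation steps -> proportionally shorter GBPT
    print(f"DP={dp}: ideal ~{ideal:.2f} s, observed {observed:.1f} s "
          f"(~{observed / ideal:.0f}x slower than ideal)")

Under that assumption, DP=8 would ideally take around 0.25 s per global batch, so the observed 36 s is roughly two orders of magnitude off.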

Environment details

NVIDIA Docker image: nvcr.io/nvidia/nemo:24.03.framework

sunilitggu · Apr 24 '24 18:04

Hi, thanks for reporting this,

Can you retry without EP and report back whether this improves the speed? In addition, I would encourage trying different TP/PP configurations to determine the optimal one.

Thank you.

akoumpa · Apr 26 '24 19:04

Thank you for your response. We have already tried running without EP; however, it was slower than with EP. Below are the average times recorded without EP:

#nodes=4, DP=1: GBPT = 2 sec
#nodes=8, DP=2: GBPT = 12 sec
#nodes=16, DP=4: GBPT = 34 sec

We have also experimented with different TP×PP combinations, such as 8x4, 4x8, and 8x8. In terms of speed, all of them performed worse than the configuration reported in the issue.
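
To put the two sets of numbers side by side, a rough normalisation is tokens per second per GPU, assuming every sample is padded to max_seq_length=2048 (as pad_to_max_length=True implies); this is only an estimate from the reported GBPTs, not a measurement:

# Rough per-GPU throughput from the reported GBPTs (illustrative only).
GBS, SEQ_LEN, GPUS_PER_NODE = 128, 2048, 8
runs = [  # (label, nodes, dp, gbpt_seconds)
    ("with EP", 4, 1, 2.0), ("with EP", 8, 2, 10.0), ("with EP", 16, 4, 32.0), ("with EP", 32, 8, 36.0),
    ("no EP",   4, 1, 2.0), ("no EP",   8, 2, 12.0), ("no EP",  16, 4, 34.0),
]
for label, nodes, dp, gbpt in runs:
    tok_per_s_per_gpu = GBS * SEQ_LEN / gbpt / (nodes * GPUS_PER_NODE)
    print(f"{label}: nodes={nodes}, DP={dp}: ~{tok_per_s_per_gpu:,.0f} tokens/s/GPU")

By this estimate, per-GPU throughput drops from roughly 4,100 tokens/s/GPU at DP=1 to under 100 at DP=4 and beyond, with or without EP.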

sunilitggu · Apr 27 '24 04:04

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · May 28 '24 01:05

@sunilitggu can you try with top-of-tree NeMo (git clone) and set your optimizer to mcore_distributed_optim (via model.optim.name='mcore_distributed_optim')?

akoumpa · May 28 '24 20:05

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Jun 28 '24 01:06

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Jul 06 '24 01:07