Slow training on Mixtral-8x22B when DP size > 1
Describe the bug
We are fine-tuning Mixtral-8x22B on an instruction fine-tuning dataset. The model is partitioned with PP=8 and TP=4. Our experiments run on DGX nodes, each equipped with 8 H100 GPUs; the nodes are interconnected via 3.2 Tbps InfiniBand.
We tested the model with various DP sizes: 1, 2, 4, and 8. Throughout all experiments we kept the micro batch size at 1 and the global batch size at 128, so the number of gradient accumulation steps varies with the DP size.
Here are the average global batch processing times (GBPT):
#nodes=4,  DP=1, GBPT = 2 sec
#nodes=8,  DP=2, GBPT = 10 sec
#nodes=16, DP=4, GBPT = 32 sec
#nodes=32, DP=8, GBPT = 36 sec
Instead of speeding up training, adding more nodes sharply increases the time to process a global batch.
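For context, the sketch below (illustrative shell arithmetic only, not part of our training script) shows the per-rank gradient accumulation count implied by these settings, which is why we expected the larger runs to be faster:

# Illustrative only: with a fixed global batch size, each DP rank performs
#   grad_acc_steps = global_batch_size / (micro_batch_size * DP)
GBS=128
MBS=1
for DP in 1 2 4 8; do
  echo "DP=${DP}: grad_acc_steps=$(( GBS / (MBS * DP) ))"
done
# Prints 128, 64, 32, and 16 accumulation steps respectively.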
Steps/Code to reproduce bug
TRAIN="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]" VALID="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]" TEST="[/vast/core42-nlp/users/sunil.sahu/dataset/v10p1_llama_temp/train_test/test.jsonl]" MODEL="/checkpoints/mixtral-8x22b-v0.1/nemo-checkpoints/"
VALID_NAMES="v10p1" CONCAT_SAMPLING_PROBS="[1]"
read -r -d '' cmd <<EOF
echo "STARTING*"
&& python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py
trainer.precision=bf16
trainer.devices=8
trainer.num_nodes=16
trainer.val_check_interval=1000
trainer.max_steps=10000
trainer.log_every_n_steps=1
trainer.use_distributed_sampler=False
model.restore_from_path=${MODEL}
model.micro_batch_size=1
model.global_batch_size=128
model.tensor_model_parallel_size=4
model.pipeline_model_parallel_size=8
model.sequence_parallel=True
+model.expert_model_parallel_size=2
+model.data.train_ds.pad_to_max_length=True
+model.data.test_ds.pad_to_max_length=True
+model.data.validation_ds.pad_to_max_length=True
model.optim.name=fused_adam
model.megatron_amp_O2=True
model.optim.lr=5e-6
model.answer_only_loss=True
model.peft.peft_scheme=none
model.data.train_ds.file_names=${TRAIN}
model.data.validation_ds.file_names=${VALID}
model.data.test_ds.file_names=${TEST}
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS}
model.data.train_ds.max_seq_length=2048
model.data.train_ds.num_workers=4
model.data.validation_ds.num_workers=4
model.data.test_ds.num_workers=4
model.data.validation_ds.metric.name=loss
model.data.test_ds.metric.name=loss
exp_manager.create_wandb_logger=False
exp_manager.explicit_log_dir=./result-2/
exp_manager.resume_if_exists=True
exp_manager.resume_ignore_no_checkpoint=True
exp_manager.create_checkpoint_callback=True
exp_manager.name=exp-2
exp_manager.checkpoint_callback_params.monitor=validation_loss
exp_manager.checkpoint_callback_params.save_best_model=False
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True
exp_manager.checkpoint_callback_params.mode=min
EOF
srun -N 16 -o logs/log-moe-16nodes-%j.txt
--gpus-per-node=8 --ntasks-per-node=8 --cpus-per-task=8 --mem=2000G
--partition=nlp
--container-mounts="/vast/core42-nlp/shared/model_checkpoints/:/checkpoints/"
--job-name=nemo-train
--container-workdir=$PWD
--container-image="/vast/core42-nlp/users/sunil.sahu/nemo_24_03_py3.sqsh"
bash -c "${cmd}"
Expected behaviour
Increasing the number of nodes increases the DP size and reduces the number of gradient accumulation steps per global batch, which should speed up training.
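(For reference, the DP sizes quoted above follow directly from the node count and the TP/PP split; the snippet below is an illustrative check only, not part of our job script.)

# Illustrative only: data-parallel size implied by the launch settings.
#   DP = (num_nodes * gpus_per_node) / (TP * PP)
GPUS_PER_NODE=8
TP=4
PP=8
for NODES in 4 8 16 32; do
  echo "nodes=${NODES}: DP=$(( NODES * GPUS_PER_NODE / (TP * PP) ))"
done
# Prints DP = 1, 2, 4, and 8 for 4, 8, 16, and 32 nodes respectively.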
Environment details
Container image: nvcr.io#nvidia/nemo:24.03.framework
Hi, thanks for reporting this,
Can you retry without EP and report back whether this improves the speed? In addition, I would encourage trying different TP/PP configurations to determine the optimal one.
Thank you.
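For example (illustrative only, keeping the rest of the launch command from the report unchanged), the overrides could look like this:

# Illustrative override strings for the suggested experiments; append them to
# the python command in place of the corresponding lines in the report.
# Disable expert parallelism by dropping "+model.expert_model_parallel_size=2",
# or pass 1 explicitly:
NO_EP_OVERRIDE="+model.expert_model_parallel_size=1"
# An alternative TP/PP split with the same 32-way model parallelism:
TP8_PP4_OVERRIDES="model.tensor_model_parallel_size=8 model.pipeline_model_parallel_size=4"
echo "e.g.: python megatron_gpt_finetuning.py ... ${NO_EP_OVERRIDE} ${TP8_PP4_OVERRIDES}"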
Thank you for your response. We have already tried running without EP; however, it proved to be slower than with EP. Below are the average times recorded without EP:
#nodes=4,  DP=1, GBPT = 2 sec
#nodes=8,  DP=2, GBPT = 12 sec
#nodes=16, DP=4, GBPT = 34 sec
We have also experimented with different TP x PP combinations, such as 8x4, 4x8, and 8x8. In terms of speed, all of them performed worse than the configuration reported in the issue.
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
@sunilitggu can you try with top-of-tree NeMo (git clone) and set your optimizer to mcore_distributed_optim (via model.optim.name='mcore_distributed_optim')?
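A sketch of what that change could look like, assuming the launch script from the report (the optimizer name is the one suggested above):

# Sketch only: use the latest NeMo sources instead of the 24.03 container copy,
# e.g. by mounting the clone into the container, then swap the optimizer name.
git clone https://github.com/NVIDIA/NeMo.git
# In the training command, replace
#   model.optim.name=fused_adam
# with
#   model.optim.name=mcore_distributed_optim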
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.