DeepSpeedExamples
Megatron-LM-v1.1.5-3D_parallelism ds_pretrain_gpt2_pipe.sh error
An error occurred while running megatron/Megatron-LM-v1.1.5-3D_parallelism/examples/ds_pretrain_gpt2_pipe.sh:
finished creating GPT2 datasets ...
setting training data start iteration to 0
setting validation data start iteration to 0
done with setups ...
time (ms) | model and optimizer: 1716.01 | train/valid/test data iterators: 5358.74
training ...
[2022-07-06 11:31:24,416] [INFO] [checkpointing.py:547:forward] Activation Checkpointing Information
[2022-07-06 11:31:24,416] [INFO] [checkpointing.py:548:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2022-07-06 11:31:24,416] [INFO] [checkpointing.py:551:forward] ----contiguous Memory Checkpointing False with None total layers
[2022-07-06 11:31:24,416] [INFO] [checkpointing.py:554:forward] ----Synchronization False
[2022-07-06 11:31:24,416] [INFO] [checkpointing.py:555:forward] ----Profiling time in checkpointing False

The same traceback is printed by every failing rank (the traces are interleaved in the original output):

Traceback (most recent call last):
  File "pretrain_gpt2.py", line 157, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/root/zsf/DeepSpeedExamples-master/megatron/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 98, in pretrain
    iteration = train(forward_step_func,
  File "/root/zsf/DeepSpeedExamples-master/megatron/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 481, in train
    loss_dict, skipped_iter = train_step(forward_step_func,
  File "/root/zsf/DeepSpeedExamples-master/megatron/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 325, in train_step
    return train_step_pipe(model, data_iterator)
  File "/root/zsf/DeepSpeedExamples-master/megatron/Megatron-LM-v1.1.5-3D_parallelism/megatron/training.py", line 359, in train_step_pipe
    loss = model.train_batch(data_iter=data_iterator)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 353, in train_batch
    self._exec_schedule(sched)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1384, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1034, in _exec_send_grads
    p2p.send(inputs[1], self.prev_stage)
IndexError: tuple index out of range
ds_report
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
JIT compiled ops requires ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [YES] ...... [OKAY]
cpu_adagrad ............ [YES] ...... [OKAY]
fused_adam ............. [YES] ...... [OKAY]
fused_lamb ............. [YES] ...... [OKAY]
sparse_attn ............ [YES] ...... [OKAY]
transformer ............ [YES] ...... [OKAY]
stochastic_transformer . [YES] ...... [OKAY]
async_io ............... [YES] ...... [OKAY]
utils .................. [YES] ...... [OKAY]
quantizer .............. [YES] ...... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0a0+b6df043
torch cuda version ............... 11.5
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/opt/conda/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.6.5, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.5
Have you solved this problem?
@drcege, this is a deprecated Megatron example. Please use https://github.com/microsoft/Megatron-DeepSpeed instead.
@tjruwase Thanks. Indeed, I was trying gpt-neox and encountered the same issue, so I searched for the cause and a solution.
I know about Microsoft's Megatron-DeepSpeed, but the problem is that the Megatron version it is based on is somewhat old, and newer improvements such as the distributed optimizer and flash attention are not integrated. It would be great if it were rebased onto the latest Megatron.
@drcege, you are correct that Megatron-DeepSpeed is behind the latest Megatron. From the stack trace, however, the failure seems to be in DeepSpeed itself. So the right thing to do is to open an issue with DeepSpeed using your gpt-neox failure. Would that work for you?