
Error Encountered When Running pipeline_parallelism in DeepSpeedExamples

Open · Sunjnn opened this issue on Oct 27, 2023 · 4 comments

I encountered the following errors while attempting to run the pipeline_parallelism example in the training directory:

ValueError: Expected input batch_size (8) to match target batch_size (4).
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([8, 4096]) and output[0] has a shape of torch.Size([4, 4096]).
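
For context, both messages are standard PyTorch shape checks rather than anything DeepSpeed-specific, and they can be reproduced in isolation with the shapes from the traceback. A minimal sketch (the 4096 is just the feature width reported in the log; the tensors are random placeholders):

```python
import torch
import torch.nn.functional as F

# 1) Forward-pass failure: logits for 8 samples but labels for only 4.
logits = torch.randn(8, 4096)           # shape of the last stage's output in the log
labels = torch.randint(0, 4096, (4,))   # label batch with a smaller batch dimension
try:
    F.cross_entropy(logits, labels)
except ValueError as e:
    print(e)  # Expected input batch_size (8) to match target batch_size (4).

# 2) Backward-pass failure: a received gradient whose shape does not match the output.
outputs = torch.randn(4, 4096, requires_grad=True) * 2   # non-leaf tensor, shape [4, 4096]
grad_tensors = torch.ones(8, 4096)                        # gradient tensor, shape [8, 4096]
try:
    torch.autograd.backward(tensors=(outputs,), grad_tensors=(grad_tensors,))
except RuntimeError as e:
    print(e)  # Mismatch in shape: grad_output[0] ... and output[0] ...
```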

Steps to Reproduce

Run the following command in the training/pipeline_parallelism directory:

bash run.sh

It produced the output below:

[2023-10-27 09:06:36,873] [INFO] [logging.py:96:log_dist] [Rank 0] step=10, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 10 loss: 2.2931 iter time (s): 0.778 samples/sec: 329.220
[2023-10-27 09:06:39,810] [INFO] [logging.py:96:log_dist] [Rank 0] step=20, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 20 loss: 2.2034 iter time (s): 0.294 samples/sec: 871.983
[2023-10-27 09:06:42,711] [INFO] [logging.py:96:log_dist] [Rank 0] step=30, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 30 loss: 2.1367 iter time (s): 0.290 samples/sec: 882.464
[2023-10-27 09:06:45,642] [INFO] [logging.py:96:log_dist] [Rank 0] step=40, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 40 loss: 2.0592 iter time (s): 0.293 samples/sec: 874.900
[2023-10-27 09:06:48,542] [INFO] [logging.py:96:log_dist] [Rank 0] step=50, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 50 loss: 1.9842 iter time (s): 0.290 samples/sec: 883.154
[2023-10-27 09:06:51,435] [INFO] [logging.py:96:log_dist] [Rank 0] step=60, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 60 loss: 1.8755 iter time (s): 0.290 samples/sec: 881.762
[2023-10-27 09:06:54,333] [INFO] [logging.py:96:log_dist] [Rank 0] step=70, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 70 loss: 1.8765 iter time (s): 0.289 samples/sec: 886.708
[2023-10-27 09:06:57,224] [INFO] [logging.py:96:log_dist] [Rank 0] step=80, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 80 loss: 1.7825 iter time (s): 0.289 samples/sec: 884.299
[2023-10-27 09:07:00,167] [INFO] [logging.py:96:log_dist] [Rank 0] step=90, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 90 loss: 1.7142 iter time (s): 0.295 samples/sec: 867.864
[2023-10-27 09:07:03,096] [INFO] [logging.py:96:log_dist] [Rank 0] step=100, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 100 loss: 1.6941 iter time (s): 0.291 samples/sec: 880.091
[2023-10-27 09:07:06,002] [INFO] [logging.py:96:log_dist] [Rank 0] step=110, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 110 loss: 1.6588 iter time (s): 0.291 samples/sec: 879.327
[2023-10-27 09:07:08,919] [INFO] [logging.py:96:log_dist] [Rank 0] step=120, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 120 loss: 1.6488 iter time (s): 0.291 samples/sec: 880.643
[2023-10-27 09:07:11,828] [INFO] [logging.py:96:log_dist] [Rank 0] step=130, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 130 loss: 1.6792 iter time (s): 0.292 samples/sec: 877.401
[2023-10-27 09:07:14,735] [INFO] [logging.py:96:log_dist] [Rank 0] step=140, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 140 loss: 1.6311 iter time (s): 0.289 samples/sec: 884.600
[2023-10-27 09:07:17,667] [INFO] [logging.py:96:log_dist] [Rank 0] step=150, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 150 loss: 1.6834 iter time (s): 0.292 samples/sec: 877.694
[2023-10-27 09:07:20,619] [INFO] [logging.py:96:log_dist] [Rank 0] step=160, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 160 loss: 1.6609 iter time (s): 0.295 samples/sec: 867.698
[2023-10-27 09:07:23,513] [INFO] [logging.py:96:log_dist] [Rank 0] step=170, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 170 loss: 1.6003 iter time (s): 0.289 samples/sec: 884.632
[2023-10-27 09:07:26,454] [INFO] [logging.py:96:log_dist] [Rank 0] step=180, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 180 loss: 1.5324 iter time (s): 0.294 samples/sec: 871.666
[2023-10-27 09:07:29,384] [INFO] [logging.py:96:log_dist] [Rank 0] step=190, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
steps: 190 loss: 1.5118 iter time (s): 0.293 samples/sec: 874.573
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 660, in _exec_forward_pass
    self.loss = self.module.loss_fn(outputs, labels)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 660, in _exec_forward_pass
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (4).
    self.loss = self.module.loss_fn(outputs, labels)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (4).
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 660, in _exec_forward_pass
    self.loss = self.module.loss_fn(outputs, labels)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (4).
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 660, in _exec_forward_pass
    self.loss = self.module.loss_fn(outputs, labels)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/nn/functional.py", line 3026, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
ValueError: Expected input batch_size (8) to match target batch_size (4).
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 735, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 68, in _make_grads
    raise RuntimeError("Mismatch in shape: grad_output["
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([8, 4096]) and output[0] has a shape of torch.Size([4, 4096]).
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 735, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 68, in _make_grads
    raise RuntimeError("Mismatch in shape: grad_output["
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([8, 4096]) and output[0] has a shape of torch.Size([4, 4096]).
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 735, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 68, in _make_grads
    raise RuntimeError("Mismatch in shape: grad_output["
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([8, 4096]) and output[0] has a shape of torch.Size([4, 4096]).
Traceback (most recent call last):
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 159, in <module>
    train_pipe(args)
  File "/home/sun/data/DeepSpeedExamples/training/pipeline_parallelism/train.py", line 146, in train_pipe
    loss = engine.train_batch()
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 336, in train_batch
    self._exec_schedule(sched)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 1307, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/deepspeed/runtime/pipe/engine.py", line 735, in _exec_backward_pass
    torch.autograd.backward(tensors=(outputs, ), grad_tensors=(grad_tensors, ))
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 190, in backward
    grad_tensors_ = _make_grads(tensors, grad_tensors_, is_grads_batched=False)
  File "/home/sun/apps/miniconda3/envs/deepspeed/lib/python3.10/site-packages/torch/autograd/__init__.py", line 68, in _make_grads
    raise RuntimeError("Mismatch in shape: grad_output["
RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([8, 4096]) and output[0] has a shape of torch.Size([4, 4096]).
[2023-10-27 09:07:32,546] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12353
[2023-10-27 09:07:32,861] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12354
[2023-10-27 09:07:33,303] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12355
[2023-10-27 09:07:33,585] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12356
[2023-10-27 09:07:33,904] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12357
[2023-10-27 09:07:33,935] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12358
[2023-10-27 09:07:33,955] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12359
[2023-10-27 09:07:33,955] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 12360
[2023-10-27 09:07:33,973] [ERROR] [launch.py:321:sigkill_handler] ['/home/sun/apps/miniconda3/envs/deepspeed/bin/python', '-u', 'train.py', '--local_rank=7', '--deepspeed_config=ds_config.json', '-p', '2', '--steps=200'] exits with return code = 1

Environment Information

  • Operating System: Ubuntu 18.04
  • Conda: 23.5.0
  • Python: 3.10.12
  • DeepSpeed: 0.9.5

Thanks for your help.

Sunjnn · Oct 27, 2023

I ran into the same error. Have you solved it?

xueyingliu · Oct 30, 2023

I set --steps=100 in run.sh, and the error disappeared.
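
Assuming run.sh simply forwards these flags to the deepspeed launcher (the per-rank command is visible at the end of the log above), the workaround amounts to:

```bash
# Same invocation as reported in the log, with --steps lowered from 200 to 100
deepspeed train.py --deepspeed_config=ds_config.json -p 2 --steps=100
```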

xueyingliu · Oct 30, 2023

It works! Do you know how to set the value of --steps correctly?

Sunjnn · Oct 31, 2023

@Sunjnn I don't know; I just changed the steps by chance.

xueyingliu · Oct 31, 2023