DeepSpeed
[BUG] `AssertionError: assert buffer.grad is not None` & `RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn` during pipeline parallelism
Describe the bug
I encountered these errors while using pipeline parallelism to fine-tune an LLM. The issue seems to arise when passing tensors that do not require gradients between pipeline stages:
On one rank:
Traceback (most recent call last):
  File "deepspeed_train_test.py", line 226, in <module>
    loss = engine.train_batch(data_iter=train_iter)
  File "/home/**/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 389, in train_batch
    self._exec_schedule(sched)
  File "/home/**/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1433, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/**/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1113, in _exec_send_grads
    assert buffer.grad is not None
AssertionError
On the other rank:
Traceback (most recent call last):
  File "deepspeed_train_test.py", line 226, in <module>
    loss = engine.train_batch(data_iter=train_iter)
  File "/home/**/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 389, in train_batch
    self._exec_schedule(sched)
  File "/home/**/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1433, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/**/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 863, in _exec_backward_pass
    torch.autograd.backward(tensors=out_tensors, grad_tensors=grad_tensors)
  File "/home/**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: element 1 of tensors does not require grad and does not have a grad_fn
I need to pass tensors that do not require gradients, such as position embeddings or position ids, between pipeline stages.
Despite setting activation_checkpoint_interval=0 (as suggested in #4274) and using LayerSpec (as mentioned in #4479), the problem persists. Neither checkpointable_layers nor activation_checkpoint_func resolved it.
I'm opening a new issue since this appears to be distinct from the previously reported problems.
Are there any solutions?
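For context, a heavily simplified sketch of the layer layout I mean (not the actual code from the zip; `BlockWithPosIds`, `Head`, and the sizes are placeholders, the embedding stage and loss function are omitted, and it assumes the model is built and run under the deepspeed launcher):

```python
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class BlockWithPosIds(nn.Module):
    """Placeholder transformer block that forwards position_ids to the next stage."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, inputs):
        hidden_states, position_ids = inputs
        # position_ids is an integer tensor, so it can never require grad;
        # passing it across the stage boundary is the pattern that hits the errors.
        hidden_states = self.proj(hidden_states)
        return hidden_states, position_ids

class Head(nn.Module):
    """Placeholder LM head that drops position_ids and returns logits."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        hidden_states, _ = inputs
        return self.lm_head(hidden_states)

layers = [LayerSpec(BlockWithPosIds, 1024) for _ in range(4)] + [LayerSpec(Head, 1024, 32000)]
model = PipelineModule(
    layers=layers,
    num_stages=2,
    activation_checkpoint_interval=0,  # per #4274; does not help here
)
```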
To Reproduce
All of the code is attached: deep_speed_issue.zip
Run `bash deep_speed_train.sh` and observe the error.
Expected behavior
You should get the traceback and errors shown above.
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
dc ..................... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] FP Quantizer is using an untested triton version (2.0.0), only 2.3.(0, 1) and 3.0.0 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -c /JOBs/tmpdir/pbs.12835575.spcc-adm1/tmp_q5ced0y/test.c -o /JOBs/tmpdir/pbs.12835575.spcc-adm1/tmp_q5ced0y/test.o
x86_64-linux-gnu-gcc -pthread /JOBs/tmpdir/pbs.12835575.spcc-adm1/tmp_q5ced0y/test.o -L/usr/local/cuda -L/usr/local/cuda/lib64 -lcufile -o /JOBs/tmpdir/pbs.12835575.spcc-adm1/tmp_q5ced0y/a.out
/usr/bin/ld: cannot find -lcufile
collect2: error: ld returned 1 exit status
gds .................... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/**/lib/python3.8/site-packages/torch']
torch version .................... 2.0.1+cu117
deepspeed install path ........... ['/home/**/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.16.7, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
shared memory (/dev/shm) size .... 251.61 GB
System info (please complete the following information):
- OS: Ubuntu 20.04
- GPU count and types: a single machine with 2x A40
- Interconnects: None
- Python version: 3.8
- Any other relevant info about your setup: transformers==4.46.3, torch==2.0.1+cu117
Launcher context
deepspeed --num_gpus 2 deepspeed_train_test.py --deepspeed_config deepspeed_config.json
Docker context: No
Hello, I guess you're not alone. I was also passing tensors that don't require gradients between layers and kept getting the same grad error. I got it working by making each layer compute position_ids and rotary_emb itself, so that only hidden_states is passed between layers, and by disabling activation checkpointing (activation_checkpoint_interval = 0).
I would also like to know whether it is possible to pass a tuple of tensors between layers when some of them don't require gradients.
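Roughly the shape of my workaround (a simplified sketch under the same assumptions as the sketch in the issue above, not my actual model code; `SelfContainedBlock` and the sizes are placeholders):

```python
import torch
import torch.nn as nn
from deepspeed.pipe import PipelineModule, LayerSpec

class SelfContainedBlock(nn.Module):
    """Placeholder block that recomputes position_ids locally instead of receiving them."""
    def __init__(self, hidden_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, hidden_states):
        bsz, seq_len, _ = hidden_states.shape
        # Rebuild position_ids (and any rotary cache) inside the layer, so only
        # hidden_states -- which requires grad -- crosses the stage boundary.
        position_ids = torch.arange(seq_len, device=hidden_states.device).unsqueeze(0).expand(bsz, seq_len)
        _ = position_ids  # consumed by the real attention / rotary computation
        return self.proj(hidden_states)

model = PipelineModule(
    layers=[LayerSpec(SelfContainedBlock, 1024) for _ in range(4)],
    num_stages=2,
    activation_checkpoint_interval=0,  # checkpointing disabled, as described above
)
```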
To Reproduce: all of the code is attached as deep_speed_issue.zip
@mmkjj thanks for providing a repro. Instead of a zip file, can you please share it as a gist or a repo? Thanks!
@tjruwase I created a new repo: https://github.com/mmkjj/deepspeed_pipeline_parallelism_issue. Does this work for you?