Step 3: RuntimeError: CUDA error: misaligned address
I am trying to run step 3 (RLHF) with my previously trained Actor and Reward models. However, I encounter the following exception:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
main()
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
out = trainer.generate_experience(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
seq = self._generate_sequence(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 238, in generate
with GatheredParameters(non_active_layers):
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1596, in __exit__
self.params[0].partition(param_list=self.params, has_been_updated=False)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 913, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1050, in _partition
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1060, in _partition_param
assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor.py", line 873, in __format__
return object.__format__(self, format_spec)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/nn/parameter.py", line 60, in __repr__
return 'Parameter containing:\n' + super().__repr__()
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 309, in _tensor_str
self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I then set TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1 for better debugging.
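Roughly like this (a sketch; run_step3.sh is a hypothetical stand-in for the launch script shown further below, TORCH_USE_CUDA_DSA only takes effect in a PyTorch build compiled with device-side assertions, and CUDA_LAUNCH_BLOCKING=1 is what makes the failure surface at the actual call site):

export TORCH_USE_CUDA_DSA=1     # device-side assertions (needs a torch build compiled with DSA)
export CUDA_LAUNCH_BLOCKING=1   # synchronize after each kernel launch so the error points at the real call
bash run_step3.sh               # hypothetical name for the launch script below

This results in the following, more detailed error message: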
******************[end] Initialized Reward Model [end] (duration: 2.13s)******************
***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 3692
------------------------------------------------------
Free memory : 9.476501 (GigaBytes)
Total memory: 14.620972 (GigaBytes)
Requested memory: 0.073242 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7fea90000000
------------------------------------------------------
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13)
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13)
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13)
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13)
[... the six kernel execution error lines above repeat verbatim many more times ...]
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
main()
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
out = trainer.generate_experience(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
seq = self._generate_sequence(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
generate_ret_vals = self._generate(*inputs, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/generation/utils.py", line 1508, in generate
return self.greedy_search(
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/generation/utils.py", line 2325, in greedy_search
outputs = self(
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1097, in forward
lm_logits = self.lm_head(hidden_states)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/deepspeed/module_inject/layers.py", line 50, in forward
output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1670525539683/work/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feb8784f457 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7feb878193ec in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7febb956c044 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x164bc (0x7febb95434bc in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7febb9546434 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4cf653 (0x7febcf6f9653 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7feb8782f9e0 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7feb8782faf9 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x72d9c8 (0x7febcf9579c8 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7febcf957cb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #11: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #12: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #13: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #14: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #15: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #16: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #17: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #18: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #19: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #20: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #21: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #22: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #23: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #24: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #25: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #26: <unknown function> + 0x1348e8 (0x564c098b88e8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #27: <unknown function> + 0x14860e (0x564c098cc60e in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #28: <unknown function> + 0x1485fb (0x564c098cc5fb in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #29: <unknown function> + 0x11c661 (0x564c098a0661 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #30: PyDict_SetItemString + 0x4a (0x564c098a66aa in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #31: <unknown function> + 0x21470c (0x564c0999870c in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #32: Py_FinalizeEx + 0x186 (0x564c09997856 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #33: Py_RunMain + 0x112 (0x564c0998afe2 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #34: Py_BytesMain + 0x39 (0x564c0995d979 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #35: __libc_start_main + 0xea (0x7fec0ca6113a in /lib64/libc.so.6)
frame #36: <unknown function> + 0x1d9881 (0x564c0995d881 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
This is my bash script:
export OUTPUT_PATH=./output
mkdir -p $OUTPUT_PATH
ACTOR_ZERO_STAGE="--actor_zero_stage 2"
CRITIC_ZERO_STAGE="--critic_zero_stage 2"
ACTOR_MODEL_PATH="../step1_supervised_finetuning/output" # Provide the ckpt path of the actor model
CRITIC_MODEL_PATH="../step2_reward_model_finetuning/output" # Provide the ckpt path of the critic model
Actor_Lr=5e-4
Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
--data_path my_dataset.json \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 0 \
--per_device_train_batch_size 1 \
--per_device_mini_train_batch_size 1 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 512 \
--max_prompt_seq_len 512 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
${ACTOR_ZERO_STAGE} \
${CRITIC_ZERO_STAGE} \
--actor_lora_dim 128 \
--enable_hybrid_engine \
--actor_lora_module_name decoder.layers. \
--output_dir $OUTPUT_PATH \
&> $OUTPUT_PATH/training.log
@EikeKohl have you solved this problem? I get the same error.
Hey @XiaoLaoDi, not yet, but here is what I have tried so far:
- Used a machine with more VRAM that should definitely be able to fit the model, to rule out any OOM issues
- Set up a new venv (Python 3.10, torch==2.0.0, transformers==4.28.1 as well as from git as specified in the requirements.txt; I tried both) on an EC2 instance that is not a SageMaker instance, using pyenv instead of conda for handling the venv (sketched after this list)
- Tried a model from a different model family (OPT instead of GPT2)
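For completeness, the fresh environment was set up roughly like this (a sketch; the exact pyenv commands are approximations, and pyenv virtualenv requires the separate pyenv-virtualenv plugin):

pyenv install 3.10.11                           # any recent Python 3.10.x
pyenv virtualenv 3.10.11 ds-chat                # requires the pyenv-virtualenv plugin
pyenv activate ds-chat
pip install torch==2.0.0 transformers==4.28.1   # pinned versions
# alternatively, transformers from git as specified in requirements.txt:
# pip install git+https://github.com/huggingface/transformers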
As for the CUDA setup, I tried
- the pre-installed versions available on AWS SageMaker instances,
- running the training in the pytorch/pytorch Docker container (sketched after this list), and
- the amazon/Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) 20230401 image for the EC2 instance.
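The Docker attempt looked approximately like this (image tag and mount are assumptions, not taken from this thread):

docker run --gpus all --rm -it \
    -v "$PWD":/workspace \
    pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel \
    bash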
However, the error remains...
@XiaoLaoDi What does your setup look like? Maybe we can identify similarities and possible problem areas.
Maybe see this issue: https://github.com/microsoft/DeepSpeedExamples/issues/335#issuecomment-1521105300
@ruihan0495 thank you for the info. Not using DeepSpeed-HE does indeed make training possible 🙂. I ran into another exception a little later in the code, but that is probably due to poor model quality. I am currently working on fixing that issue as well.
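For anyone hitting the same error: the workaround amounts to dropping the --enable_hybrid_engine flag from the deepspeed launch command in the script above, i.e. (a sketch; all other arguments unchanged):

deepspeed --master_port 12346 main.py \
    ...                              # all other flags exactly as in the script above
    ${ACTOR_ZERO_STAGE} \
    ${CRITIC_ZERO_STAGE} \
    --actor_lora_dim 128 \
    --actor_lora_module_name decoder.layers. \
    --output_dir $OUTPUT_PATH        # note: no --enable_hybrid_engine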
@EikeKohl I got the same error when I trained GPT2, and I also used an AWS EC2 instance. Did you figure out what the problem is?
@DehongXu tbh I haven't used DeepSpeed RLHF in a while, but I remember that there was a known issue with the hybrid engine that was supposed to be fixed in upcoming updates. Disabling the hybrid engine made it work for me at that time.