Step 3: RuntimeError: CUDA error: misaligned address
I am trying to run step 3 (RLHF) with my previously trained Actor and Reward models. However, I encounter the following exception:
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
main()
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
out = trainer.generate_experience(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
seq = self._generate_sequence(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 238, in generate
with GatheredParameters(non_active_layers):
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1596, in __exit__
self.params[0].partition(param_list=self.params, has_been_updated=False)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 913, in partition
self._partition(param_list, has_been_updated=has_been_updated)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1050, in _partition
self._partition_param(param, has_been_updated=has_been_updated)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1060, in _partition_param
assert param.ds_status is not ZeroParamStatus.INFLIGHT, f" {param} Cannot partition a param in flight"
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor.py", line 873, in __format__
return object.__format__(self, format_spec)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/nn/parameter.py", line 60, in __repr__
return 'Parameter containing:\n' + super().__repr__()
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor.py", line 426, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 636, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 567, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/home/ec2-user/anaconda3/lib/python3.10/site-packages/torch/_tensor_str.py", line 309, in _tensor_str
self = self.float()
RuntimeError: CUDA error: misaligned address
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
I then set TORCH_USE_CUDA_DSA=1 and CUDA_LAUNCH_BLOCKING=1 for better debugging.
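Roughly like this (a sketch; run_step3.sh is a hypothetical stand-in for the launch script shown further below, TORCH_USE_CUDA_DSA only takes effect in a PyTorch build compiled with device-side assertions, and CUDA_LAUNCH_BLOCKING=1 is what makes the failure surface at the actual call site):

export TORCH_USE_CUDA_DSA=1     # device-side assertions (needs a torch build compiled with DSA)
export CUDA_LAUNCH_BLOCKING=1   # synchronize after each kernel launch so the error points at the real call
bash run_step3.sh               # hypothetical name for the launch script below

This results in the following, more detailed error message: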
******************[end] Initialized Reward Model [end] (duration: 2.13s)******************
***** Running training *****
Beginning of Epoch 1/1, Total Generation Batches 3692
------------------------------------------------------
Free memory : 9.476501 (GigaBytes)
Total memory: 14.620972 (GigaBytes)
Requested memory: 0.073242 (GigaBytes)
Setting maximum total tokens (input + output) to 1024
WorkSpace: 0x7fea90000000
------------------------------------------------------
!!!! kernel execution error. (m: 514, n: 1, k: 21, error: 13)
!!!! kernel execution error. (m: 21, n: 1, k: 514, error: 13)
!!!! kernel execution error. (m: 768, n: 1, k: 256, error: 13)
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13)
!!!! kernel execution error. (m: 3072, n: 1, k: 768, error: 13)
!!!! kernel execution error. (m: 768, n: 1, k: 768, error: 13)
[... the six kernel execution error lines above repeat verbatim many more times ...]
Traceback (most recent call last):
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 516, in <module>
main()
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 425, in main
out = trainer.generate_experience(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 97, in generate_experience
seq = self._generate_sequence(prompts)
File "/home/ec2-user/SageMaker/deepspeedexamples-fork/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
seq = self.actor_model.module.generate(prompts,
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/deepspeed/runtime/hybrid_engine.py", line 254, in generate
generate_ret_vals = self._generate(*inputs, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/generation/utils.py", line 1508, in generate
return self.greedy_search(
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/generation/utils.py", line 2325, in greedy_search
outputs = self(
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1097, in forward
lm_logits = self.lm_head(hidden_states)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/deepspeed/module_inject/layers.py", line 50, in forward
output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)`
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: misaligned address
Exception raised from c10_cuda_check_implementation at /opt/conda/conda-bld/pytorch_1670525539683/work/c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feb8784f457 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7feb878193ec in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7febb956c044 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x164bc (0x7febb95434bc in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7febb9546434 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x4cf653 (0x7febcf6f9653 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7feb8782f9e0 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7feb8782faf9 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x72d9c8 (0x7febcf9579c8 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2a5 (0x7febcf957cb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #11: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #12: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #13: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #14: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #15: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #16: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #17: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #18: <unknown function> + 0x127cc8 (0x564c098abcc8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #19: <unknown function> + 0x24be98 (0x564c099cfe98 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #20: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #21: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #22: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #23: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #24: <unknown function> + 0x127db5 (0x564c098abdb5 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #25: <unknown function> + 0x150776 (0x564c098d4776 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #26: <unknown function> + 0x1348e8 (0x564c098b88e8 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #27: <unknown function> + 0x14860e (0x564c098cc60e in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #28: <unknown function> + 0x1485fb (0x564c098cc5fb in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #29: <unknown function> + 0x11c661 (0x564c098a0661 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #30: PyDict_SetItemString + 0x4a (0x564c098a66aa in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #31: <unknown function> + 0x21470c (0x564c0999870c in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #32: Py_FinalizeEx + 0x186 (0x564c09997856 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #33: Py_RunMain + 0x112 (0x564c0998afe2 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #34: Py_BytesMain + 0x39 (0x564c0995d979 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
frame #35: __libc_start_main + 0xea (0x7fec0ca6113a in /lib64/libc.so.6)
frame #36: <unknown function> + 0x1d9881 (0x564c0995d881 in /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python3.9)
This is my bash script:
export OUTPUT_PATH=./output
mkdir -p $OUTPUT_PATH
ACTOR_ZERO_STAGE="--actor_zero_stage 2"
CRITIC_ZERO_STAGE="--critic_zero_stage 2"
ACTOR_MODEL_PATH="../step1_supervised_finetuning/output" # Provide the ckpt path of the actor model
CRITIC_MODEL_PATH="../step2_reward_model_finetuning/output" # Provide the ckpt path of the critic model
Actor_Lr=5e-4
Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
--data_path my_dataset.json \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 0 \
--per_device_train_batch_size 1 \
--per_device_mini_train_batch_size 1 \
--generation_batch_numbers 1 \
--ppo_epochs 1 \
--max_answer_seq_len 512 \
--max_prompt_seq_len 512 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
${ACTOR_ZERO_STAGE} \
${CRITIC_ZERO_STAGE} \
--actor_lora_dim 128 \
--enable_hybrid_engine \
--actor_lora_module_name decoder.layers. \
--output_dir $OUTPUT_PATH \
&> $OUTPUT_PATH/training.log
@EikeKohl have you solved this problem? I get the same error.
Hey @XiaoLaoDi, not yet, but here is what I have tried so far:
- Used a machine with more VRAM that should definitely be able to fit the model, to rule out any OOM issues
- Set up a new venv (Python 3.10, torch==2.0.0, transformers==4.28.1 as well as from git as specified in the requirements.txt; I tried both) on an EC2 instance that is not a SageMaker instance, using pyenv instead of conda for handling the venv (sketched after this list)
- Tried a model from a different model family (OPT instead of GPT2)
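For completeness, the fresh environment was set up roughly like this (a sketch; the exact pyenv commands are approximations, and pyenv virtualenv requires the separate pyenv-virtualenv plugin):

pyenv install 3.10.11                           # any recent Python 3.10.x
pyenv virtualenv 3.10.11 ds-chat                # requires the pyenv-virtualenv plugin
pyenv activate ds-chat
pip install torch==2.0.0 transformers==4.28.1   # pinned versions
# alternatively, transformers from git as specified in requirements.txt:
# pip install git+https://github.com/huggingface/transformers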
As for the CUDA setup, I tried
- the pre-installed versions available on AWS SageMaker instances,
- running the training in the pytorch/pytorch Docker container (sketched after this list), and
- the amazon/Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) 20230401 image for the EC2 instance.
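The Docker attempt looked approximately like this (image tag and mount are assumptions, not taken from this thread):

docker run --gpus all --rm -it \
    -v "$PWD":/workspace \
    pytorch/pytorch:2.0.0-cuda11.7-cudnn8-devel \
    bash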
However, the error remains...
@XiaoLaoDi What does your setup look like? Maybe we can identify similarities and possible problem areas.
Maybe see this issue: https://github.com/microsoft/DeepSpeedExamples/issues/335#issuecomment-1521105300
@ruihan0495 thank you for the info. Not using DeepSpeed-HE does indeed make training possible 🙂. I ran into another exception a little later in the code, but that is probably due to poor model quality. I am currently working on fixing that issue as well.
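For anyone hitting the same error: the workaround amounts to dropping the --enable_hybrid_engine flag from the deepspeed launch command in the script above, i.e. (a sketch; all other arguments unchanged):

deepspeed --master_port 12346 main.py \
    ...                              # all other flags exactly as in the script above
    ${ACTOR_ZERO_STAGE} \
    ${CRITIC_ZERO_STAGE} \
    --actor_lora_dim 128 \
    --actor_lora_module_name decoder.layers. \
    --output_dir $OUTPUT_PATH        # note: no --enable_hybrid_engine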
@EikeKohl I got the same error when I trained GPT2, and I also used an AWS EC2 instance. Did you figure out what the problem is?
@DehongXu tbh I haven't used DeepSpeed RLHF in a while, but I remember that there was a known issue with the hybrid engine that was supposed to be fixed in upcoming updates. Disabling the hybrid engine made it work for me at that time.