
what(): CUDA error: an illegal memory access was encountered

Open qinzhiliang opened this issue 1 year ago • 8 comments

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b1fd134d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1b1fcdd36b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1b1fdafb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1c36b (0x7f1b1fd8036b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #4: + 0x2b930 (0x7f1b1fd8f930 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d56d6 (0x7f1b867306d6 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x3ee77 (0x7f1b1fcf8e77 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::copy_tensor_metadata_except_version_counter(c10::TensorImpl const*, c10::TensorImpl*, bool) + 0x41 (0x7f1b1fcf3391 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::copy_tensor_metadata(c10::TensorImpl const*, c10::TensorImpl*, c10::VariableVersion const&, bool) + 0x14 (0x7f1b1fcf3404 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
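A quick debugging sketch, not from the original report: CUDA errors like this are raised asynchronously, so the trace above may point at a call far from the one that actually faulted. Re-running the same command with synchronous kernel launches usually narrows it down (the device-side assertions the message mentions additionally require a PyTorch build compiled with TORCH_USE_CUDA_DSA):

# Serialize CUDA kernel launches so the error surfaces at the faulting call.
export CUDA_LAUNCH_BLOCKING=1
# Then rerun the same deepspeed command and inspect the new trace in training.log.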

step1:
deepspeed main.py \
   --data_path bote/gpt_part_data \
   --data_split 2,4,4 \
   --model_name_or_path FreedomIntelligence/phoenix-inst-chat-7b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
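
The step 1 and step 2 commands assume two shell variables that are not shown in the report. A minimal sketch of how they might be set (the values here are illustrative placeholders, not the reporter's actual settings):

ZERO_STAGE=3                # hypothetical; the report does not say which stage was used
OUTPUT=./output/step1_sft   # hypothetical output directory
mkdir -p $OUTPUT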

step2:
deepspeed main.py \
   --data_path bote/whoareyou \
   --data_split 2,4,4 \
   --model_name_or_path bigscience/bloomz-560m \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --disable_dropout \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log

step3:
deepspeed --master_port 12346 main.py \
   --data_path bote/gpt_part_data \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --num_warmup_steps 100 \
   --deepspeed \
   --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
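
The step 3 command likewise depends on variables defined outside the snippet. A hedged sketch of the kind of assignments it expects, with placeholder values rather than the reporter's actual ones:

ACTOR_MODEL_PATH=./output/step1_sft       # hypothetical path to the step 1 checkpoint
CRITIC_MODEL_PATH=./output/step2_reward   # hypothetical path to the step 2 checkpoint
ACTOR_ZERO_STAGE=3                        # illustrative; not stated in the report
CRITIC_ZERO_STAGE=3
Actor_Lr=9.65e-6                          # illustrative learning rates
Critic_Lr=5e-6
OUTPUT=./output/step3_rlhf
mkdir -p $OUTPUT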

GPU: 8 × A40 (48 GB)

qinzhiliang avatar Jun 13 '23 08:06 qinzhiliang

Same here. Has this been solved?

HalcyonLiang avatar Jun 14 '23 07:06 HalcyonLiang

same

beichengus avatar Jun 27 '23 02:06 beichengus

Same here. The error occurs when I use ZeRO-3, but the run completes correctly with ZeRO-1.
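
A sketch of that workaround, assuming the launch commands above: lowering the ZeRO stage passed to the scripts avoids the crash, at the cost of higher per-GPU memory use:

ZERO_STAGE=1          # for steps 1 and 2, instead of stage 3
ACTOR_ZERO_STAGE=1    # step 3 equivalents
CRITIC_ZERO_STAGE=1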

NostalgiaOfTime avatar Jul 03 '23 07:07 NostalgiaOfTime

Also experiencing this regardless of ZeRO stage. Has anyone found a workaround?

Edit: moving from deepspeed 0.9.3 -> 0.9.5 seems to have resolved my issue.
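
For reference, the fix described above is just a version bump, e.g.:

pip install deepspeed==0.9.5   # version that resolved the crash for this commenter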

rgxb2807 avatar Jul 21 '23 21:07 rgxb2807

I am getting this error with deepspeed 0.10.3

adibMosharrof avatar Oct 03 '23 21:10 adibMosharrof

I am getting this error with deepspeed 0.10.3

Has the error been solved, bro?

Luoxiaohei41 avatar Nov 04 '23 14:11 Luoxiaohei41

I am getting this error with deepspeed 0.10.3

Bro, has the problem been solved?

Luoxiaohei41 avatar Nov 14 '23 03:11 Luoxiaohei41

same

WooooDyy avatar Dec 16 '23 15:12 WooooDyy