DeepSpeedExamples
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f1b1fd134d7 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f1b1fcdd36b in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f1b1fdafb58 in /root/miniconda3/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3:
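Because CUDA reports illegal memory accesses asynchronously, the frames above usually point at the error check rather than at the faulting kernel. A minimal debugging sketch, assuming the step1 launch shown below is kept otherwise unchanged (debug.log is an illustrative file name, and only the leading environment variables are new):

# Debug-only rerun: serializing kernel launches surfaces the failing op at its real call site (much slower).
CUDA_LAUNCH_BLOCKING=1 TORCH_SHOW_CPP_STACKTRACES=1 \
deepspeed main.py --data_path bote/gpt_part_data --model_name_or_path FreedomIntelligence/phoenix-inst-chat-7b \
   --zero_stage $ZERO_STAGE --deepspeed --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/debug.log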
step1:
deepspeed main.py \
   --data_path bote/gpt_part_data \
   --data_split 2,4,4 \
   --model_name_or_path FreedomIntelligence/phoenix-inst-chat-7b \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
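The $ZERO_STAGE and $OUTPUT variables come from the surrounding launch script, which is not shown here; a minimal sketch of how they might be defined before the step1 command (values are illustrative only):

ZERO_STAGE=3                # the stage under which the crash is reported further down
OUTPUT=./output/step1_sft   # hypothetical output directory
mkdir -p $OUTPUT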
step2:
deepspeed main.py \
   --data_path bote/whoareyou \
   --data_split 2,4,4 \
   --model_name_or_path bigscience/bloomz-560m \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --max_seq_len 512 \
   --learning_rate 5e-5 \
   --weight_decay 0.1 \
   --num_train_epochs 1 \
   --disable_dropout \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
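In the DeepSpeed-Chat pipeline, the reward model trained in step2 is what step3 later loads as the critic, so its output directory matters below. A sketch of the step2 variables, again with illustrative values only:

ZERO_STAGE=3
OUTPUT=./output/step2_reward   # hypothetical path, reused later as $CRITIC_MODEL_PATH
mkdir -p $OUTPUT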
step3:
deepspeed --master_port 12346 main.py \
   --data_path bote/gpt_part_data \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --output_dir $OUTPUT \
   2>&1 | tee $OUTPUT/training.log
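Step3 pulls in several variables defined elsewhere in the launch script. A minimal sketch of how they could be wired to the earlier steps, assuming the hypothetical output paths used above (the learning rates and ZeRO stages here are illustrative, not necessarily the reporter's exact values):

ACTOR_MODEL_PATH=./output/step1_sft       # SFT model from step1 becomes the actor
CRITIC_MODEL_PATH=./output/step2_reward   # reward model from step2 becomes the critic
Actor_Lr=9.65e-6
Critic_Lr=5e-6
ACTOR_ZERO_STAGE=3
CRITIC_ZERO_STAGE=3
OUTPUT=./output/step3_rlhf
mkdir -p $OUTPUT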
GPU: 8 × A40 (48 GB)
Same issue here, has it been solved?
Same here.
Same here: the error occurs when I use ZeRO-3, but training runs correctly with ZeRO-1.
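For anyone seeing the same ZeRO-3-only failure, the quickest way to reproduce that observation is to lower the stage in the launch variables; this is a sketch of the workaround, not a fix for the underlying bug:

# step1 / step2
ZERO_STAGE=1
# step3
ACTOR_ZERO_STAGE=1
CRITIC_ZERO_STAGE=1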
Also experiencing this regardless of ZeRO stage. Has anyone found a workaround?
Edit: moving from DeepSpeed 0.9.3 to 0.9.5 seems to have resolved my issue.
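To try the same upgrade, a sketch of checking and pinning the DeepSpeed version (0.9.5 is just the version this commenter reported working; the comments below note the error still appears on 0.10.3):

python -c "import deepspeed; print(deepspeed.__version__)"   # current version
pip install --upgrade deepspeed==0.9.5                       # pin to the reported-working version
ds_report                                                    # re-check the DeepSpeed environment report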
I am getting this error with deepspeed 0.10.3
Has the error been solved, bro?
Bro, has the problem been solved?
Same here.