
Training on 8x RTX 4090 GPUs fails: RuntimeError: CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

madehua98 opened this issue 10 months ago • 2 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

39%|███▉ | 770/1968 [1:53:40<2:54:12, 8.73s/it]
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [606,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [606,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [606,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [606,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [606,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/ML-A800/home/guoshuyue/madehua/code/LLaMA-Factory/src/train_bash.py", line 15, in <module>
    main()
  File "/ML-A800/home/guoshuyue/madehua/code/LLaMA-Factory/src/train_bash.py", line 6, in main
    run_exp()
  File "/ML-A800/home/guoshuyue/madehua/code/LLaMA-Factory/src/llmtuner/train/tuner.py", line 40, in run_exp
    run_rm(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/ML-A800/home/guoshuyue/madehua/code/LLaMA-Factory/src/llmtuner/train/rm/workflow.py", line 50, in run_rm
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 2118, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/trainer.py", line 3036, in training_step
    loss = self.compute_loss(model, inputs)
  File "/ML-A800/home/guoshuyue/madehua/code/LLaMA-Factory/src/llmtuner/train/rm/trainer.py", line 51, in compute_loss
    _, _, values = model(**inputs, output_hidden_states=True, return_dict=True)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/trl/models/modeling_value_head.py", line 170, in forward
    base_model_output = self.pretrained_model(
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1196, in forward
    outputs = self.model(
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1561, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 990, in forward
    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds, cache_position)
  File "/ML-A800/home/guoshuyue/anaconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1109, in _update_causal_mask
    if not is_tracing and torch.any(attention_mask != 1):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I have read issue #1224, but I launch with deepspeed and the run fails at exactly the same step every time. Could someone help me figure this out? Thanks.

deepspeed --include localhost:$GPUs --master_port=2221 train_bash.py \
    --deepspeed $ds_config_path \
    --ddp_timeout 180000000 \
    --stage rm \
    --do_train \
    --do_eval \
    --model_name_or_path $model_path \
    --adapter_name_or_path $path_to_sft_checkpoint \
    --create_new_adapter \
    --dataset $dataset_path \
    --dataset_dir $dataset_dir \
    --template default \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --lora_rank 64 \
    --output_dir $path_to_rm_checkpoint \
    --overwrite_output_dir \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_strategy "epoch" \
    --dataloader_num_workers 4 \
    --learning_rate 5e-5 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --val_size 1000 \
    --seed 0 \
    --bf16
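A note on the assertion itself: `indexSelectLargeIndex ... Assertion srcIndex < srcSelectDimSize failed` is what PyTorch prints when an embedding lookup receives a token id that is not smaller than the embedding table. A minimal, hypothetical check along these lines (the model path and sample text are placeholders, not values from this report) can show whether the tokenizer produces ids outside the model's vocabulary:

```python
# Hypothetical sanity check (model path and sample text are placeholders):
# compare the ids the tokenizer emits against the rows of the input embedding.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama-2-7b"  # placeholder, not the reporter's actual path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

embedding_rows = model.get_input_embeddings().weight.shape[0]
print(f"tokenizer vocab: {len(tokenizer)}, embedding rows: {embedding_rows}")

ids = tokenizer("paste one of the failing training samples here")["input_ids"]
out_of_range = [i for i in ids if i >= embedding_rows]
print("out-of-range token ids:", out_of_range)  # non-empty -> GPU-side assert in the embedding lookup
```

If `len(tokenizer)` is larger than the number of embedding rows, that mismatch is consistent with the device-side assert above.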

Expected behavior

No response

System Info

No response

Others

No response

madehua98 avatar Apr 10 '24 07:04 madehua98

Which model is it? Try adding the --resize_vocab argument?

hiyouga avatar Apr 10 '24 16:04 hiyouga

Which model is it? Try adding the --resize_vocab argument?

The model is llama-2-7b-ms. Would --resize_vocab help with this problem?

madehua98 avatar Apr 11 '24 06:04 madehua98
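For later readers: the sketch below is an assumption about what --resize_vocab amounts to, not LLaMA-Factory's actual implementation. The idea is to make the model's token embeddings match the tokenizer, so no id can index past the embedding table; in plain transformers terms that is roughly (placeholder path):

```python
# Rough, assumed equivalent of resizing the vocabulary by hand; the model path
# is a placeholder and this is not LLaMA-Factory's actual code path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/llama-2-7b"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

if len(tokenizer) != model.get_input_embeddings().weight.shape[0]:
    # Make every tokenizer id a valid row of the input/output embeddings.
    model.resize_token_embeddings(len(tokenizer))
```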