
sh train_rm.sh bug

Open • MountainHolder opened this issue 2 years ago • 1 comment

🐛 Describe the bug

Running:

    export CUDA_VISIBLE_DEVICES=0
    torchrun --standalone --nproc_per_node=1 train_reward_model.py \
        --pretrain /root/llama/Coati-7B \
        --model 'llama' \
        --strategy naive \
        --loss_fn 'log_sig' \
        --save_path /root/llama/llama-reward \
        --dataset 'Anthropic/hh-rlhf' \
        --batch_size 1 \
        --max_epochs 1
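
For context, `--loss_fn 'log_sig'` selects a pairwise ranking objective for the reward model: the reward of the chosen response is pushed above the reward of the rejected one. Below is a minimal sketch of that objective, assuming `log_sig` denotes the usual `-log sigmoid(r_chosen - r_rejected)` loss; it is not the exact coati implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_log_sigmoid_loss(chosen_reward: torch.Tensor,
                              rejected_reward: torch.Tensor) -> torch.Tensor:
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch:
    # the loss shrinks as the chosen reward rises above the rejected reward.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy example with two (chosen, rejected) reward pairs.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.1, 0.5])
print(pairwise_log_sigmoid_loss(chosen, rejected))
```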

The error is as follows:

    ../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    ../aten/src/ATen/native/cuda/Indexing.cu:1141: indexSelectLargeIndex: block: [118,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    [the same assertion repeats for threads [42,0,0] through [63,0,0] of block [118,0,0]]
    Traceback (most recent call last):
      File "/workspace/ColossalAI/applications/Chat/examples/train_reward_model.py", line 220, in <module>
        train(args)
      File "/workspace/ColossalAI/applications/Chat/examples/train_reward_model.py", line 187, in train
        trainer.fit()
      File "/opt/conda/lib/python3.9/site-packages/coati/trainer/rm.py", line 98, in fit
        chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
      File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/coati/models/base/reward_model.py", line 37, in forward
        outputs = self.model(sequences, attention_mask=attention_mask)
      File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
        layer_outputs = decoder_layer(
      File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
      File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 366, in forward
        query_states = self.q_proj(hidden_states)
      File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
        return F.linear(input, self.weight, self.bias)
    RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
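
For what it's worth, the device-side assert `srcIndex < srcSelectDimSize` from `Indexing.cu` is raised by an embedding lookup when some input token id is greater than or equal to the number of rows in the embedding table (the checkpoint's `vocab_size`); the later `CUBLAS_STATUS_NOT_INITIALIZED` is usually just a follow-on error once the CUDA context has been poisoned. A quick way to check for a tokenizer/checkpoint vocabulary mismatch is sketched below; it assumes the checkpoint at `/root/llama/Coati-7B` (the path from the report) loads with the Hugging Face Auto classes.

```python
# Sketch of a vocabulary-mismatch check; not part of the coati code path.
# Assumes the checkpoint directory from the bug report loads with Auto classes.
from transformers import AutoConfig, AutoTokenizer

ckpt = "/root/llama/Coati-7B"  # path taken from the bug report

config = AutoConfig.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

print("model vocab_size :", config.vocab_size)
print("tokenizer size   :", len(tokenizer))
print("pad_token_id     :", tokenizer.pad_token_id)

# Tokenize a small sample in the HH-RLHF prompt style and inspect the ids.
sample_ids = tokenizer("Human: hello\n\nAssistant: hi there")["input_ids"]
print("max token id in sample:", max(sample_ids))
```

If `len(tokenizer)`, `pad_token_id`, or any produced token id is greater than or equal to `config.vocab_size` (for example because a pad token was added without resizing the model's embeddings), that mismatch would explain exactly this assert.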

Environment

No response

MountainHolder • Dec 27, 2023

Hi, can you pull the latest code? It seems that the naive strategy has been removed.
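
If it helps, after pulling the latest code and reinstalling, the installed versions can be confirmed with a quick check like the one below (this assumes `colossalai` exposes `__version__` and that the Chat application is installed as the `coati` package):

```python
# Quick sanity check of the installed packages after updating.
import importlib.metadata as md

import colossalai  # assumes __version__ is exposed, as in recent releases

print("colossalai:", colossalai.__version__)
try:
    print("coati     :", md.version("coati"))  # package name assumed from applications/Chat
except md.PackageNotFoundError:
    print("coati     : not installed as a package")
```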

flybird11111 • Jan 22, 2024