[Chatllama] RLHF training for Actor
When I was training the actor with reinforcement learning, I encountered the following bug:
Current device used :cuda
Start RL Training
Episode: 1 of 100, Timestep: 1 of 8
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [33,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [34,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [35,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [36,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [37,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [38,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [39,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [40,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [41,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [42,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [43,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [44,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [45,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [46,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [48,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [49,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [50,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [51,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [52,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [53,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [54,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [55,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [56,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [57,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [58,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [59,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [60,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [61,0,0] Assertion srcIndex < srcSelectDimSize failed.
values = self.critic.forward(sequences, sequences_mask)
File "<@beartype(chatllama.rlhf.reward.RewardModel.forward) at 0x2afb7867aaf0>", line 51, in forward
File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 133, in forward
output = self.model(
File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 831, in forward
position_embeds = self.wpe(position_ids)
File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
My config.yaml:

trainer_config:
  actor_lr: 0.00001
  critic_lr: 0.00001
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  examples_path: "./datasets/rlhf_training_data.json.repair"
  num_episodes: 100
  max_timesteps: 8
  update_timesteps: 8
  num_examples: 8
  batch_size: 1
  epochs: 1
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"

actor_config:
  model: "facebook/opt-125m"
  model_path: "path-to-model"
  checkpoint_folder: "./models"
  tokenizer_folder: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: False
  max_sequence_length: 2048
  max_tokens: 1024
  temperature: 0.9
  batch_size: 6
  iteration_per_print: 100
  lr: 0.0001
  epochs: 5
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

reward_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 32
  iteration_per_print: 1
  lr: 0.0001
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

critic_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"
Hi @Vincent131499, thank you for your feedback. We are testing a fix for this problem, which is probably caused by the length of the model/data sequences. Once we think it is solved, we will write back so that you can check whether the problem persists.
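The first traceback fails inside `self.wpe(position_ids)`, GPT-2's position-embedding lookup, which fits the sequence-length explanation. The following is a minimal sketch of a pre-flight check, not the chatllama fix. It assumes the critic sees the prompt plus the generated tokens, and that `gpt2-large` supports 1024 positions (its published `n_positions`). The function names are hypothetical and not part of chatllama.

```python
def position_budget(prompt_len: int, max_new_tokens: int, max_positions: int) -> int:
    """Return how many positions are left over (a negative value means overflow)."""
    return max_positions - (prompt_len + max_new_tokens)


def clamp_prompt_len(prompt_len: int, max_new_tokens: int, max_positions: int) -> int:
    """Largest prompt length that still leaves room for max_new_tokens."""
    if max_new_tokens >= max_positions:
        raise ValueError("max_new_tokens alone exceeds the model's position limit")
    return min(prompt_len, max_positions - max_new_tokens)


GPT2_LARGE_POSITIONS = 1024  # n_positions of gpt2-large

# Worst case under the posted config, assuming max_sequence_length (2048)
# bounds the prompt and max_tokens (1024) bounds the generated tokens.
print(position_budget(2048, 1024, GPT2_LARGE_POSITIONS))  # -2048
```

Under these assumptions the posted config can exceed the critic's position table by up to 2048 positions, so either the actor limits would need to shrink or the critic would need a longer context.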
@PierpaoloSorbellini I am hitting the same problem. Has it been solved?
Current device used :cuda
Loading
Start RL Training
Episode: 1 of 100, Timestep: 1 of 32
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [175,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
Traceback (most recent call last):
File "artifacts/main.py", line 51, in <module>
rlhf_trainer.train()
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 655, in train
) = self.actorcritic.generate(states, states_mask)
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f92593d4f70>", line 51, in generate
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 144, in generate
actions, sequence = self.actor.generate(states, state_mask)
File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f925bd03160>", line 51, in generate
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 163, in generate
sequences = self.model.generate(
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 529, in generate
logits = self._forward(input_ids, attention_mask)[:, -1, :]
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 503, in _forward
h, cache_k, cache_v = layer(
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 401, in forward
attn, cache_k, cache_v = self.attention.forward(
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 281, in forward
xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Hi @cokuehuang, yes, we have found the problem and will release a fix very soon. We are also fixing other issues so that we have a more stable code base to build new features on. Thanks for your patience; I will let you know when it is released.
Hi @cokuehuang, you can try PR #306, which should address the problem.
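For anyone debugging the same assertion: `srcIndex < srcSelectDimSize` means an index passed to an embedding-style lookup is at least as large as the number of rows in the table. CUDA reports this asynchronously, so the Python traceback may point at a later, unrelated call; the `cublasCreate` error in the second log may be such a follow-on. Setting the environment variable `CUDA_LAUNCH_BLOCKING=1` or running one batch on CPU usually gives a more precise location. The sketch below is a plain-Python illustration of the check itself. The helper name and the sample values are hypothetical.

```python
from typing import List, Tuple


def out_of_range(indices: List[List[int]], table_rows: int) -> List[Tuple[int, int, int]]:
    """Return (row, column, value) for every index outside [0, table_rows)."""
    bad = []
    for r, row in enumerate(indices):
        for c, value in enumerate(row):
            if value < 0 or value >= table_rows:
                bad.append((r, c, value))
    return bad


# Hypothetical batch: position ids 1022..1025 looked up in a 1024-row table.
position_ids = [[1022, 1023, 1024, 1025]]
print(out_of_range(position_ids, table_rows=1024))  # [(0, 2, 1024), (0, 3, 1025)]
```

The same check applies to token ids against the vocabulary size when the tokenizer and model do not match.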