
[Chatllama] RLHF training for Actor

Open Vincent131499 opened this issue 2 years ago • 4 comments

When I was training the actor with reinforcement learning, I encountered the following bug:

Current device used :cuda
Start RL Training
Episode: 1 of 100, Timestep: 1 of 8
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [113,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
... (the same assertion repeats for threads [33,0,0] through [63,0,0] of block [113,0,0])
../aten/src/ATen/native/cuda/Indexing.cu:703: indexSelectLargeIndex: block: [148,0,0], thread: [32,0,0] Assertion srcIndex < srcSelectDimSize failed.
... (the same assertion repeats for threads [33,0,0] through [61,0,0] of block [148,0,0])
    values = self.critic.forward(sequences, sequences_mask)
  File "<@beartype(chatllama.rlhf.reward.RewardModel.forward) at 0x2afb7867aaf0>", line 51, in forward
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/chatllama/rlhf/reward.py", line 133, in forward
    output = self.model(
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 831, in forward
    position_embeds = self.wpe(position_ids)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "/mnt/lustre02/jiangsu/aispeech/home/gfl18/.conda/envs/py38-llm/lib/python3.8/site-packages/torch/nn/functional.py", line 2183, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered

My config.yaml:

trainer_config:
  actor_lr: 0.00001
  critic_lr: 0.00001
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.1
  examples_path: "./datasets/rlhf_training_data.json.repair"
  num_episodes: 100
  max_timesteps: 8
  update_timesteps: 8
  num_examples: 8
  batch_size: 1
  epochs: 1
  update_checkpoint: 8
  checkpoint_folder: "./models/checkpoints"

actor_config:
  model: "facebook/opt-125m"
  model_path: "path-to-model"
  checkpoint_folder: "./models"
  tokenizer_folder: "path-to-tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  froze_embeddings: True
  use_fairscale: False
  max_sequence_length: 2048
  max_tokens: 1024
  temperature: 0.9
  batch_size: 6
  iteration_per_print: 100
  lr: 0.0001
  epochs: 5
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

reward_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 32
  iteration_per_print: 1
  lr: 0.0001
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

critic_config:
  model: "gpt2-large"
  model_head_hidden_size: 2048
  model_folder: "./models"
  deepspeed_enable: False
  deepspeed_config_path: "path-to-deepspeed-conf"

Vincent131499 • Mar 10 '23 05:03

Hi @Vincent131499. Thank you for your feedback. We are testing a fix for this problem, which is probably related to the length of the model/data sequences. Once we think it is solved, we will write back so that you can check whether the problem persists on your side.

PierpaoloSorbellini • Mar 10 '23 14:03
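
A minimal sketch of the sequence-length explanation given in the reply above. It rests on assumptions that the thread does not confirm: the traceback fails in `self.wpe(position_ids)` of the `gpt2-large` critic, GPT-2 checkpoints use a position-embedding table of 1024 rows (`n_positions`), and the actor is configured with `max_sequence_length: 2048`, so a sequence handed to the critic might exceed 1024 tokens. The helper names `fits_position_table` and `truncate_left` are hypothetical and are not part of chatllama.

```python
from typing import List, Sequence

# Assumption: GPT-2 checkpoints (including gpt2-large) have a learned
# position-embedding table with 1024 rows (config field `n_positions`).
GPT2_N_POSITIONS = 1024


def fits_position_table(seq_len: int, n_positions: int = GPT2_N_POSITIONS) -> bool:
    """True when every position id 0..seq_len-1 is a valid row of the table."""
    return 0 < seq_len <= n_positions


def truncate_left(token_ids: Sequence[int],
                  n_positions: int = GPT2_N_POSITIONS) -> List[int]:
    """Hypothetical guard: keep only the most recent `n_positions` tokens."""
    return list(token_ids[-n_positions:])


# Illustrative worst case: a 2048-token sequence reaching a 1024-position model.
sequence = list(range(2048))
print(fits_position_table(len(sequence)))  # False

safe = truncate_left(sequence)
print(fits_position_table(len(safe)))      # True
```

An index equal to or larger than the table size is the condition that the `srcIndex < srcSelectDimSize` assertion rejects, so a check of this kind before calling the critic would turn the device-side assert into an ordinary, readable error. Whether truncation or a smaller `max_sequence_length` is the right remedy depends on the fix the maintainers choose.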

@PierpaoloSorbellini I am hitting the same problem. Has it been solved?

Current device used :cuda
Loading
Start RL Training
Episode: 1 of 100, Timestep: 1 of 32
../aten/src/ATen/native/cuda/Indexing.cu:975: indexSelectLargeIndex: block: [175,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
...
Traceback (most recent call last):
  File "artifacts/main.py", line 51, in <module>
    rlhf_trainer.train()
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 655, in train
    ) = self.actorcritic.generate(states, states_mask)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f92593d4f70>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 144, in generate
    actions, sequence = self.actor.generate(states, state_mask)
  File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f925bd03160>", line 51, in generate
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 163, in generate
    sequences = self.model.generate(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 529, in generate
    logits = self._forward(input_ids, attention_mask)[:, -1, :]
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 503, in _forward
    h, cache_k, cache_v = layer(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 401, in forward
    attn, cache_k, cache_v = self.attention.forward(
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/chatllama/llama_model.py", line 281, in forward
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/alpa/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

cokuehuang • Mar 13 '23 09:03
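
A debugging sketch for traces like the one above, offered as a general technique rather than as the fix from this thread. CUDA kernels are launched asynchronously, so the Python frame where the error surfaces (here `F.linear` and `cublasCreate`) may not be the operation that tripped the assertion. Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes launches synchronous, which usually moves the traceback to the failing call. The helper name `with_blocking_launches` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def with_blocking_launches(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `env` (default: the current environment) with
    CUDA_LAUNCH_BLOCKING=1, so that CUDA kernel launches run synchronously."""
    new_env = dict(os.environ if env is None else env)
    new_env["CUDA_LAUNCH_BLOCKING"] = "1"
    return new_env


debug_env = with_blocking_launches({"PATH": "/usr/bin"})
print(debug_env["CUDA_LAUNCH_BLOCKING"])  # 1

# Possible use (not executed here), re-running the script from the traceback:
#   subprocess.run([sys.executable, "artifacts/main.py"],
#                  env=with_blocking_launches())
```

The variable has to be set before the process initializes CUDA, which is why the sketch passes it to a fresh subprocess instead of changing it in a running session. Synchronous launches slow training down, so this is meant for a short diagnostic run only.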

Hi @cokuehuang. Yes, we have found the problem and will release a fix very soon. We are also fixing other issues so that we have a more stable code base on which to add features. Thanks for your patience; I will let you know when the fix is released.

PierpaoloSorbellini • Mar 14 '23 08:03

Hi @cokuehuang. You can try PR #306, where the problem should be addressed!

PierpaoloSorbellini • Apr 03 '23 14:04