CUDA Error when fine-tuning GPT-J for CausalLM

Open JohnnyRacer opened this issue 2 years ago • 6 comments

Hello, I am trying to fine-tune GPT-J for text generation by adapting this notebook. However, when I run trainer.train() I get the following CUDA error: RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`. According to the traceback, the error seems to originate from ./peft/src/peft/tuners/ line 277. Any ideas why this is happening or how to fix it?

The full traceback is below:

***** Running training *****
  Num examples = 36139
  Num Epochs = 6
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 6774
  Number of trainable parameters = 7340032
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/ UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

RuntimeError                              Traceback (most recent call last)
Cell In[14], line 1
----> 1 trainer.train()

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1538     self.model_wrapped = self.model
   1540 inner_training_loop = find_executable_batch_size(
   1541     self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
   1542 )
-> 1543 return inner_training_loop(
   1544     args=args,
   1545     resume_from_checkpoint=resume_from_checkpoint,
   1546     trial=trial,
   1547     ignore_keys_for_eval=ignore_keys_for_eval,
   1548 )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/accelerate/utils/, in find_executable_batch_size.<locals>.decorator(*args, **kwargs)
    122     raise RuntimeError("No executable batch size found, reached zero.")
    123 try:
--> 124     return function(batch_size, *args, **kwargs)
    125 except Exception as e:
    126     if should_reduce_batch_size(e):

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1789         tr_loss_step = self.training_step(model, inputs)
   1790 else:
-> 1791     tr_loss_step = self.training_step(model, inputs)
   1793 if (
   1794     args.logging_nan_inf_filter
   1795     and not is_torch_tpu_available()
   1796     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   1797 ):
   1798     # if loss is nan or inf simply add the average of previous logged losses
   1799     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/, in Trainer.training_step(self, model, inputs)
   2536     return loss_mb.reduce_mean().detach().to(self.args.device)
   2538 with self.compute_loss_context_manager():
-> 2539     loss = self.compute_loss(model, inputs)
   2541 if self.args.n_gpu > 1:
   2542     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/, in Trainer.compute_loss(self, model, inputs, return_outputs)
   2569 else:
   2570     labels = None
-> 2571 outputs = model(**inputs)
   2572 # Save past state if it exists
   2573 # TODO: this needs to be fixed and made cleaner later.
   2574 if self.args.past_index >= 0:

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/nlp/peft/src/peft/, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, **kwargs)
    490 def forward(
    491     self,
    492     input_ids=None,
    499     **kwargs,
    500 ):
    501     if not isinstance(self.peft_config, PromptLearningConfig):
--> 502         return self.base_model(
    503             input_ids=input_ids,
    504             attention_mask=attention_mask,
    505             inputs_embeds=inputs_embeds,
    506             labels=labels,
    507             output_attentions=output_attentions,
    508             output_hidden_states=output_hidden_states,
    509             return_dict=return_dict,
    510             **kwargs,
    511         )
    513     batch_size = input_ids.shape[0]
    514     if attention_mask is not None:
    515         # concat prompt attention mask

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/, in GPTJForCausalLM.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    805 r"""
    806 labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
    807     Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
    808     `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
    809     are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
    810 """
    811 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 813 transformer_outputs = self.transformer(
    814     input_ids,
    815     past_key_values=past_key_values,
    816     attention_mask=attention_mask,
    817     token_type_ids=token_type_ids,
    818     position_ids=position_ids,
    819     head_mask=head_mask,
    820     inputs_embeds=inputs_embeds,
    821     use_cache=use_cache,
    822     output_attentions=output_attentions,
    823     output_hidden_states=output_hidden_states,
    824     return_dict=return_dict,
    825 )
    826 hidden_states = transformer_outputs[0]
    828 # Set device for model parallelism

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/, in GPTJModel.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    656             return module(*inputs, use_cache, output_attentions)
    658         return custom_forward
--> 660     outputs = torch.utils.checkpoint.checkpoint(
    661         create_custom_forward(block),
    662         hidden_states,
    663         None,
    664         attention_mask,
    665         head_mask[i],
    666     )
    667 else:
    668     outputs = block(
    669         hidden_states,
    670         layer_past=layer_past,
    674         output_attentions=output_attentions,
    675     )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/, in checkpoint(function, use_reentrant, *args, **kwargs)
    246     raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
    248 if use_reentrant:
--> 249     return CheckpointFunction.apply(function, preserve, *args)
    250 else:
    251     return _checkpoint_without_reentrant(
    252         function,
    253         preserve,
    254         *args,
    255         **kwargs,
    256     )

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
    104 ctx.save_for_backward(*tensor_inputs)
    106 with torch.no_grad():
--> 107     outputs = run_function(*args)
    108 return outputs

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/, in GPTJModel.forward.<locals>.create_custom_forward.<locals>.custom_forward(*inputs)
    654 def custom_forward(*inputs):
    655     # None for past_key_value
--> 656     return module(*inputs, use_cache, output_attentions)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/, in GPTJBlock.forward(self, hidden_states, layer_past, attention_mask, head_mask, use_cache, output_attentions)
    300 residual = hidden_states
    301 hidden_states = self.ln_1(hidden_states)
--> 302 attn_outputs = self.attn(
    303     hidden_states,
    304     layer_past=layer_past,
    305     attention_mask=attention_mask,
    306     head_mask=head_mask,
    307     use_cache=use_cache,
    308     output_attentions=output_attentions,
    309 )
    310 attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
    311 outputs = attn_outputs[1:]

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/, in GPTJAttention.forward(self, hidden_states, attention_mask, layer_past, head_mask, use_cache, output_attentions)
    190 def forward(
    191     self,
    192     hidden_states: Optional[torch.FloatTensor],
    200     Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]],
    201 ]:
--> 203     query = self.q_proj(hidden_states)
    204     key = self.k_proj(hidden_states)
    205     value = self.v_proj(hidden_states)

File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/nlp/peft/src/peft/tuners/, in Linear.forward(self, x)
    275 def forward(self, x: torch.Tensor):
    276     if self.r > 0 and not self.merged:
--> 277         result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    278         if self.r > 0:
    279             result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
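
A note on reading this traceback: CUDA reports many errors asynchronously, so the frame shown (the `F.linear` call) is not necessarily where the problem began, and `CUBLAS_STATUS_NOT_INITIALIZED` is often reported as a secondary symptom of running out of GPU memory or of an earlier out-of-range index. A common debugging step, not something confirmed in this thread, is to force synchronous kernel launches with the `CUDA_LAUNCH_BLOCKING` environment variable. The helper name below is hypothetical; this is only a minimal sketch.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Ask CUDA to launch kernels synchronously so errors surface at the failing call."""
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env["CUDA_LAUNCH_BLOCKING"]


# Must run before the first CUDA call; the safest place is before `import torch`.
enable_sync_cuda_errors()
```

In a notebook this would go in the very first cell, followed by a kernel restart, since the variable has no effect once CUDA has already been initialized.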

JohnnyRacer, Feb 12 '23 03:02

I can also confirm that fine-tuning a Seq2SeqLM model works fine, and there are no problems when I test PyTorch functions that use cuBLAS, such as the following snippet:

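import time

import numpy as np
import torch

# Random 800 x 60800 matrix, moved to GPU 0
flatten_masks = np.random.random((800, 60800))
flatten_masks = torch.from_numpy(flatten_masks).cuda(device=0)

t1 = time.time()
i = 0
while i < 2500:
    if i == 500:
        # Restart the timer after 500 warm-up iterations
        t1 = time.time()
    # Matrix product with its own transpose, which goes through cuBLAS
    inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
    i += 1
t2 = time.time()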

JohnnyRacer, Feb 12 '23 04:02

Hi @JohnnyRacer, thanks for your interest in this! Would you be able to share the full notebook with us so that we can try to reproduce the issue? Also, can you run the script on CPU first? I suspect there is a label mismatch or an out-of-range index at some point (see a similar issue here),

i.e. something badly configured when initializing the LoRA layers.

Maybe you can just share how you initialized your model with peft?
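
When running the full model on CPU is not practical, the out-of-range suspicion can also be checked on the data alone. The sketch below scans a batch of token ids for values outside the vocabulary; the function name is hypothetical, and the toy vocabulary size of 10 is only for illustration (the real value would come from the model config). The `-100` ignore value follows the GPT-J docstring quoted in the traceback.

```python
def find_out_of_range_ids(batch_ids, vocab_size, ignore_index=None):
    """Return (row, col, token_id) for every id outside [0, vocab_size)."""
    bad = []
    for row, seq in enumerate(batch_ids):
        for col, token_id in enumerate(seq):
            # Labels may legitimately contain the ignore value (e.g. -100).
            if ignore_index is not None and token_id == ignore_index:
                continue
            if not 0 <= token_id < vocab_size:
                bad.append((row, col, token_id))
    return bad


# Toy example: id 10 is invalid for a vocabulary of size 10.
labels = [[1, 2, 10], [-100, 3, 4]]
print(find_out_of_range_ids(labels, vocab_size=10, ignore_index=-100))
```

For `input_ids`, leave `ignore_index` unset so that negative values are flagged as well.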

younesbelkada, Feb 12 '23 09:02

Hello, here is the link to the Colab notebook. Unfortunately, my system has insufficient RAM to load the model for training on the CPU; for some reason peft requires much more system RAM than VRAM (I'm not sure whether this is intentional or a bug). Also, if it's any help, my transformers version is 4.26.1 and my peft version is 0.1.0.dev0. The code I am using to initialize the model with peft is below:

from peft import LoraConfig, TaskType, get_peft_model


def print_trainable_parameters(model):
    """Prints the number of trainable parameters in the model."""
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


# Only target_modules survives from the original post; any other arguments are not shown.
lora_config = LoraConfig(
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, lora_config)
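
The training log earlier in the thread reports 7340032 trainable parameters. As a sanity check, that figure can be reproduced by hand. The sketch assumes GPT-J-6B's published shape (28 transformer blocks, hidden size 4096, square bias-free `q_proj` and `v_proj`) and a LoRA rank of 16. The rank is an inference from the number, not something stated in this thread, and the helper name is hypothetical.

```python
def lora_trainable_params(r, in_features, out_features, n_modules):
    """Each adapted Linear gains A (r x in_features) and B (out_features x r)."""
    return n_modules * r * (in_features + out_features)


n_layers = 28          # assumed GPT-J-6B depth
hidden = 4096          # assumed GPT-J-6B hidden size
targets_per_layer = 2  # q_proj and v_proj

for r in (8, 16, 32):
    total = lora_trainable_params(r, hidden, hidden, n_layers * targets_per_layer)
    print(f"r={r}: {total} trainable parameters")
```

Under these assumptions only r=16 matches the logged count, which suggests the adapters were attached to the intended modules.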

JohnnyRacer, Feb 12 '23 10:02

Hello, I'm able to run it with INT8 training without any issues. Here is the colab notebook:
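
Separately from the notebook, a rough back-of-the-envelope estimate may help explain why INT8 loading matters when memory is tight. The sketch assumes about 6 billion parameters for GPT-J and counts weights only; activations, gradients, optimizer state, and any temporary copies made while loading are excluded. The function name is hypothetical.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


n_params = 6_000_000_000  # assumed rough parameter count for GPT-J

for name, nbytes in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gib(n_params, nbytes):.2f} GiB")
```

If the checkpoint is first materialized in full precision in system RAM before being moved to the GPU, peak RAM usage could plausibly exceed VRAM usage, which would be consistent with the observation earlier in the thread. That explanation is a guess, not something the thread confirms.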

pacman100, Feb 14 '23 08:02

Hi @pacman100 ,

In the Colab that you shared, 'No log' appears for the training loss. Why does that happen?
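
A possible explanation, not confirmed in this thread: in notebooks, the Trainer progress table shows 'No log' in the training-loss column when an evaluation runs before the first training-loss entry has been logged. `logging_steps` defaults to 500 in `TrainingArguments`, so a short run that evaluates before step 500 would show it. The sketch below uses a hypothetical helper to express that condition.

```python
def has_training_loss(eval_step, logging_steps=500):
    """True if a training-loss entry exists by the time of this evaluation."""
    return eval_step >= logging_steps


print(has_training_loss(eval_step=100))                    # False
print(has_training_loss(eval_step=100, logging_steps=10))  # True
```

Under that assumption, passing a smaller `logging_steps` to `TrainingArguments` should make the training loss appear in the table.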


milyiyo, Mar 29 '23 16:03

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions[bot], Apr 23 '23 15:04