CUDA Error when fine-tuning GPT-J for CausalLM
Hello, I am trying to fine-tune GPT-J for text generation by adapting this notebook. However, when I run trainer.train() I get a CUDA error:
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Judging from the traceback, the error originates from ./peft/src/peft/tuners/lora.py, line 277. Any ideas why this is happening or how to fix it?
The full training log and traceback are below:
***** Running training *****
Num examples = 36139
Num Epochs = 6
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 32
Gradient Accumulation steps = 4
Total optimization steps = 6774
Number of trainable parameters = 7340032
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[14], line 1
----> 1 trainer.train()
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1543, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1538 self.model_wrapped = self.model
1540 inner_training_loop = find_executable_batch_size(
1541 self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
1542 )
-> 1543 return inner_training_loop(
1544 args=args,
1545 resume_from_checkpoint=resume_from_checkpoint,
1546 trial=trial,
1547 ignore_keys_for_eval=ignore_keys_for_eval,
1548 )
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/accelerate/utils/memory.py:124, in find_executable_batch_size.<locals>.decorator(*args, **kwargs)
122 raise RuntimeError("No executable batch size found, reached zero.")
123 try:
--> 124 return function(batch_size, *args, **kwargs)
125 except Exception as e:
126 if should_reduce_batch_size(e):
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:1791, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1789 tr_loss_step = self.training_step(model, inputs)
1790 else:
-> 1791 tr_loss_step = self.training_step(model, inputs)
1793 if (
1794 args.logging_nan_inf_filter
1795 and not is_torch_tpu_available()
1796 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
1797 ):
1798 # if loss is nan or inf simply add the average of previous logged losses
1799 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2539, in Trainer.training_step(self, model, inputs)
2536 return loss_mb.reduce_mean().detach().to(self.args.device)
2538 with self.compute_loss_context_manager():
-> 2539 loss = self.compute_loss(model, inputs)
2541 if self.args.n_gpu > 1:
2542 loss = loss.mean() # mean() to average on multi-gpu parallel training
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/trainer.py:2571, in Trainer.compute_loss(self, model, inputs, return_outputs)
2569 else:
2570 labels = None
-> 2571 outputs = model(**inputs)
2572 # Save past state if it exists
2573 # TODO: this needs to be fixed and made cleaner later.
2574 if self.args.past_index >= 0:
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/nlp/peft/src/peft/peft_model.py:502, in PeftModelForCausalLM.forward(self, input_ids, attention_mask, inputs_embeds, labels, output_attentions, output_hidden_states, return_dict, **kwargs)
490 def forward(
491 self,
492 input_ids=None,
(...)
499 **kwargs,
500 ):
501 if not isinstance(self.peft_config, PromptLearningConfig):
--> 502 return self.base_model(
503 input_ids=input_ids,
504 attention_mask=attention_mask,
505 inputs_embeds=inputs_embeds,
506 labels=labels,
507 output_attentions=output_attentions,
508 output_hidden_states=output_hidden_states,
509 return_dict=return_dict,
510 **kwargs,
511 )
513 batch_size = input_ids.shape[0]
514 if attention_mask is not None:
515 # concat prompt attention mask
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:813, in GPTJForCausalLM.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
805 r"""
806 labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
807 Labels for language modeling. Note that the labels **are shifted** inside the model, i.e. you can set
808 `labels = input_ids` Indices are selected in `[-100, 0, ..., config.vocab_size]` All labels set to `-100`
809 are ignored (masked), the loss is only computed for labels in `[0, ..., config.vocab_size]`
810 """
811 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
--> 813 transformer_outputs = self.transformer(
814 input_ids,
815 past_key_values=past_key_values,
816 attention_mask=attention_mask,
817 token_type_ids=token_type_ids,
818 position_ids=position_ids,
819 head_mask=head_mask,
820 inputs_embeds=inputs_embeds,
821 use_cache=use_cache,
822 output_attentions=output_attentions,
823 output_hidden_states=output_hidden_states,
824 return_dict=return_dict,
825 )
826 hidden_states = transformer_outputs[0]
828 # Set device for model parallelism
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:660, in GPTJModel.forward(self, input_ids, past_key_values, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
656 return module(*inputs, use_cache, output_attentions)
658 return custom_forward
--> 660 outputs = torch.utils.checkpoint.checkpoint(
661 create_custom_forward(block),
662 hidden_states,
663 None,
664 attention_mask,
665 head_mask[i],
666 )
667 else:
668 outputs = block(
669 hidden_states,
670 layer_past=layer_past,
(...)
674 output_attentions=output_attentions,
675 )
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:249, in checkpoint(function, use_reentrant, *args, **kwargs)
246 raise ValueError("Unexpected keyword arguments: " + ",".join(arg for arg in kwargs))
248 if use_reentrant:
--> 249 return CheckpointFunction.apply(function, preserve, *args)
250 else:
251 return _checkpoint_without_reentrant(
252 function,
253 preserve,
254 *args,
255 **kwargs,
256 )
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/utils/checkpoint.py:107, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
104 ctx.save_for_backward(*tensor_inputs)
106 with torch.no_grad():
--> 107 outputs = run_function(*args)
108 return outputs
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:656, in GPTJModel.forward.<locals>.create_custom_forward.<locals>.custom_forward(*inputs)
654 def custom_forward(*inputs):
655 # None for past_key_value
--> 656 return module(*inputs, use_cache, output_attentions)
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:302, in GPTJBlock.forward(self, hidden_states, layer_past, attention_mask, head_mask, use_cache, output_attentions)
300 residual = hidden_states
301 hidden_states = self.ln_1(hidden_states)
--> 302 attn_outputs = self.attn(
303 hidden_states,
304 layer_past=layer_past,
305 attention_mask=attention_mask,
306 head_mask=head_mask,
307 use_cache=use_cache,
308 output_attentions=output_attentions,
309 )
310 attn_output = attn_outputs[0] # output_attn: a, present, (attentions)
311 outputs = attn_outputs[1:]
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/transformers/models/gptj/modeling_gptj.py:203, in GPTJAttention.forward(self, hidden_states, attention_mask, layer_past, head_mask, use_cache, output_attentions)
190 def forward(
191 self,
192 hidden_states: Optional[torch.FloatTensor],
(...)
200 Optional[Tuple[torch.Tensor, Tuple[torch.Tensor], Tuple[torch.Tensor, ...]]],
201 ]:
--> 203 query = self.q_proj(hidden_states)
204 key = self.k_proj(hidden_states)
205 value = self.v_proj(hidden_states)
File ~/miniconda3/envs/ltstrsf/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/nlp/peft/src/peft/tuners/lora.py:277, in Linear.forward(self, x)
275 def forward(self, x: torch.Tensor):
276 if self.r > 0 and not self.merged:
--> 277 result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
278 if self.r > 0:
279 result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
I can also confirm that fine-tuning a Seq2SeqLM model works fine, and there are no problems when I test PyTorch operations that use cuBLAS, such as the following snippet:
import torch
import numpy as np
import time

flatten_masks = np.random.random((800, 60800))
flatten_masks = torch.from_numpy(flatten_masks).cuda(device=0)
print()
t1 = time.time()
i = 0
while i < 2500:
    if i == 500:
        # Reset the timer after 500 warm-up iterations
        t1 = time.time()
    # Large matrix multiply on the GPU exercises cuBLAS directly
    inter_matrix = torch.mm(flatten_masks, flatten_masks.transpose(1, 0))
    i += 1
t2 = time.time()
print(t2 - t1)
Hi @JohnnyRacer, thanks for your interest in this! Would you be able to share the full notebook with us so that we can try to reproduce? Also, can you run the script on CPU first? I suspect there is some label mismatching / index out of range at some point (see a similar issue here), i.e. something badly configured when initializing the LoRA layers.
Maybe you can just share the way you have initialized your model with peft?
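For what it's worth, here is a minimal sketch of the kind of CPU check meant here. The model name and tokenizer setup are placeholders, not the exact notebook: the idea is that a single forward pass on CPU usually surfaces a readable Python error (for example an index-out-of-range in the embedding) instead of an opaque cuBLAS failure, and the last line is a quick consistency check between tokenizer and model vocab. Note that loading GPT-J on CPU needs a lot of system RAM.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Hypothetical setup -- adjust to match the actual notebook.
model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # stays on CPU

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# A single forward pass on CPU should raise a readable error if labels or
# token ids are out of range for the embedding / vocab size.
batch = tokenizer("hello world", return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
print(out.loss)

# Quick consistency check between tokenizer and model vocab.
print(len(tokenizer), model.config.vocab_size)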
@younesbelkada
Hello, here is the link to the Colab notebook. Unfortunately my system has insufficient RAM to load the model for training on the CPU; for some reason peft requires much more system RAM than VRAM (not sure if this is intentional or a bug). Also, if it's any help, my transformers version is 4.26.1 and peft is version 0.1.0.dev0. The code I am using to initialize the model with peft is below:
from peft import LoraConfig, get_peft_model, TaskType


def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )


lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
print_trainable_parameters(model)
Hello, I'm able to run it with INT8 training without any issues. Here is the Colab notebook: https://colab.research.google.com/drive/1mYBWjqwHDRGz0cxlmV4IyGOr9CxqN9MF?usp=sharing
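For readers finding this later, the INT8 setup referred to above generally looks something like the sketch below. This is a hedged outline, not the exact notebook: it assumes bitsandbytes is installed and a peft version from that era, where the helper was called prepare_model_for_int8_training (later renamed prepare_model_for_kbit_training), and the model name is a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model_name = "EleutherAI/gpt-j-6B"  # placeholder, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the base model in 8-bit and let accelerate place it on the GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto",
)

# Prepares the 8-bit model for training (freezes base weights, casts some
# layers to fp32, enables input grads for gradient checkpointing) before
# attaching the LoRA adapters.
model = prepare_model_for_int8_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)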
Hi @pacman100,
In the Colab that you shared, the training loss shows as 'No log'. Why does that happen?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.