
Title: CUDA RuntimeError: Unspecified Launch Failure during Training

Open Hongjie1Chu opened this issue 1 year ago • 10 comments

System Info

  • transformers version: 4.41.0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @younesbelkada @muellerzr

Why does this error occur when passing a custom device_map? The map I wrote only differs from the auto-generated map in device order. Why does this cause an error? Does the device order affect the execution results?

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset

def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help="the name of model", default='Llama2')
    parser.add_argument('--bs', type=int, help="the batch size", default=4)

args = parser.parse_args()

# Step 1: Define the model
tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Atom-7B-Chat')
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

device_map = {
    'model.embed_tokens': 6,
    'model.layers.0': 6,
    'model.layers.1': 4,
    'model.layers.2': 1,
    'model.layers.3': 1,
    'model.layers.4': 1,
    'model.layers.5': 0,
    'model.layers.6': 0,
    'model.layers.7': 0,
    'model.layers.8': 0,
    'model.layers.9': 0,
    'model.layers.10': 6,
    'model.layers.11': 5,
    'model.layers.12': 5,
    'model.layers.13': 5,
    'model.layers.14': 5,
    'model.layers.15': 5,
    'model.layers.16': 4,
    'model.layers.17': 4,
    'model.layers.18': 4,
    'model.layers.19': 4,
    'model.layers.20': 3,
    'model.layers.21': 3,
    'model.layers.22': 3,
    'model.layers.23': 3,
    'model.layers.24': 3,
    'model.layers.25': 2,
    'model.layers.26': 2,
    'model.layers.27': 2,
    'model.layers.28': 2,
    'model.layers.29': 2,
    'model.layers.30': 1,
    'model.layers.31': 1,
    "model.norm.weight": 1,
    "lm_head": 6,
}

model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map, num_labels=2)

print(model)
print(model.hf_device_map)

print("gpt start train")

# Step 4: Load the dataset
data_files = {
    'train': '/mnt/glue_mrpc/train.jsonl',
    'test': '/mnt/glue_mrpc/test.jsonl',
    'validation': '/mnt/glue_mrpc/validation.jsonl'
}
raw_datasets = load_dataset('json', data_files=data_files)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Step 5: Train the model
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=5,
    per_device_train_batch_size=args.bs,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

print('start train')
trainer.train()

Expected behavior

I want to know if the device order in the device_map affects the results.
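
For reference, here is a minimal sketch (not part of the original report) of how accelerate's infer_auto_device_map can generate the automatic placement so its device ordering can be compared with the hand-written map above. The model name is taken from the script; the no_split_module_classes value and dtype are assumptions for a Llama-style checkpoint.

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton on the meta device so no real memory is allocated.
config = AutoConfig.from_pretrained('FlagAlpha/Atom-7B-Chat')
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Let accelerate propose a placement; decoder layers are kept whole so that
# no single layer is split across devices.
auto_map = infer_auto_device_map(
    empty_model,
    no_split_module_classes=["LlamaDecoderLayer"],
    dtype=torch.float16,
)
print(auto_map)  # compare this ordering with the custom map above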

Hongjie1Chu avatar May 20 '24 11:05 Hongjie1Chu

And when I set:

device_map["model.embed_tokens"] = 0
device_map["model.norm.weight"] = 0

it does not error at startup, but it fails later during training (see the attached screenshot of the error).

Hongjie1Chu avatar May 20 '24 11:05 Hongjie1Chu

Hi @Hongjie1Chu! In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with CUDA_LAUNCH_BLOCKING=1? Also, do you run your training script with accelerate launch xxx or python xxx.py?
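
(A minimal sketch of setting the variable from inside the script instead of on the command line; the script name train.py is illustrative. The variable must be set before the first CUDA call so that kernel launches become synchronous and the failing kernel is reported at its actual call site.)

# Equivalent to running `CUDA_LAUNCH_BLOCKING=1 python train.py`
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before any CUDA context is created

import torch  # torch (and all CUDA work) is imported only after the variable is set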

younesbelkada avatar May 21 '24 07:05 younesbelkada

I am facing a similar issue too. I haven't made any changes to my code, but it suddenly started giving this error after training for about 30 steps.

Sharan1712 avatar May 21 '24 17:05 Sharan1712

Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0 and it is working fine now.

Sharan1712 avatar May 22 '24 15:05 Sharan1712

Thanks for your answer!

Hongjie1Chu avatar May 23 '24 05:05 Hongjie1Chu

Has there been a solution for this yet? I tried the latest version of transformers and it still gives this error. I want to use some of the new quantization methods.

Sharan1712 avatar Jun 06 '24 17:06 Sharan1712

@ArthurZucker @younesbelkada @muellerzr

Sharan1712 avatar Jun 10 '24 08:06 Sharan1712

Hi! It is hard for us to debug without a proper error trace. Can you re-run the training script with CUDA_LAUNCH_BLOCKING=1 and paste the error trace here?

younesbelkada avatar Jun 10 '24 10:06 younesbelkada

I believe I'm seeing the same issue with peft 0.11.1 and transformers 4.41.2 (both installed from conda-forge).

When I rerun with CUDA_LAUNCH_BLOCKING=1 I get:

RuntimeError                              Traceback (most recent call last)
Cell In[16], line 20
      5 trainer = SFTTrainer(
      6     model=model,
      7     train_dataset=full_doc_dataset,
   (...)
     15     compute_metrics=lambda eval_pred: compute_metrics(eval_pred, tokenizer)  # Pass tokenizer here
     16 )
     18 model = accelerator.prepare(model)
---> 20 trainer.train()

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/trl/trainer/sft_trainer.py:440, in SFTTrainer.train(self, *args, **kwargs)
    437 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:
    438     self.model = self._trl_activate_neftune(self.model)
--> 440 output = super().train(*args, **kwargs)
    442 # After training we make sure to retrieve back the original forward pass method
    443 # for the embedding layer by removing the forward post hook.
    444 if self.neftune_noise_alpha is not None and not self._trainer_supports_neftune:

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:1885, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1883         hf_hub_utils.enable_progress_bars()
   1884 else:
-> 1885     return inner_training_loop(
   1886         args=args,
   1887         resume_from_checkpoint=resume_from_checkpoint,
   1888         trial=trial,
   1889         ignore_keys_for_eval=ignore_keys_for_eval,
   1890     )

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:2216, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2213     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2215 with self.accelerator.accumulate(model):
-> 2216     tr_loss_step = self.training_step(model, inputs)
   2218 if (
   2219     args.logging_nan_inf_filter
   2220     and not is_torch_xla_available()
   2221     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2222 ):
   2223     # if loss is nan or inf simply add the average of previous logged losses
   2224     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/transformers/trainer.py:3241, in Trainer.training_step(***failed resolving arguments***)
   3238     loss = self.compute_loss(model, inputs)
   3240 del inputs
-> 3241 torch.cuda.empty_cache()
   3243 if self.args.n_gpu > 1:
   3244     loss = loss.mean()  # mean() to average on multi-gpu parallel training

File ~/.conda/envs/tl397_2/lib/python3.12/site-packages/torch/cuda/memory.py:162, in empty_cache()
    151 r"""Release all unoccupied cached memory currently held by the caching
    152 allocator so that those can be used in other GPU application and visible in
    153 `nvidia-smi`.
   (...)
    159     more details about GPU memory management.
    160 """
    161 if is_initialized():
--> 162     torch._C._cuda_emptyCache()

RuntimeError: CUDA error: unspecified launch failure
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

tlangfor avatar Jun 27 '24 11:06 tlangfor

cc @BenjaminBossan Are you the best person to ping for PEFT now?

amyeroberts avatar Jun 28 '24 18:06 amyeroberts

Hmm, I don't see how this is PEFT related; there is no PEFT code being used. Are you sure that the upgrade/downgrade of PEFT has any influence on the outcome and that it's not caused by transformers?

BenjaminBossan avatar Jul 01 '24 11:07 BenjaminBossan

@BenjaminBossan Sorry, I was just skimming, saw peft mentioned and pinged you :)

Re SFTTrainer, perhaps @SunMarc is the best person here?

amyeroberts avatar Jul 01 '24 19:07 amyeroberts

Gentle ping @SunMarc

amyeroberts avatar Aug 20 '24 09:08 amyeroberts

Hi @Hongjie1Chu, I tried running your code with the current transformers & accelerate versions, but I run into this error:

 File "~/miniconda3/envs/dev/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 211, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:6! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Can you try on your side?
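
A hypothetical check (not part of the original comment) that can help narrow this down: list every parameter and buffer that did not land on a CUDA device after from_pretrained(..., device_map=...). Non-persistent buffers such as the rotary embedding's inv_freq are easy to leave behind on CPU with a hand-written map. report_non_cuda_tensors is an illustrative helper name.

def report_non_cuda_tensors(model):
    # Print any tensor that is still on CPU (or meta) after dispatching the model.
    for name, param in model.named_parameters():
        if param.device.type != "cuda":
            print("param ", name, param.device)
    for name, buf in model.named_buffers():
        if buf.device.type != "cuda":
            print("buffer", name, buf.device)

report_non_cuda_tensors(model)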

MekkCyber avatar Sep 30 '24 06:09 MekkCyber

I think #33742 should fix it

ArthurZucker avatar Oct 03 '24 12:10 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 28 '24 08:10 github-actions[bot]