
Trainer predict method throws out of memory error on GPT2 during testing

Open · kirk86 opened this issue 1 year ago • 3 comments

System Info

transformers version: 4.39.2
Python version: 3.12
Platform: Linux

Who can help?

No response

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
gpt2model = AutoModelForCausalLM.from_pretrained("gpt2")
# causal-LM collator: labels are the input ids, no masked-LM objective
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

dataset = load_dataset('xsum')  # something like xsum but custom, with a text and a summary field

# preprocess the data for summarization: concatenate document and summary
def preprocess(example):
    return tokenizer(example['text'] + " TL;DR " + example['summary'])

def data2equal_size_tokens(data):
    # split tokens into equal context-size chunks (sketched below, after the Trainer setup)
    return tokenized_data_chunks

args = TrainingArguments(
    output_dir="gpt2_checkpoints",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    evaluation_strategy="epoch",
    eval_steps=1,
    logging_steps=1,
    gradient_accumulation_steps=8,
    num_train_epochs=100,
    weight_decay=0.1,
    warmup_steps=1_000,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_total_limit=3,
    overwrite_output_dir=True,
)

trainer = Trainer(
    model=gpt2model,
    tokenizer=tokenizer,
    args=args,
    data_collator=data_collator,
    train_dataset=tokenized_data_chunks["train"],
    eval_dataset=tokenized_data_chunks["valid"],
) 
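
For reference, the chunking step elided in data2equal_size_tokens follows the usual concatenate-then-split pattern; a minimal sketch, where block_size = 128 and the two map calls are assumptions about my custom setup:

block_size = 128  # assumed reduced context size; GPT-2's default is 1024

def group_texts(examples):
    # concatenate each tokenized column, then slice it into equal-sized blocks,
    # dropping the remainder so every chunk is exactly block_size tokens long
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

tokenized = dataset.map(preprocess, remove_columns=dataset["train"].column_names)
tokenized_data_chunks = tokenized.map(group_texts, batched=True)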

Training appears to proceed, although at some point the loss stops decreasing.

After training completes, calling trainer.predict(tokenized_data_chunks['test']) throws the following error:

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[63], line 1
----> 1 results = trainer.predict(tokenized_data_ctx_chunks['test'])

File /python3.12/site-packages/transformers/trainer.py:3441, in Trainer.predict(self, test_dataset, ignore_keys, metric_key_prefix)
   3438 start_time = time.time()
   3440 eval_loop = self.prediction_loop if self.args.use_legacy_prediction_loop else self.evaluation_loop
-> 3441 output = eval_loop(
   3442     test_dataloader, description="Prediction", ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix
   3443 )
   3444 total_batch_size = self.args.eval_batch_size * self.args.world_size
   3445 if f"{metric_key_prefix}_jit_compilation_time" in output.metrics:

File /python3.12/site-packages/transformers/trainer.py:3580, in Trainer.evaluation_loop(self, dataloader, description, prediction_loss_only, ignore_keys, metric_key_prefix)
   3578         logits = self.preprocess_logits_for_metrics(logits, labels)
   3579     logits = self.gather_function((logits))
-> 3580     preds_host = logits if preds_host is None else nested_concat(preds_host, logits, padding_index=-100)
   3582 if labels is not None:
   3583     labels = self.gather_function((labels))

File /python3.12/site-packages/transformers/trainer_pt_utils.py:140, in nested_concat(tensors, new_tensors, padding_index)
    138     return type(tensors)(nested_concat(t, n, padding_index=padding_index) for t, n in zip(tensors, new_tensors))
    139 elif isinstance(tensors, torch.Tensor):
--> 140     return torch_pad_and_concatenate(tensors, new_tensors, padding_index=padding_index)
    141 elif isinstance(tensors, Mapping):
    142     return type(tensors)(
    143         {k: nested_concat(t, new_tensors[k], padding_index=padding_index) for k, t in tensors.items()}
    144     )

File /python3.12/site-packages/transformers/trainer_pt_utils.py:99, in torch_pad_and_concatenate(tensor1, tensor2, padding_index)
     96 tensor2 = atleast_1d(tensor2)
     98 if len(tensor1.shape) == 1 or tensor1.shape[1] == tensor2.shape[1]:
---> 99     return torch.cat((tensor1, tensor2), dim=0)
    101 # Let's figure out the new shape
    102 new_shape = (tensor1.shape[0] + tensor2.shape[0], max(tensor1.shape[1], tensor2.shape[1])) + tensor1.shape[2:]

OutOfMemoryError: CUDA out of memory. Tried to allocate 10.74 GiB. GPU 0 has a total capacity of 23.69 GiB of which 2.09 GiB is free. Including non-PyTorch memory, this process has 21.59 GiB memory in use. Of the allocated memory 12.16 GiB is allocated by PyTorch, and 9.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Expected behavior

Expected to work without issues.

There are a number of other issues related to this error.

For instance, adding compute_metrics to the trainer produces an OOM error during training. Reducing config.n_ctx, config.n_positions, or tokenizer.model_max_length from 1024 to 128 doesn't change anything. Adding preprocess_logits_for_metrics avoids the OOM error during training, but then training seems to stagnate: all the metrics plateau at some point and never recover.
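
The preprocess_logits_for_metrics I pass is along these lines (a sketch; the idea is to shrink the full vocab-sized logits to predicted token ids before the Trainer accumulates them):

def preprocess_logits_for_metrics(logits, labels):
    # some models return a tuple; the first element holds the logits
    if isinstance(logits, tuple):
        logits = logits[0]
    # keep only predicted token ids, not the (batch, seq, vocab) logits
    return logits.argmax(dim=-1)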

Adding tokens to the tokenizer before tokenizing the data and training also leads to errors. For instance, one can add tokens with tokenizer.add_special_tokens({'pad_token': '<|pad|>', 'sep_token': '<|sep|>', 'bos_token': '<|startoftext|>'}). Training with trainer.train() then works, but only without specifying compute_metrics in the trainer. Once training is over, calling model.generate raises an error about max_new_tokens not being set, since our context length is 128 instead of the model's original 1024.
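
(For completeness: after adding tokens, the model's embedding matrix also has to grow to match the new vocabulary, e.g.:)

tokenizer.add_special_tokens(
    {'pad_token': '<|pad|>', 'sep_token': '<|sep|>', 'bos_token': '<|startoftext|>'}
)
# resize the embedding matrix to cover the newly added token ids
gpt2model.resize_token_embeddings(len(tokenizer))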

Besides that, there's a warning message stating the following:

A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
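
If I understand the warning, it presumably just wants the tokenizer loaded with left padding, something like:

# decoder-only models should pad on the left so generation continues from real tokens
tokenizer = AutoTokenizer.from_pretrained("gpt2", padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token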

Did I do something wrong when preprocessing the data? Do we need to add a bos_token during preprocessing? Is the format I'm using correct for summarization?

kirk86 avatar Apr 04 '24 16:04 kirk86

cc @muellerzr, might be specific to the predict function running on the whole test set -> bound to run OOM?
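
(A possible mitigation, untested: have the Trainer offload accumulated predictions to the CPU every few eval steps instead of keeping everything on the GPU, e.g.:)

# sketch: move accumulated prediction tensors to CPU every 16 eval steps
args = TrainingArguments(
    output_dir="gpt2_checkpoints",
    per_device_eval_batch_size=8,
    eval_accumulation_steps=16,
)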

ArthurZucker avatar Apr 05 '24 11:04 ArthurZucker

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 05 '24 08:05 github-actions[bot]

I believe https://github.com/huggingface/transformers/pull/28769 implemented a fix!

muellerzr avatar May 06 '24 13:05 muellerzr

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 31 '24 08:05 github-actions[bot]