BlenderBot-Distil-400M training fails if the input or target length exceeds a certain threshold, even when truncation and padding are enabled
System Info
- transformers version: 4.20.1, 4.21.0
- Platform: Linux
- Python version: 3.7.6
- Huggingface_hub version: 0.8.1
- PyTorch version (GPU?): 1.10.2 (Yes)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes (2+ Tesla V100)
- Using distributed or parallel set-up in script?: No
Who can help?
@patil-suraj
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Run the following script with `python script_blenderbot_length.py`:
# The contents of script_blenderbot_length.py
# To make the code crash, set CRITICAL_NUMBER=64
# To make it pass, set CRITICAL_NUMBER=63
# The code fails if EITHER the input or the target is repeated 64+ times.
from __future__ import annotations
import functools
import typing as tp
import datasets
import transformers
from transformers import (
    DataCollatorForSeq2Seq,
    PreTrainedTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)
CRITICAL_NUMBER = 64
increment_en = [
    {"input": "One", "target": "Two"},
    {"input": "Three " * 2, "target": "Four " * 2},
    {"input": "Five " * 4, "target": "Six " * 4},
    {"input": "Seven " * 8, "target": "Eight " * 8},
    {"input": "Nine " * CRITICAL_NUMBER, "target": "Ten " * CRITICAL_NUMBER},
]
increment_en = increment_en * 100
def lod_to_dol(list_of_dicts: tp.List[tp.Dict[str, tp.Any]]) -> tp.Dict[str, list]:
    dict_of_lists = {
        key: [dct[key] for dct in list_of_dicts] for key in list_of_dicts[0]
    }
    return dict_of_lists
increment_en = lod_to_dol(increment_en)
def preprocess_function_(
    examples,
    tokenizer: PreTrainedTokenizer,
    max_input_length: int,
    max_target_length: int,
):
    inputs = examples["input"]
    targets = examples["target"]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
def main():
    tokenizer = transformers.BlenderbotTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
    model = transformers.BlenderbotForConditionalGeneration.from_pretrained("facebook/blenderbot-400M-distill")
    args = Seq2SeqTrainingArguments(
        "script_debug",
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        fp16=True,
        push_to_hub=False,
        max_steps=10000,
        logging_steps=5000,
        save_steps=5000,
    )
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
    dataset = datasets.DatasetDict(
        {
            "train": datasets.Dataset.from_dict(increment_en),
            "test": datasets.Dataset.from_dict(increment_en),
        }
    )
    preprocess_function = functools.partial(
        preprocess_function_,
        tokenizer=tokenizer,
        max_input_length=512,
        max_target_length=512,
    )
    processed_ds = dataset.map(preprocess_function, batched=True)
    processed_ds.set_format(
        type="torch", columns=["input_ids", "attention_mask", "labels"]
    )
    trainer = Seq2SeqTrainer(
        model,
        args,
        train_dataset=processed_ds["train"],
        eval_dataset=processed_ds["test"],
        data_collator=data_collator,
        tokenizer=tokenizer,
    )
    trainer.train()


if __name__ == "__main__":
    main()
Running the code with CRITICAL_NUMBER set to 64 or greater leads to this bizarre series of CUDA asserts:
<Similar messages appear above, which are omitted for brevity>
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811797118/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [2,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
0%| | 0/10000 [00:07<?, ?it/s]
root@bolt-imq45r3c3y-8dfzr73qqa:/mnt/task_runtime# python script_blenderbot_length.py
100%|██████████████████████████| 1/1 [00:00<00:00, 5.30ba/s]
100%|██████████████████████████| 1/1 [00:00<00:00, 5.72ba/s]
max_steps is given, it will override any value given in num_train_epochs
Using cuda_amp half precision backend
The following columns in the training set don't have a corresponding argument in `BlenderbotForConditionalGeneration.forward` and have been ignored: target, input. If target, input are not expected by `BlenderbotForConditionalGeneration.forward`, you can safely ignore this message.
/miniconda/lib/python3.7/site-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 500
Num Epochs = 313
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 10000
0%|          | 0/10000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "script_blenderbot_length.py", line 101, in <module>
main()
File "script_blenderbot_length.py", line 97, in main
trainer.train()
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1502, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 2470, in training_step
loss = self.compute_loss(model, inputs)
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 2502, in compute_loss
outputs = model(**inputs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/miniconda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/miniconda/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/miniconda/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 1340, in forward
return_dict=return_dict,
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 1181, in forward
return_dict=return_dict,
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 785, in forward
output_attentions=output_attentions,
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 318, in forward
output_attentions=output_attentions,
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 180, in forward
query_states = self.q_proj(hidden_states) * self.scaling
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`
Expected behavior
The training code should not crash, especially when the tokenized sequences are far shorter than the max_length=512 truncation limit.
Adding `padding=True` when tokenizing both the inputs and the targets does not fix the issue.
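Concretely, "adding `padding=True`" means changing the tokenizer calls in preprocess_function_ to roughly the following (on top of the dynamic padding already done by DataCollatorForSeq2Seq); the crash is identical:

    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)
    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(targets, max_length=max_target_length, truncation=True, padding=True)
    model_inputs["labels"] = labels["input_ids"]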
Also, when running the script using the CPU only, I get this error:
root@pc:~ # CUDA_VISIBLE_DEVICES="" python script_blenderbot_length.py
100%|██████████████████████████| 1/1 [00:00<00:00, 4.95ba/s]
100%|██████████████████████████| 1/1 [00:00<00:00, 5.46ba/s]
max_steps is given, it will override any value given in num_train_epochs
The following columns in the training set don't have a corresponding argument in `BlenderbotForConditionalGeneration.forward` and have been ignored: target, input. If target, input are not expected by `BlenderbotForConditionalGeneration.forward`, you can safely ignore this message.
/miniconda/lib/python3.7/site-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 500
Num Epochs = 80
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 4
Gradient Accumulation steps = 1
Total optimization steps = 10000
0%| | 0/10000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "script_blenderbot_length.py", line 103, in <module>
main()
File "script_blenderbot_length.py", line 99, in main
trainer.train()
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1502, in train
ignore_keys_for_eval=ignore_keys_for_eval,
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 1740, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 2470, in training_step
loss = self.compute_loss(model, inputs)
File "/miniconda/lib/python3.7/site-packages/transformers/trainer.py", line 2502, in compute_loss
outputs = model(**inputs)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 1340, in forward
return_dict=return_dict,
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 1181, in forward
return_dict=return_dict,
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 738, in forward
embed_pos = self.embed_positions(input_shape)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/miniconda/lib/python3.7/site-packages/transformers/models/blenderbot/modeling_blenderbot.py", line 125, in forward
return super().forward(positions)
File "/miniconda/lib/python3.7/site-packages/torch/nn/modules/sparse.py", line 160, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2044, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
0%| | 0/10000 [00:00<?, ?it/s]
I've found out why the error appears. I modified `BlenderbotLearnedPositionalEmbedding.forward` in transformers/src/transformers/models/blenderbot/modeling_blenderbot.py (around line 125):
        positions = torch.arange(
            past_key_values_length, past_key_values_length + seq_len, dtype=torch.long, device=self.weight.device
        )
+       print(positions)
+       print(self.weight.shape)
        return super().forward(positions)
When running the script, I get this in the output:
tensor([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41,
42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55,
56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,
70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129])
torch.Size([128, 1280])
Clearly, the position indices go beyond the range of the positional embedding table, which only has 128 rows. The question is why... Perhaps this can be configured in the constructor?
The length of the positions seems to be equal to 2*CRITICAL_NUMBER + 1.
And... it goes to a maximum of the tokenizer's max_length-1, which is expected, I guess.
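A quick sanity check of that observation, run outside the training script (a minimal sketch using the same tokenizer as above):

    from transformers import BlenderbotTokenizer

    tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-400M-distill")
    for n in (1, 8, 63, 64):
        ids = tokenizer("Nine " * n, max_length=512, truncation=True)["input_ids"]
        # Per the observation above the length comes out to 2*n + 1, so n=64
        # already yields 129 tokens -- more than the 128 positions in the table printed above.
        print(n, len(ids))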
Ah. So the issue is that in the BlenderbotConfig, max_position_embeddings is set to 128. The publicly available weights only have position embeddings with those dimensions, so either I'd have to train from scratch or reduce the max tokenizer length to 128.
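In the script above, the second option boils down to replacing the hard-coded 512 with the value from the model config, roughly like this (sketch only):

    # Cap both limits at the model's position-embedding budget instead of 512.
    max_len = model.config.max_position_embeddings  # 128 for facebook/blenderbot-400M-distill
    preprocess_function = functools.partial(
        preprocess_function_,
        tokenizer=tokenizer,
        max_input_length=max_len,
        max_target_length=max_len,
    )

Capping the lengths this way keeps the position indices within the 128-row embedding table.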
But seriously, this exception should be caught and re-raised with a more human-readable message.
(I can contribute a fix after my internship ends, not before)
Catching and re-raising the exception during GPU training doesn't produce a more human-readable message (it's still RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling cublasCreate(handle), but at least the flood of CUDA asserts is gone). Getting a human-readable exception seems to be possible only with CPU-only training.
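One possible shape for such a check, as a sketch of the idea rather than a tested patch: validate the position indices eagerly in `BlenderbotLearnedPositionalEmbedding.forward` (the same method patched above), before the lookup that dies, so a readable error surfaces on both CPU and GPU:

        # Sketch: fail with a readable message before the out-of-range embedding lookup.
        if positions.numel() > 0 and int(positions.max()) >= self.num_embeddings:
            raise ValueError(
                f"Input sequence produces position index {int(positions.max())}, but this model "
                f"only has {self.num_embeddings} learned positions (config.max_position_embeddings). "
                "Truncate or shorten the inputs/targets."
            )
        return super().forward(positions)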
cc @sgugger for usage with the Trainer!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.