llama-recipes
Quickstart notebook breaking
System Info
PyTorch=2.1.0, CUDA=11.8, GPU=A10G (24 GB), Num of GPUs=1
Information
- [x] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
After the recent changes to the Concatenator module and the downstream changes in the dataset modules (the samsum dataset in particular), the HF Trainer throws an error in the quickstart notebook.
Error logs
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[26], line 31
22 trainer = Trainer(
23 model=model,
24 args=training_args,
(...)
27 callbacks=[profiler_callback] if enable_profiler else [],
28 )
30 # Start training
---> 31 trainer.train()
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1553 hf_hub_utils.enable_progress_bars()
1554 else:
-> 1555 return inner_training_loop(
1556 args=args,
1557 resume_from_checkpoint=resume_from_checkpoint,
1558 trial=trial,
1559 ignore_keys_for_eval=ignore_keys_for_eval,
1560 )
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/trainer.py:1838, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
1835 rng_to_sync = True
1837 step = -1
-> 1838 for step, inputs in enumerate(epoch_iterator):
1839 total_batched_samples += 1
1840 if rng_to_sync:
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/accelerate/data_loader.py:451, in DataLoaderShard.__iter__(self)
449 # We iterate one batch ahead to check when we are at the end
450 try:
--> 451 current_batch = next(dataloader_iter)
452 except StopIteration:
453 yield
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
627 if self._sampler_iter is None:
628 # TODO(https://github.com/pytorch/pytorch/issues/76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \
633 self._IterableDataset_len_called is not None and \
634 self._num_yielded > self._IterableDataset_len_called:
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
672 def _next_data(self):
673 index = self._next_index() # may raise StopIteration
--> 674 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
675 if self._pin_memory:
676 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
52 else:
53 data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/data/data_collator.py:70, in default_data_collator(features, return_tensors)
64 # In this function we'll make the assumption that all `features` in the batch
65 # have the same attributes.
66 # So we will look at the first element as a proxy for what attributes exist
67 # on the whole batch.
69 if return_tensors == "pt":
---> 70 return torch_default_data_collator(features)
71 elif return_tensors == "tf":
72 return tf_default_data_collator(features)
File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/data/data_collator.py:136, in torch_default_data_collator(features)
134 batch[k] = torch.tensor(np.stack([f[k] for f in features]))
135 else:
--> 136 batch[k] = torch.tensor([f[k] for f in features])
138 return batch
ValueError: expected sequence of length 312 at dim 1 (got 398)
Expected behavior
I added the following cell to process train_dataset into train_batched before passing it to the trainer, but I am not sure if this is the correct way to go about it:
from llama_recipes.data.concatenator import ConcatDataset
context_length = 2048
train_batched = ConcatDataset(train_dataset, chunk_size=context_length)
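If you go this route, I assume train_batched also has to replace train_dataset in the notebook's Trainer cell, roughly like this (my assumption about the intended wiring, not verified):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_batched,  # chunked dataset instead of the raw train_dataset
    data_collator=default_data_collator,
    callbacks=[profiler_callback] if enable_profiler else [],
)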
I was having the same problem on a 4090, but then ran out of CUDA memory. Adjusting the context_length in the above workaround to context_length = 1024 allowed the training to commence.
Thanks Nomiizz!
Same question. The main QuickStart notebook fails with ValueError: expected sequence of length 312 at dim 1 (got 398)
with recent package versions and the Hugging Face samsum dataset:
datasets 2.14.7
transformers 4.35.2
torch 2.0.1
I have no idea where in the dataset pipeline that error happens.
@Nomiizz, that's a workaround. If you read the code of ConcatDataset, you will find that it concatenates all input_ids and re-splits them into chunks of context_length. For example, with inputs [[1,2,3,4,5],[6,7,8],[9,10,11],[12,13,14,15]], context_length=5 and batch_size=2, the output is:
[1,2,3,4,5]
[6,7,8,9,10]
[11,12,13,14,15]
This means the model will see something like:
<bos>"Summarize this dialog:\n{{dialog1}}\n---\nSummary:\n" + {{summary1}} + "Summarize this dialog:\n{{dialog2}}\n---\nSummary:\n"
This may or may not affect the results, because Llama 2 was trained without padding (I guess they do something like ConcatDataset so all inputs end up the same length and no padding is needed). But for our use case, I think we should stick to padding.
There are many ways to pad. The simplest is to pad everything to a fixed maximum length, such as the max length across all documents, but that wastes GPU compute. The better approach is dynamic padding: pad each batch to the max length within that batch (not the max length of the whole dataset).
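For example, dynamic padding with the HF tokenizer looks roughly like this (assuming the notebook's tokenizer has a pad token configured):

features = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]
# padding="longest" pads only to this batch's max length (3), not the dataset's max.
padded = tokenizer.pad(features, padding="longest", return_tensors="pt")
# padded["input_ids"].shape == (2, 3); an attention_mask marking the padded position is added too.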
The notebook uses default_data_collator, which does not pad. So if you don't use something like ConcatDataset, it will fail to build a tensor from sequences of different lengths.
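You can reproduce the failure in isolation: torch.tensor cannot build a tensor from lists of unequal length, which is exactly the ValueError in the traceback above.

import torch

# Two "samples" of different lengths, as produced by the samsum preprocessing without chunking or padding:
torch.tensor([[1, 2, 3], [4, 5]])
# ValueError: expected sequence of length 3 at dim 1 (got 2)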
A better way is to use DataCollatorWithPadding, but it only pads input_ids and attention_mask; it will not pad labels. So you have to implement your own data collator. You can use this one, adapted from this issue:
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

from transformers import BatchEncoding, PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy


@dataclass
class MyDataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side
            and padding index) among:
            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no
              padding if only a single sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or
              to the maximum acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of
              different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value. This is especially
            useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
            >= 7.5 (Volta).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Pad input_ids and attention_mask according to the chosen strategy.
        padding_features = [
            {key: val for key, val in row.items() if key in ["input_ids", "attention_mask"]}
            for row in features
        ]
        batch = self.tokenizer.pad(
            padding_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=None,
        )
        # Pad labels the same way by treating them as input_ids (note: this fills the padded
        # positions with the tokenizer's pad token id, not -100).
        batch["labels"] = self.tokenizer.pad(
            [{"input_ids": row["labels"]} for row in features],
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=None,
        )["input_ids"]
        # Carry over any remaining keys untouched.
        for row in features:
            for key, value in row.items():
                if key in ["input_ids", "attention_mask", "labels"]:
                    continue
                if key not in batch:
                    batch[key] = []
                batch[key].append(value)
        return BatchEncoding(batch, tensor_type=self.return_tensors)
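One caveat to watch for with this collator: because the labels are padded via tokenizer.pad, the padded label positions carry the pad token id rather than -100, so they are not ignored by the loss unless you mask them to -100 afterwards.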
And change the Trainer setup like this:
data_collator = MyDataCollatorWithPadding(tokenizer=tokenizer,
                                          padding="longest", max_length=4096)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # data_collator=default_data_collator,
    data_collator=data_collator,
    callbacks=[profiler_callback] if enable_profiler else [],
)
I'm encountering the same problem.