
Quickstart notebook breaking

Nomiizz opened this issue 1 year ago • 4 comments

System Info

Pytorch=2.1.0, CUDA=11.8, GPU=a10g (24 GB), Num of GPUs=1

Information

  • [x] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

After the recent changes to the Concatenator module and the downstream changes in the dataset modules (the samsum dataset in particular), the HF trainer throws an error in the quickstart notebook.

Error logs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 31
     22 trainer = Trainer(
     23     model=model,
     24     args=training_args,
   (...)
     27     callbacks=[profiler_callback] if enable_profiler else [],
     28 )
     30 # Start training
---> 31 trainer.train()

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/trainer.py:1555, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1553         hf_hub_utils.enable_progress_bars()
   1554 else:
-> 1555     return inner_training_loop(
   1556         args=args,
   1557         resume_from_checkpoint=resume_from_checkpoint,
   1558         trial=trial,
   1559         ignore_keys_for_eval=ignore_keys_for_eval,
   1560     )

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/trainer.py:1838, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   1835     rng_to_sync = True
   1837 step = -1
-> 1838 for step, inputs in enumerate(epoch_iterator):
   1839     total_batched_samples += 1
   1840     if rng_to_sync:

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/accelerate/data_loader.py:451, in DataLoaderShard.__iter__(self)
    449 # We iterate one batch ahead to check when we are at the end
    450 try:
--> 451     current_batch = next(dataloader_iter)
    452 except StopIteration:
    453     yield

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
    627 if self._sampler_iter is None:
    628     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    629     self._reset()  # type: ignore[call-arg]
--> 630 data = self._next_data()
    631 self._num_yielded += 1
    632 if self._dataset_kind == _DatasetKind.Iterable and \
    633         self._IterableDataset_len_called is not None and \
    634         self._num_yielded > self._IterableDataset_len_called:

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py:674, in _SingleProcessDataLoaderIter._next_data(self)
    672 def _next_data(self):
    673     index = self._next_index()  # may raise StopIteration
--> 674     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    675     if self._pin_memory:
    676         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     52 else:
     53     data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/data/data_collator.py:70, in default_data_collator(features, return_tensors)
     64 # In this function we'll make the assumption that all `features` in the batch
     65 # have the same attributes.
     66 # So we will look at the first element as a proxy for what attributes exist
     67 # on the whole batch.
     69 if return_tensors == "pt":
---> 70     return torch_default_data_collator(features)
     71 elif return_tensors == "tf":
     72     return tf_default_data_collator(features)

File ~/llama_repo/llama-recipes/llama_recipes_venv/lib/python3.10/site-packages/transformers/data/data_collator.py:136, in torch_default_data_collator(features)
    134             batch[k] = torch.tensor(np.stack([f[k] for f in features]))
    135         else:
--> 136             batch[k] = torch.tensor([f[k] for f in features])
    138 return batch

ValueError: expected sequence of length 312 at dim 1 (got 398)

Expected behavior

I added the following cell to process the train_dataset before passing it (as train_batched) to the trainer, but I am not sure if this is the correct way to go about it:

from llama_recipes.data.concatenator import ConcatDataset

context_length = 2048

# Pack the tokenized samples into fixed-length chunks of context_length tokens
train_batched = ConcatDataset(train_dataset, chunk_size=context_length)
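
For reference, the packed dataset then replaces train_dataset in the Trainer call; a minimal sketch, assuming the quickstart notebook's variable names (model, training_args, profiler_callback):

# Sketch only: wire the packed dataset into the HF Trainer (variable names
# assumed to match the quickstart notebook).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_batched,          # fixed-length chunks
    data_collator=default_data_collator,  # safe now: all rows have equal length
    callbacks=[profiler_callback] if enable_profiler else [],
)
trainer.train()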

Nomiizz avatar Nov 03 '23 20:11 Nomiizz

I was having the same problem on a 4090, but with the workaround I then ran out of CUDA memory. Reducing context_length in the workaround above to 1024 allowed training to proceed.

Thanks Nomiizz!

rkaunismaa avatar Nov 06 '23 15:11 rkaunismaa

Same issue. The main quickstart notebook fails with ValueError: expected sequence of length 312 at dim 1 (got 398) on recent package versions with the Hugging Face samsum dataset (datasets 2.14.7, transformers 4.35.2, torch 2.0.1), and I have no idea where in the dataset pipeline the error happens.

macqueen09 avatar Nov 17 '23 02:11 macqueen09

@Nomiizz, that's a workaround. If you read the code of ConcatDataset, you will see that it concatenates all input_ids and slices them into chunks of context_length. For example, given inputs [[1,2,3,4,5],[6,7,8],[9,10,11],[12,13,14,15]] and context_length=5, the output is:

[1,2,3,4,5]
[6,7,8,9,10]
[11,12,13,14,15]

This means the model will see training sequences like:

<bos>"Summarize this dialog:\n{{dialog1}}\n---\nSummary:\n"+{{sumary1}}+ "Summarize this dialog:\n{{dialog2}}\n---\nSummary2:\n"

This may or may not affect the results. Llama 2 itself is trained without padding (I assume they do something like ConcatDataset so that all inputs are the same length and no padding is needed). But for our use case, I think we should stick with padding.

There are many ways to pad. The simplest is to pad everything to a fixed maximum length, such as the length of the longest document, but that wastes GPU compute. The better approach is dynamic padding: pad each batch to the longest sequence in that batch (not in the whole dataset).

The notebook uses default_data_collator, which does not pad. So if you don't use something like ConcatDataset, it will fail to build a tensor from sequences of different lengths.
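
You can reproduce the failure in isolation; a minimal sketch (only assumes a batch of two un-padded samples):

# The default collator ends up calling torch.tensor on a ragged list of
# input_ids, which cannot form a rectangular tensor.
import torch

features = [[1, 2, 3], [4, 5]]  # two samples with different token counts
torch.tensor(features)          # ValueError: expected sequence of length 3 at dim 1 (got 2)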

A better option is DataCollatorWithPadding, but it only pads input_ids and attention_mask, not labels. So you have to implement your own data collator. You can use this one, adapted from this issue:

from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Union

from transformers import BatchEncoding, PreTrainedTokenizerBase
from transformers.utils import PaddingStrategy

@dataclass
class MyDataCollatorWithPadding:
    """
    Data collator that will dynamically pad the inputs received.

    Args:
        tokenizer ([`PreTrainedTokenizer`] or [`PreTrainedTokenizerFast`]):
            The tokenizer used for encoding the data.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `True`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding index)
            among:

            - `True` or `'longest'` (default): Pad to the longest sequence in the batch (or no padding if only a single
              sequence is provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
              acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'`: No padding (i.e., can output a batch with sequences of different lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        pad_to_multiple_of (`int`, *optional*):
            If set will pad the sequence to a multiple of the provided value.

            This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >=
            7.5 (Volta).
        return_tensors (`str`, *optional*, defaults to `"pt"`):
            The type of Tensor to return. Allowable values are "np", "pt" and "tf".
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    return_tensors: str = "pt"

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Pad input_ids and attention_mask according to the chosen padding strategy.
        padding_features = [{key: val for key, val in row.items() if key in ['input_ids', 'attention_mask']} for row in features]
        
        batch = self.tokenizer.pad(
            padding_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=None,
        )

        # Pad labels the same way by treating them as input_ids. Note that this
        # pads labels with the tokenizer's pad token id rather than -100, so the
        # padded positions are not masked out of the loss; adjust if needed.
        batch['labels'] = self.tokenizer.pad(
            [{'input_ids': row['labels']} for row in features],
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors=None,
        )['input_ids']

        # Carry over any remaining keys unchanged.
        for row in features:
            for key, value in row.items():
                if key in ['input_ids','attention_mask','labels']:
                    continue
                if key not in batch:
                    batch[key] = []
                batch[key].append(value)

        return BatchEncoding(batch, tensor_type=self.return_tensors)

Then change the Trainer setup like this:

data_collator = MyDataCollatorWithPadding(tokenizer=tokenizer,
                                          padding="longest", max_length=4096)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    # data_collator=default_data_collator,
    data_collator=data_collator,
    callbacks=[profiler_callback] if enable_profiler else [],
)
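
One caveat: tokenizer.pad needs a pad token, and the Llama tokenizer does not define one out of the box. If the notebook hasn't already configured one, something like the following is a common fix (an assumption on my part, depending on how you set up the tokenizer):

# Assumption: only needed if no pad token has been configured yet.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token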

fancyerii avatar Jan 10 '24 08:01 fancyerii

I encountered the same problem.

guyuchao avatar Mar 03 '24 07:03 guyuchao