
Obtaining an Exception "KeyError: 'labels'" while fine-tuning Whisper


System Info

WSL 2.0, Ubuntu 22.04, transformers 4.44.2, Python 3.10

Who can help?

@sanchit-gandhi

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Use the example from https://huggingface.co/blog/fine-tune-whisper as is, without modification.
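
For context, the blog post maps a `prepare_dataset` function over the dataset so that every example gains the `"labels"` key that the data collator later reads. A minimal sketch of that step as I understand it from the post (the variable names `feature_extractor`, `tokenizer`, and `common_voice` follow the blog and may differ slightly in my notebook):

```python
def prepare_dataset(batch):
    # load and resample the audio data from 48 to 16 kHz
    audio = batch["audio"]

    # compute log-Mel input features from the audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # encode the target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch


common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4
)
```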

The following exception is raised:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[22], line 1
----> 1 trainer.train()

File ~/.local/lib/python3.10/site-packages/transformers/trainer.py:1929, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1926 try:
   1927     # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
   1928     hf_hub_utils.disable_progress_bars()
-> 1929     return inner_training_loop(
   1930         args=args,
   1931         resume_from_checkpoint=resume_from_checkpoint,
   1932         trial=trial,
   1933         ignore_keys_for_eval=ignore_keys_for_eval,
   1934     )
   1935 finally:
   1936     hf_hub_utils.enable_progress_bars()

File ~/.local/lib/python3.10/site-packages/transformers/trainer.py:2236, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2233     rng_to_sync = True
   2235 step = -1
-> 2236 for step, inputs in enumerate(epoch_iterator):
   2237     total_batched_samples += 1
   2239     if self.args.include_num_input_tokens_seen:

File ~/.local/lib/python3.10/site-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
    452 # We iterate one batch ahead to check when we are at the end
    453 try:
--> 454     current_batch = next(dataloader_iter)
    455 except StopIteration:
    456     yield

File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
    628 if self._sampler_iter is None:
    629     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    630     self._reset()  # type: ignore[call-arg]
--> 631 data = self._next_data()
    632 self._num_yielded += 1
    633 if self._dataset_kind == _DatasetKind.Iterable and \
    634         self._IterableDataset_len_called is not None and \
    635         self._num_yielded > self._IterableDataset_len_called:

File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:675, in _SingleProcessDataLoaderIter._next_data(self)
    673 def _next_data(self):
    674     index = self._next_index()  # may raise StopIteration
--> 675     data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    676     if self._pin_memory:
    677         data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

File ~/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
     52 else:
     53     data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)

Cell In[17], line 18, in DataCollatorSpeechSeq2SeqWithPadding.__call__(self, features)
     15 batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
     17 # get the tokenized label sequences
---> 18 label_features = [{"input_ids": feature["labels"]} for feature in features]
     19 # pad the labels to max length
     20 labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

Cell In[17], line 18, in <listcomp>(.0)
     15 batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
     17 # get the tokenized label sequences
---> 18 label_features = [{"input_ids": feature["labels"]} for feature in features]
     19 # pad the labels to max length
     20 labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

KeyError: 'labels'
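
The error indicates that the batches reaching the collator have no `"labels"` key. For reference, a quick sanity check on the mapped dataset (the variable name `common_voice` is assumed from the blog post) that would show whether the column is present before training:

```python
# Sanity check (sketch; `common_voice` is the DatasetDict name used in the blog post).
# After the .map(prepare_dataset, ...) step, the train split should expose exactly the
# keys the collator indexes: "input_features" and "labels".
print(common_voice["train"].column_names)  # expect ['input_features', 'labels']
print(common_voice["train"][0].keys())     # expect dict_keys(['input_features', 'labels'])
```

If `"labels"` were missing here, it would mean the mapping step did not run or its output columns were dropped.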

Expected behavior

Model training starts without errors.

artyomboyko · Aug 27 '24 20:08