Obtaining an Exception "KeyError: 'labels'" while fine-tuning Whisper
System Info
WSL 2.0, Ubuntu 22.04, transformers 4.44.2, Python 3.10
Who can help?
@sanchit-gandhi
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
- Use the example from https://huggingface.co/blog/fine-tune-whisper as is, without modification.
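For reference, Cell In[17] in the traceback below corresponds to the blog's data collator, roughly the following sketch (details may differ slightly between revisions of the guide):

```python
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have different lengths and need
        # different padding methods; first pad the audio input features
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences -- this is the line that raises KeyError: 'labels'
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so these positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was appended during tokenization, cut it here,
        # since it is appended again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```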
Calling trainer.train() then raises the following exception:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[22], line 1
----> 1 trainer.train()
File ~/.local/lib/python3.10/site-packages/transformers/trainer.py:1929, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1926 try:
1927 # Disable progress bars when uploading models during checkpoints to avoid polluting stdout
1928 hf_hub_utils.disable_progress_bars()
-> 1929 return inner_training_loop(
1930 args=args,
1931 resume_from_checkpoint=resume_from_checkpoint,
1932 trial=trial,
1933 ignore_keys_for_eval=ignore_keys_for_eval,
1934 )
1935 finally:
1936 hf_hub_utils.enable_progress_bars()
File ~/.local/lib/python3.10/site-packages/transformers/trainer.py:2236, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2233 rng_to_sync = True
2235 step = -1
-> 2236 for step, inputs in enumerate(epoch_iterator):
2237 total_batched_samples += 1
2239 if self.args.include_num_input_tokens_seen:
File ~/.local/lib/python3.10/site-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
452 # We iterate one batch ahead to check when we are at the end
453 try:
--> 454 current_batch = next(dataloader_iter)
455 except StopIteration:
456 yield
File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:631, in _BaseDataLoaderIter.__next__(self)
628 if self._sampler_iter is None:
629 # TODO(https://github.com/pytorch/pytorch/issues/76750)
630 self._reset() # type: ignore[call-arg]
--> 631 data = self._next_data()
632 self._num_yielded += 1
633 if self._dataset_kind == _DatasetKind.Iterable and \
634 self._IterableDataset_len_called is not None and \
635 self._num_yielded > self._IterableDataset_len_called:
File ~/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py:675, in _SingleProcessDataLoaderIter._next_data(self)
673 def _next_data(self):
674 index = self._next_index() # may raise StopIteration
--> 675 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
676 if self._pin_memory:
677 data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
File ~/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py:54, in _MapDatasetFetcher.fetch(self, possibly_batched_index)
52 else:
53 data = self.dataset[possibly_batched_index]
---> 54 return self.collate_fn(data)
Cell In[17], line 18, in DataCollatorSpeechSeq2SeqWithPadding.__call__(self, features)
15 batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
17 # get the tokenized label sequences
---> 18 label_features = [{"input_ids": feature["labels"]} for feature in features]
19 # pad the labels to max length
20 labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
Cell In[17], line 18, in <listcomp>(.0)
15 batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
17 # get the tokenized label sequences
---> 18 label_features = [{"input_ids": feature["labels"]} for feature in features]
19 # pad the labels to max length
20 labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
KeyError: 'labels'
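For context, the failing line expects every dataset example to already carry a "labels" key, which the blog creates in its prepare_dataset map step before the Trainer is built. A rough sketch of that step, following the blog's naming (feature_extractor, tokenizer, the common_voice DatasetDict and its "sentence" column are assumed from earlier cells), together with a column-name check:

```python
from transformers import WhisperFeatureExtractor, WhisperTokenizer

# processor parts as loaded earlier in the blog (whisper-small / Hindi there;
# adjust the checkpoint and language to whatever you are fine-tuning)
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")


def prepare_dataset(batch):
    # load the (already resampled, 16 kHz) audio
    audio = batch["audio"]

    # compute log-Mel input features from the raw audio array
    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    # encode the target transcription to label ids -- this is the step that
    # creates the "labels" key the data collator later reads
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch


# common_voice is the DatasetDict prepared in the earlier cells of the blog
common_voice = common_voice.map(
    prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4
)

# sanity check before building the Trainer: each split should now expose
# exactly ['input_features', 'labels']
print(common_voice["train"].column_names)
```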
Expected behavior
Model training starts, as in the blog post.