Issue with Dolly Dataloader: `context` key not found!
Bug description
I ran into the following issue while running LoRA fine-tuning.
Stack Trace
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/litgpt/data/base.py", line 80, in __getitem__
    example = self.transform(example)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lunit_haseebraja/miniconda3/envs/lora_tests/lib/python3.11/site-packages/litgpt/data/dolly.py", line 74, in _transform
    item["input"] = item.pop("context")
                    ^^^^^^^^^^^^^^^^^^^
KeyError: 'context'
Command
litgpt finetune_lora checkpoints/EleutherAI/pythia-70m --data Dolly --precision 16-mixed --data.num_workers 4 --train.global_batch_size 1 --train.max_seq_length 512 --data.val_split_fraction 0.0
I spent some time debugging it. It seems that the `_transform` method is called twice at the beginning for some reason. On the second call, the keys are no longer there because we use `pop`. It does work with `get`, though.
In src/litgpt/litgpt/data/dolly.py (commented parts are for debugging):
# import sys
# from pprint import pprint

def _transform(idx: int, item: dict) -> dict:
    # if "context" not in item.keys():
    #     print(f"{idx}: Missing Key!")
    #     pprint(item)
    #     sys.exit()
    item["input"] = item.pop("context")
    item["output"] = item.pop("response")
    return item
I couldn't figure out why it is being called twice though.
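The failure mode described above can be reproduced outside LitGPT. The following is only a minimal sketch, assuming the same dict object reaches the transform twice (the cause of the double call is not established here). The names `transform_with_pop` and `transform_without_mutation` are hypothetical and are not part of LitGPT; the second function is one possible workaround, not necessarily the fix that should go into the library.

```python
def transform_with_pop(item: dict) -> dict:
    # Same pattern as the line in the traceback: mutates the dict it receives.
    item["input"] = item.pop("context")
    item["output"] = item.pop("response")
    return item


def transform_without_mutation(item: dict) -> dict:
    # Hypothetical alternative: build a new dict with renamed keys and
    # leave the caller's dict untouched, so repeated calls are harmless.
    renamed = {"context": "input", "response": "output"}
    return {renamed.get(key, key): value for key, value in item.items()}


# Made-up record with the three keys used in this sketch.
record = {"instruction": "Summarize.", "context": "Some text.", "response": "A summary."}

# Calling the mutating transform twice on the same object fails on the second call.
shared = dict(record)
transform_with_pop(shared)
try:
    transform_with_pop(shared)
    second_call_failed = False
    missing_key = None
except KeyError as err:
    second_call_failed = True
    missing_key = err.args[0]

# The non-mutating variant can be applied to the same source dict repeatedly.
safe = dict(record)
once = transform_without_mutation(safe)
again = transform_without_mutation(safe)

print(second_call_failed, missing_key)
print(once == again, "context" in safe)
```

The non-mutating variant also passes an already-transformed dict through unchanged, which makes it safe even if the transform's own output is fed back in.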
What operating system are you using?
macOS
LitGPT Version
Tested on two versions and on two platforms (macOS and Linux):
litgpt 0.4.13
litgpt 0.4.14.dev1
@rasbt Maybe you can take a look if you have some time. I think the original implementation was done by you (if I am not mistaken).
Thanks for the note. Not sure what happened there. Maybe I forgot to adjust the dataset as we updated the data loader. I will try to take a look next week. (In the meantime, if you got it to work, I'd appreciate a PR)
@rasbt I will fix it within one to two days and create a PR.
@rasbt could you assign this issue to me before I begin?
Of course, happy to assign you (I just see that @Andrei-Aksionov already beat me to it though 😅)
Hello @pytholic, how is it going? Still aiming to fix it? :)
Hello! Wow, it has been a long time. As far as I remember, the fix was ready, but I was unable to run the tests because of another issue that @rasbt was fixing at the time.
I haven't been able to follow up on it since then.