TypeError: TextInputSequence must be str
Bug description
⚡ ~ litgpt finetune_lora meta-llama/Llama-3.2-1B --data JSON --data.json_path sanksrit-dataset.json --data.val_split_fraction 0.1 --train.epochs 1 --out_dir out/llama-3.2-finetuned --precision bf16-true > res
Seed set to 1337
Traceback (most recent call last):
File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
sys.exit(main())
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/__main__.py", line 71, in main
CLI(parser_data)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
return _run_component(component, init.get(subcommand))
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
return component(**cfg)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 169, in setup
fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 843, in launch
return self._wrap_and_launch(function, self, *args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 929, in _wrap_and_launch
return to_run(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 934, in _wrap_with_setup
return to_run(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 218, in main
fit(
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 270, in fit
longest_seq_length, longest_seq_ix = get_longest_seq_length(ConcatDataset([train_dataloader.dataset, val_dataloader.dataset]))
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 438, in get_longest_seq_length
lengths = [len(d["input_ids"]) for d in data]
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 438, in <listcomp>
lengths = [len(d["input_ids"]) for d in data]
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
return self.datasets[dataset_idx][sample_idx]
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/data/base.py", line 83, in __getitem__
encoded_response = self.tokenizer.encode(example["output"], bos=False, eos=True, max_length=self.max_seq_length)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/tokenizer.py", line 114, in encode
tokens = self.processor.encode(string).ids
TypeError: TextInputSequence must be str
Dataset looks like:
[
{
"input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे ॥ (१)",
"output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
"instruction": "Convert Sanskrit Text to English"
},
......
]
What operating system are you using?
Linux
LitGPT Version
0.4.13
Hi there, could you try this with a very small text example that only consists of a few entries, e.g., repeated versions of the entry you showed:
[
{
"input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे ॥ (१)",
"output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
"instruction": "Convert Sanskrit Text to English"
},
{
"input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे ॥ (१)",
"output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
"instruction": "Convert Sanskrit Text to English"
},
{
"input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे ॥ (१)",
"output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
"instruction": "Convert Sanskrit Text to English"
},
{
"input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे ॥ (१)",
"output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
"instruction": "Convert Sanskrit Text to English"
}
]
This is just to further find out if the issue is because of non-Latin characters in the input field or maybe because some of the fields potentially have other formatting issues.
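One way to check the second possibility is to scan the file for entries whose fields are not plain strings. The traceback shows the error coming from the tokenizer's `encode` call on `example["output"]`, and that call rejects anything that is not a `str` (a JSON `null`, for instance, is loaded as `None`). The following is only a minimal sketch, assuming the Alpaca-style keys shown above; `find_non_string_fields` is a hypothetical helper and not part of LitGPT, and the field list may need adjusting if some keys are optional in your data.

```python
import json

# Keys used by the Alpaca-style entries shown above (an assumption about your data).
FIELDS = ("instruction", "input", "output")


def find_non_string_fields(records, fields=FIELDS):
    """Return (index, field, value) for every field that is missing or not a str."""
    problems = []
    for i, record in enumerate(records):
        for field in fields:
            value = record.get(field)  # None if the key is absent
            if not isinstance(value, str):
                problems.append((i, field, value))
    return problems


if __name__ == "__main__":
    # Inline sample standing in for the real file; a JSON null becomes None.
    sample = json.loads(
        '[{"instruction": "Convert Sanskrit Text to English", "input": "x", "output": "y"},'
        ' {"instruction": "Convert Sanskrit Text to English", "input": "x", "output": null}]'
    )
    for index, field, value in find_non_string_fields(sample):
        print(f"entry {index}: {field!r} is {value!r}")
```

To run it against a real dataset, replace the inline sample with `json.load` on the file passed to `--data.json_path`.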
@hemanth Any update on this!? I am getting the exact same error -- if it's about the character encoding then I would do something...
@AayushSameerShah is your data the same as @hemanth 's example?
@craigpfeifer It has the same format (Alpaca). But the news is I figured it out: some rows in my data had NULL values. I did a quick df.dropna and now it works just fine.
:)
In my data there were some rows which had NULL values. So did a quick df.dropna and then it works just fine.
Yes, that sounds like the right way to do it.
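For anyone whose data is a JSON file rather than a DataFrame, the same cleanup can be sketched with the standard library. This is a stand-in for the `df.dropna` step mentioned above, not the exact code that was used; `drop_incomplete`, `clean_file`, and the file names in the usage note are hypothetical, and the field list assumes the Alpaca-style keys from the original report.

```python
import json


def drop_incomplete(records, fields=("instruction", "input", "output")):
    """Keep only the records in which every listed field is a str."""
    return [r for r in records if all(isinstance(r.get(f), str) for f in fields)]


def clean_file(src_path, dst_path):
    """Read a JSON list from src_path, drop incomplete rows, write the rest to dst_path.

    Returns the number of rows that were dropped.
    """
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    cleaned = drop_incomplete(records)
    with open(dst_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Devanagari text readable in the output file.
        json.dump(cleaned, f, ensure_ascii=False, indent=2)
    return len(records) - len(cleaned)
```

Calling `clean_file("dataset.json", "dataset-clean.json")` would write the filtered entries to the second path, which can then be passed to `--data.json_path`.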