litgpt TypeError: TextInputSequence must be str

Bug description

⚡ ~ litgpt finetune_lora meta-llama/Llama-3.2-1B   --data JSON   --data.json_path sanksrit-dataset.json   --data.val_split_fraction 0.1   --train.epochs 1   --out_dir out/llama-3.2-finetuned   --precision bf16-true > res
Seed set to 1337
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/litgpt", line 8, in <module>
    sys.exit(main())
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/__main__.py", line 71, in main
    CLI(parser_data)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 119, in CLI
    return _run_component(component, init.get(subcommand))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 169, in setup
    fabric.launch(main, devices, seed, config, data, checkpoint_dir, out_dir, train, eval, optimizer)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 843, in launch
    return self._wrap_and_launch(function, self, *args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 929, in _wrap_and_launch
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/lightning/fabric/fabric.py", line 934, in _wrap_with_setup
    return to_run(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 218, in main
    fit(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 270, in fit
    longest_seq_length, longest_seq_ix = get_longest_seq_length(ConcatDataset([train_dataloader.dataset, val_dataloader.dataset]))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 438, in get_longest_seq_length
    lengths = [len(d["input_ids"]) for d in data]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/finetune/lora.py", line 438, in <listcomp>
    lengths = [len(d["input_ids"]) for d in data]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/data/dataset.py", line 335, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/data/base.py", line 83, in __getitem__
    encoded_response = self.tokenizer.encode(example["output"], bos=False, eos=True, max_length=self.max_seq_length)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/litgpt/tokenizer.py", line 114, in encode
    tokens = self.processor.encode(string).ids
TypeError: TextInputSequence must be str

Dataset looks like:

[
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  ......
]

What operating system are you using?

Linux

LitGPT Version

0.4.13

Sep 28 '24 18:09 hemanth

Hi there, could you try this with a very small text example that only consists of a few entries, e.g., repeated versions of the entry you showed:

[
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  },
  {
    "input": "ये त्रि॑ष॒प्ताः प॑रि॒यन्ति॒ विश्वा॑ रू॒पाणि॒ बिभ्र॑तः । वा॒चस्पति॒र्बला॒ तेषां॑ त॒न्वो॑ अ॒द्य द॑धातु मे  ॥ (१)",
    "output": "The three qualities of Rajogun, Tamogun and Satogun and earth, water, tej, air, sky, tanmatra and ego, the seven substances travel everywhere in divine form, brahma, the swami of speech, give me the divine power of those elements and substances. (1)",
    "instruction": "Convert Sanskrit Text to English"
  }
]

This is just to narrow down whether the issue is caused by the non-Latin characters in the input field or by formatting problems in some of the other entries.
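One way to check the second possibility without running a finetune is to scan the file for fields that are not strings. The sketch below is only an illustration: the helper name `find_non_string_fields` is hypothetical (not part of LitGPT), and it assumes the Alpaca-style keys shown in the sample above.

```python
import json

# Alpaca-style keys, as used in the sample entry above (assumption: no other keys matter).
FIELDS = ("instruction", "input", "output")


def find_non_string_fields(records):
    """Return (index, field, type name) for each present field whose value is not a str."""
    problems = []
    for i, record in enumerate(records):
        for field in FIELDS:
            # Only inspect keys that exist; a null or numeric value is what
            # would reach the tokenizer as a non-string.
            if field in record and not isinstance(record[field], str):
                problems.append((i, field, type(record[field]).__name__))
    return problems


# In-memory demo; for a real file, load it with
# json.load(open(path, encoding="utf-8")) instead.
sample = json.loads(
    '[{"instruction": "Convert Sanskrit Text to English", "input": "ये", "output": null}]'
)
print(find_non_string_fields(sample))  # [(0, 'output', 'NoneType')]
```

If this reports nothing for the full dataset, the non-Latin text becomes the more likely suspect.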

Oct 02 '24 12:10 rasbt

@hemanth Any update on this? I am getting the exact same error -- if it's about the character encoding, then I could do something about it...

Apr 07 '25 16:04 AayushSameerShah

@AayushSameerShah is your data the same as @hemanth's example?

Apr 07 '25 17:04 craigpfeifer

@craigpfeifer It has the same format (Alpaca). The good news is that I figured it out. In my data there were some rows which had NULL values. So did a quick df.dropna and then it works just fine.

:)

Apr 07 '25 17:04 AayushSameerShah

In my data there were some rows which had NULL values. So did a quick df.dropna and then it works just fine.

Yes, that sounds like the right way to do it.
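For anyone landing here with the same error, a minimal sketch of that cleanup with pandas could look like the following. The function names and file paths are hypothetical, and it assumes the dataset is a JSON list of Alpaca-style records like the one in the original report.

```python
import json

import pandas as pd


def drop_null_rows(records, columns=("instruction", "input", "output")):
    """Drop every record that has a null in one of the given columns."""
    df = pd.DataFrame(records)
    # Only pass columns that actually exist, so dropna does not raise KeyError.
    present = [c for c in columns if c in df.columns]
    # Note: a record that lacks a key entirely becomes NaN in the DataFrame
    # and is dropped as well.
    return df.dropna(subset=present).to_dict(orient="records")


def clean_json_file(src_path, dst_path):
    """Read a JSON list of records, drop rows with nulls, and write the result."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    cleaned = drop_null_rows(records)
    with open(dst_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps the Devanagari text readable in the output file.
        json.dump(cleaned, f, ensure_ascii=False, indent=2)
    return len(records) - len(cleaned)  # number of rows removed
```

Something like `clean_json_file("dataset.json", "dataset-clean.json")` would then produce a file to pass to `--data.json_path`. Empty strings are kept; only null values are removed.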

Apr 22 '25 11:04 Borda