
LoRA model tokenizer configuration fails to load

Open michaelmior opened this issue 1 year ago • 9 comments

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mmior/.local/share/virtualenvs/json-descriptions-cLP9Kwl8/lib/python3.10/site-packages/litgpt/tokenizer.py", line 39, in __init__
    self.bos_id = self.token_to_id(bos_token) if bos_token is not None else None
  File "/home/mmior/.local/share/virtualenvs/json-descriptions-cLP9Kwl8/lib/python3.10/site-packages/litgpt/tokenizer.py", line 62, in token_to_id
    id_ = self.processor.token_to_id(token)
TypeError: argument 'token': 'dict' object cannot be converted to 'PyString'

This happens because the values of bos_token and eos_token in tokenizer_config.json are not strings, but dictionaries like the following:

  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }

If I modify the tokenizer to check for a dictionary and use its content field, things seem to work fine.
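
A minimal sketch of that check, assuming the config has already been parsed from tokenizer_config.json (the helper name and checkpoint path are illustrative, not the actual patch):

import json
from pathlib import Path

def special_token_to_str(token):
    # tokenizer_config.json may store a special token either as a plain
    # string ("<s>") or as an AddedToken dict; in the latter case the
    # string we need is under the "content" key.
    if isinstance(token, dict):
        return token.get("content")
    return token

config = json.loads((Path("checkpoints") / "tokenizer_config.json").read_text())
bos_token = special_token_to_str(config.get("bos_token"))  # "<s>"
eos_token = special_token_to_str(config.get("eos_token"))  # "</s>"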

michaelmior avatar Apr 01 '24 16:04 michaelmior

Which tokenizer config from Hugging Face are you trying to load?

carmocca avatar Apr 02 '24 17:04 carmocca

This isn't from Hugging Face; it's the configuration written out by LoRA finetuning with litgpt finetune lora.

michaelmior avatar Apr 02 '24 17:04 michaelmior

{
  "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + '[INST] ' + content | trim + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' '  + content | trim + ' ' + eos_token }}{% endif %}{% endfor %}",
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "legacy": null,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "sp_model_kwargs": {},
  "tokenizer_class": "CodeLlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

michaelmior avatar Apr 02 '24 17:04 michaelmior

When you finetune, you load existing Hugging Face Hub weights and a tokenizer. LitGPT then copies the tokenizer into your finetuned output so that it can be loaded in subsequent steps.

Did you manually copy over a different tokenizer or modify it yourself?

tokenizer.py is a tiny shim over Hugging Face's tokenizers, so we haven't tried to support every possible tokenization config, just the ones used by the checkpoints we support. If you are running a "custom" tokenizer, the code will need an update to check these different fields.

carmocca avatar Apr 02 '24 17:04 carmocca

Just saw your last message. It looks like litgpt is treating this as an HF tokenizer instead of a SentencePiece tokenizer, so this line must be resolving to False: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L21
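
Paraphrasing that selection (names illustrative, not the exact source):

from pathlib import Path

def pick_backend(checkpoint_dir: Path) -> str:
    # Paraphrase of the linked check: with no tokenizer.model present,
    # litgpt falls back to the Hugging Face `tokenizers` backend, which
    # is the code path that trips over the AddedToken dicts above.
    if (checkpoint_dir / "tokenizer.model").is_file():
        return "sentencepiece"
    if (checkpoint_dir / "tokenizer.json").is_file():
        return "huggingface"
    raise NotImplementedError(f"no supported tokenizer in {checkpoint_dir}")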

carmocca avatar Apr 02 '24 17:04 carmocca

@carmocca It is true that there is no tokenizer.model; instead there are tokenizer.json and tokenizer_config.json. To clarify, this is just the output from litgpt; I didn't modify any of it or copy anything from Hugging Face.
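
For reference, a quick way to confirm which files ended up in the output directory (the path below is hypothetical):

from pathlib import Path

out_dir = Path("out/finetune/lora/final")  # hypothetical output path
for name in ("tokenizer.model", "tokenizer.json", "tokenizer_config.json"):
    print(name, (out_dir / name).exists())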

michaelmior avatar Apr 02 '24 17:04 michaelmior

Which --checkpoint_dir did you use with LoRA? I can try to follow the same steps you did and see if I end up with the same error.

carmocca avatar Apr 02 '24 21:04 carmocca

@carmocca I downloaded codellama/CodeLlama-7b-Instruct-hf with litgpt download and then used that checkpoint.

michaelmior avatar Apr 03 '24 14:04 michaelmior