Llama3 finetuning and generation: Double begin_of_text, no eot_id
Bug description
When finetuning Llama3, the encoded data has:
- Duplicate <|begin_of_text|> at the start
- Tracked down to the prompt template and the HF tokenizer each adding one.
- No <|eot_id|> at the end in training -> #1694
Seems related to #1565, but may be more widespread across models.
Going by the example which downloads alpaca finance:
litgpt finetune_full meta-llama/Meta-Llama-3.1-8B-Instruct \
--config configs/llama31-8b.yaml \
--data JSON \
--data.json_path my_custom_dataset.json \
--data.mask_prompt True \
--data.prompt_style llama3 \
--data.val_split_fraction 0.05
and adding this to full.py along with support for skip_special_tokens=False
if fabric.global_rank == 0 and state["iter_num"] == 1:
    non_pad_ids = input_ids[0][input_ids[0] != 0]  # assume pad token id is 0
    fabric.print(f"First row of input ids with total shape {input_ids.shape}: {non_pad_ids}")
    fabric.print(f"Detokenized: {tokenizer.decode(non_pad_ids, skip_special_tokens=False)}")
gives
First row of input ids with total shape torch.Size([4, 765]): tensor([128000, 128000, 128006, 9125, 128007, 271, 264, [...] 459, 9341, 13])
Detokenized: <|begin_of_text|><|begin_of_text|><|start_header_id|> [..] accurate valuation of an investment.
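To make the failure mode concrete, here is a small self-contained check in the same spirit (128000 and 128009 are the standard Llama 3 ids for <|begin_of_text|> and <|eot_id|>; the toy batch just mimics the encoding above):
import torch

BOS_ID, EOT_ID, PAD_ID = 128000, 128009, 0  # Llama 3 special ids; pad assumed 0 as above
# Toy batch mimicking the observed encoding: duplicated BOS, no trailing <|eot_id|>.
input_ids = torch.tensor([[128000, 128000, 128006, 9125, 128007, 271, 459, 9341, 13, 0, 0]])
row = input_ids[0][input_ids[0] != PAD_ID]
print("duplicated BOS:", row[0].item() == BOS_ID and row[1].item() == BOS_ID)  # True
print("ends with <|eot_id|>:", row[-1].item() == EOT_ID)  # False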
What operating system are you using?
Unknown
LitGPT Version
(close to) main
Thanks for raising that. I need to investigate in the next few days.
When you mentioned
(close to) main
could you check the version? Asking because I don't think that skip_special_tokens is a valid argument.
version = "0.4.10", but when I said
adding this to full.py along with support for skip_special_tokens=False
I meant I added that option to help debug.
Ah yes, the reason I was asking is that I was getting a
TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'
and I was wondering where you had applied it.
You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1
Ah, thanks! I still don't understand why this fails for me with TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'. I need to investigate more (maybe a version issue).
Anyway, I just double-checked the generate_example function, and for a prompt
What food do llamas eat?
The actual prompt that is passed to the tokenizer looks like this during finetuning with the default Alpaca style:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.
### Response:
and like this with the --data.prompt_style llama3 setting you were using:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Recommend a movie for me to watch during the weekend and explain the reason.<|eot_id|><|start_header_id|>assistant<|end_header_id|>
So that part, at least, looks OK to me.
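For reference, you can inspect the rendered prompt directly via the prompt-style API (a minimal sketch; assuming PromptStyle.from_name and apply behave as in current litgpt):
from litgpt.prompts import PromptStyle

style = PromptStyle.from_name("llama3")  # same style as --data.prompt_style llama3
prompt = style.apply("Recommend a movie for me to watch during the weekend and explain the reason.")
print(prompt)  # prints the <|begin_of_text|>... template shown above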
skip_special_tokens is a parameter in Hugging Face tokenizers, but not in litgpt; I just added the pass-through to debug.
As for the prompt being correct: that doesn't mean the result of encode() is:
from tokenizers import Tokenizer as HFTokenizer
processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
processor.encode("prompt").ids # [128000, 41681] = "<|begin_of_text|>" , "prompt"
That is, the tokenizer itself has a template that adds "<|begin_of_text|>".
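You can see the duplication directly (a sketch reusing processor from the snippet above; the templated string is the start of the llama3-style prompt shown earlier):
# The llama3 prompt style already embeds the literal BOS marker in the string:
templated = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>"
print(processor.encode(templated).ids[:2])  # [128000, 128000] -> two <|begin_of_text|> tokens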
This is another confusing point: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L91. The litgpt tokenizer has special logic to add a BOS token for Llama 3, but both the Hugging Face tokenizer AND the template already add one. At least it checks first, so it doesn't end up with three.
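For anyone following along, the check there looks roughly like this (a simplified paraphrase with abbreviated names, not the exact source):
tokens = processor.encode(string).ids      # HF tokenizer may already prepend a BOS
if use_bos:                                # litgpt's own "add BOS" logic for llama3
    if not tokens or tokens[0] != bos_id:  # only prepend if it isn't there already
        tokens = [bos_id] + tokens         # so this path never stacks a third BOS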
Actually, I am curious how finetuning can work at all right now, given https://github.com/Lightning-AI/litgpt/issues/1699
Going by the example which downloads alpaca finance:
litgpt finetune_full meta-llama/Meta-Llama-3.1-8B-Instruct \
--config configs/llama31-8b.yaml \
--data JSON \
--data.json_path my_custom_dataset.json \
--data.mask_prompt True \
--data.prompt_style llama3 \
--data.val_split_fraction 0.05
Could you please update the example?