
Llama3 finetuning and generation: Double begin_of_text, no eot_id

Open sanderland opened this issue 1 year ago • 10 comments

Bug description

When finetuning Llama3, the encoded data has:

  • Duplicate <|begin_of_text|> at the start
    • Tracked down to the prompt template and the HF tokenizer each adding one.
  • No <|eot_id|> at the end during training -> #1694

Seems related to #1565, but may be more widespread across models.
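
The duplication can be reproduced outside of litgpt with a minimal sketch (this assumes access to the gated Meta-Llama-3.1-8B tokenizer; the prompt string below is a shortened stand-in for the rendered llama3 template):

from tokenizers import Tokenizer

# The llama3 prompt style already embeds <|begin_of_text|> in the rendered
# prompt string, and the tokenizer's post-processor prepends another one.
tok = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
prompt = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|>"
print(tok.encode(prompt).ids[:2])  # [128000, 128000] -> two BOS tokens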

Going by the example which downloads alpaca finance:

litgpt finetune_full meta-llama/Meta-Llama-3.1-8B-Instruct \
  --config configs/llama31-8b.yaml \
  --data JSON \
  --data.json_path my_custom_dataset.json \
  --data.mask_prompt True \
  --data.prompt_style llama3 \
  --data.val_split_fraction 0.05

and adding this to full.py along with support for skip_special_tokens=False

        if fabric.global_rank == 0 and state["iter_num"] == 1:
            # Print the first training batch once to inspect the encoded data.
            non_pad_ids = input_ids[0][input_ids[0] != 0]  # assumes the pad token id is 0
            fabric.print(f"First row of input ids with total shape {input_ids.shape}: {non_pad_ids}")
            fabric.print(f"Detokenized: {tokenizer.decode(non_pad_ids, skip_special_tokens=False)}")

gives

First row of input ids with total shape torch.Size([4, 765]): tensor([128000, 128000, 128006,   9125, 128007,    271,   264, [...] 459,   9341,     13]
Detokenized: <|begin_of_text|><|begin_of_text|><|start_header_id|> [..] accurate valuation of an investment.

What operating system are you using?

Unknown

LitGPT Version

(close to) main

sanderland avatar Aug 20 '24 13:08 sanderland

Thanks for raising that. I need to investigate in the next few days.

rasbt avatar Aug 20 '24 16:08 rasbt

When you mentioned

(close to) main

could you check the version? Asking because I don't think that skip_special_tokens is a valid argument.

rasbt avatar Aug 21 '24 19:08 rasbt

When you mentioned

(close to) main

could you check the version? Asking because I don't think that skip_special_tokens is a valid argument.

version = "0.4.10", but when I said

adding this to full.py along with support for skip_special_tokens=False

I meant I added that option to help debug.

sanderland avatar Aug 21 '24 20:08 sanderland

Ah yes, the reason I was asking is that I was getting a

TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'

and I was wondering where you applied this.

rasbt avatar Aug 21 '24 21:08 rasbt

You can see my (somewhat messy) branch here: https://github.com/Lightning-AI/litgpt/compare/main...sanderland:dev?expand=1

sanderland avatar Aug 22 '24 08:08 sanderland

Ah, thanks! I still don't understand why this doesn't work for me; I get TypeError: Tokenizer.decode() got an unexpected keyword argument 'skip_special_tokens'. I need to investigate more (maybe a version issue).

Anyway, I just double-checked the generate_example function, and for the prompt

What food do llamas eat?

The actual prompt that is passed to the tokenizer looks like this during finetuning with the default Alpaca style:

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Recommend a movie for me to watch during the weekend and explain the reason.

### Response:

and then with the --data.prompt_style llama3 setting you were using:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Recommend a movie for me to watch during the weekend and explain the reason.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

So that part at least looks all ok to me.
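
For reference, the llama3-style prompt above can be reproduced programmatically; a minimal sketch, assuming litgpt's prompt-style registry exposes a "llama3" entry via PromptStyle.from_name:

from litgpt.prompts import PromptStyle

# Render the same chat template that the finetuning data module applies.
style = PromptStyle.from_name("llama3")
rendered = style.apply("Recommend a movie for me to watch during the weekend and explain the reason.")
print(rendered)  # starts with <|begin_of_text|><|start_header_id|>system<|end_header_id|>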

rasbt avatar Aug 23 '24 01:08 rasbt

skip_special_tokens is a parameter in the Hugging Face tokenizer, but not in litgpt; I just added the pass-through to debug.

The prompt string may well be correct, but that doesn't mean the result of encode() is:

from tokenizers import Tokenizer as HFTokenizer
processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
print(processor.encode("prompt").ids)  # [128000, 41681] = "<|begin_of_text|>", "prompt"

That is, the tokenizer itself applies a post-processing template that adds "<|begin_of_text|>".
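
That post-processing step can be suppressed with the add_special_tokens flag of tokenizers' encode(); a quick sketch:

from tokenizers import Tokenizer as HFTokenizer

processor = HFTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# With add_special_tokens=False the post-processing template is skipped:
print(processor.encode("prompt", add_special_tokens=False).ids)  # [41681], no BOS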

sanderland avatar Aug 23 '24 07:08 sanderland

This is another confusing point: https://github.com/Lightning-AI/litgpt/blob/main/litgpt/tokenizer.py#L91. The tokenizer has special logic to add a BOS token for llama3, but both the Hugging Face tokenizer AND the prompt template already add one. At least it checks first, so we don't end up with three.
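
In other words, the guard only prevents a third BOS; it does not deduplicate. A paraphrased sketch of the check (a hypothetical helper, not the actual litgpt code):

def add_bos_if_missing(tokens: list[int], bos_id: int = 128000) -> list[int]:
    # Prepend <|begin_of_text|> only if it is not already the first token.
    return tokens if tokens[:1] == [bos_id] else [bos_id] + tokens

print(add_bos_if_missing([128000, 128000, 9125]))  # the existing duplicate survives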

sanderland avatar Aug 23 '24 08:08 sanderland

Actually, I am curious how finetuning can work at all, given https://github.com/Lightning-AI/litgpt/issues/1699.

calvintwr avatar Sep 03 '24 11:09 calvintwr

Going by the example which downloads alpaca finance:

litgpt finetune_full meta-llama/Meta-Llama-3.1-8B-Instruct \
  --config configs/llama31-8b.yaml \
  --data JSON \
  --data.json_path my_custom_dataset.json \
  --data.mask_prompt True \
  --data.prompt_style llama3 \
  --data.val_split_fraction 0.05

Could you please update the example?

Borda avatar Jun 23 '25 11:06 Borda