
[Bug]: The final reason why you get a model that cannot stop generating when you fine-tune Qwen2.5-7b-base with LoRA and a non-<|endoftext|> token as eos_token.

hxs91 opened this issue 3 months ago · 9 comments

Model Series

Qwen2.5

What are the models used?

Qwen2.5-7b-base

What is the scenario where the problem happened?

SFT with the Hugging Face Trainer

Is this a known issue?

  • [X] I have followed the GitHub README.
  • [X] I have checked the Qwen documentation and cannot find an answer there.
  • [X] I have checked the documentation of the related framework and cannot find useful information.
  • [X] I have searched the issues and there is not a similar one.

Information about environment

Doesn't matter.

Log output

Doesn't matter.

Description

Steps to reproduce

  1. Use Qwen2.5-7b-base
  2. Modify its eos_token from <|endoftext|> to <|im_end|> (or any other special token) in tokenizer_config.json.
  3. Use LoRA to fine-tune the model on your downstream task; make sure your LoRA does not fine-tune the lm_head and embedding.
  4. The fine-tuning data follow the input, instruction, and output format; I concatenate them as input + instruction + output + eos_token, and the labels are built the same way (see the sketch after this list).
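
As a concrete illustration, here is a minimal sketch of steps 2-4 using Hugging Face transformers and peft; the model path, the LoRA hyperparameters, and the build_example helper are placeholder assumptions for illustration, not my exact setup.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B"  # placeholder path for the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.eos_token = "<|im_end|>"  # step 2: use a non-<|endoftext|> token as eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
# step 3: LoRA only on attention projections; lm_head and embed_tokens stay frozen
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# step 4: one training example is input + instruction + output + eos_token,
# and labels mirror input_ids, so eos_token_id itself is supervised
def build_example(inp, instruction, output):
    text = inp + instruction + output + tokenizer.eos_token
    input_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return {"input_ids": input_ids, "labels": list(input_ids)}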

Expected results

Expected: the fine-tuned model produces text in the format presented in the training data.

What happened: the model does generate appropriate text, but it cannot stop generating.

Attempts to fix

  1. Checked that the generate function receives the right eos_token_id.
  2. Checked that the training procedure gets the right input_ids and labels, and made sure the eos_token_id is actually trained (a quick sketch of both checks follows this list).
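
For reference, the two checks roughly looked like the sketch below, assuming `model`, `tokenizer`, and one tokenized example `example` built as above; the variable names are illustrative.

# 1. the stop id that generate() receives
print(tokenizer.eos_token, tokenizer.eos_token_id)   # expected <|im_end|> / 151645
print(model.generation_config.eos_token_id)          # should match, or be passed to generate() explicitly

# 2. the eos token is present and supervised at the end of the sequence
print(example["input_ids"][-1] == tokenizer.eos_token_id)  # expected True
print(example["labels"][-1] == tokenizer.eos_token_id)     # expected True, i.e. not masked to -100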

Final reason

If I change the eos_token back to <|endoftext|>, the model will have the right behavior.

After careful inspection, I found the reason: Qwen2.5-7b-base has the same weights in lm_head and embedding for the additional tokens such as <|im_end|>, <|object_ref_start|>, and so on; only <|endoftext|> has its own distinct weights.

print("lm_head")
print("151643:"+str(model.lm_head.weight[151643]))
print("151644:"+str(model.lm_head.weight[151644]))
print("151645:"+str(model.lm_head.weight[151645]))
print("151646:"+str(model.lm_head.weight[151646]))
print("151647:"+str(model.lm_head.weight[151647]))
print("151648:"+str(model.lm_head.weight[151648]))
print("151649:"+str(model.lm_head.weight[151649]))
print("151650:"+str(model.lm_head.weight[151650]))
print("142333:"+str(model.lm_head.weight[142333]))
print("embedding")
print("151643:"+str(model.get_input_embeddings().weight[151643]))
print("151644:"+str(model.get_input_embeddings().weight[151644]))
print("151645:"+str(model.get_input_embeddings().weight[151645]))
print("151646:"+str(model.get_input_embeddings().weight[151646]))
print("151647:"+str(model.get_input_embeddings().weight[151647]))
print("151648:"+str(model.get_input_embeddings().weight[151648]))
print("151649:"+str(model.get_input_embeddings().weight[151649]))
print("151650:"+str(model.get_input_embeddings().weight[151650]))
print("142333:"+str(model.get_input_embeddings().weight[142333]))

The output:

qwen2_base_7b
lm_head
151643:tensor([-0.0025, -0.0061, -0.0063,  ..., -0.0042, -0.0118,  0.0019],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151644:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151645:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151646:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151647:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151648:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151649:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151650:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
142333:tensor([ 0.0005,  0.0091,  0.0034,  ...,  0.0020,  0.0002, -0.0011],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
embedding
151643:tensor([-0.0186,  0.0347,  0.0092,  ...,  0.0040, -0.0077,  0.0006],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151644:tensor([ 1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
        -1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151645:tensor([-1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
        -1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151646:tensor([-1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
         1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151647:tensor([ 1.1755e-37,  1.1755e-37, -1.1755e-37,  ...,  1.1755e-37,
         1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151648:tensor([ 1.1755e-37, -1.1755e-37,  1.1755e-37,  ...,  1.1755e-37,
        -1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151649:tensor([ 1.1755e-37, -1.1755e-37,  1.1755e-37,  ..., -1.1755e-37,
         1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
151650:tensor([-1.1755e-37, -1.1755e-37,  1.1755e-37,  ..., -1.1755e-37,
         1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)
142333:tensor([ 1.1755e-37, -1.1755e-37, -1.1755e-37,  ...,  1.1755e-37,
         1.1755e-37,  1.1755e-37], dtype=torch.bfloat16,
       grad_fn=<SelectBackward0>)

151643 is the id of <|endoftext|>, and 142333 is a random id I picked (maybe there are more ids like this); the other ids are defined in tokenizer_config.json.

This explains why the fine-tuned model cannot stop generating: although <|im_end|> (151645) is trained, its lm_head weight is identical to that of many other ids, so at the step where it should be generated its logit equals the logits of all those ids, and any of them can be picked during decoding. In fact, I observed that the logit for 151645 does increase when it should be generated, but so do the logits of the ids that share the same lm_head weight. This does not happen with 151643, because it has a distinct lm_head weight, presumably trained during the pre-training stage.
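
To quantify this, here is a quick sketch (my own check, not part of the original training code) that counts how many lm_head rows are bit-identical to the <|im_end|> row:

import torch

W = model.lm_head.weight.detach()
ref = W[151645]                      # <|im_end|>
same = (W == ref).all(dim=-1)        # rows identical to the <|im_end|> row
print(same.sum().item(), "lm_head rows are identical to id 151645")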

I am quite confused about why this happens, since according to the released code all lm_head and embedding weights are initialized from a normal distribution.
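
Judging from the printout above, the embedding rows of those extra ids all have magnitudes around 1.1755e-37, far smaller than anything a normal init with a typical std would produce, so they look more like placeholder values than sampled weights. A sketch to count how many rows look like that (the 1e-30 threshold is my own guess):

import torch

E = model.get_input_embeddings().weight.detach().float()
placeholder = E.abs().max(dim=-1).values < 1e-30   # rows holding only tiny sentinel-like values
print(placeholder.sum().item(), "embedding rows look like untrained placeholders")
print(placeholder[151643].item(), placeholder[151645].item(), placeholder[142333].item())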

hxs91 · Nov 08 '24 03:11