[Bug]: The final reason why you get a model that cannot stop generating when you fine-tune Qwen2.5-7b-base with LoRA and a non-<|endoftext|> token as eos_token.
Model Series
Qwen2.5
What are the models used?
Qwen2.5-7b-base
What is the scenario where the problem happened?
SFT with the Hugging Face Trainer
Is this a known issue?
- [X] I have followed the GitHub README.
- [X] I have checked the Qwen documentation and cannot find an answer there.
- [X] I have checked the documentation of the related framework and cannot find useful information.
- [X] I have searched the issues and there is not a similar one.
Information about environment
Doesn't matter.
Log output
Doesn't matter.
Description
Steps to reproduce
- Use Qwen2.5-7b-base
- Modify its eos_token from <|endoftext|> to <|im_end|> (or any other special token) in the tokenizer_config.json.
- Use LoRA to fine-tune the model on your downstream task; make sure your LoRA does not fine-tune lm_head or the embedding (see the sketch after this list).
- The fine-tuning data follow the input, instruction and output format; each sample is arranged as input + instruction + output + eos_token, and the labels are built the same way.
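For reference, here is a minimal sketch of this setup. It assumes peft's LoraConfig together with the Hugging Face stack; the target module list, LoRA hyperparameters, and data handling are illustrative placeholders rather than the exact configuration I used.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Same effect as editing eos_token in tokenizer_config.json:
tokenizer.eos_token = "<|im_end|>"

model = AutoModelForCausalLM.from_pretrained(model_name)
lora_config = LoraConfig(
    r=16,                       # illustrative hyperparameters
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # lm_head and embed_tokens are deliberately NOT adapted or saved,
    # so the output projection and embedding rows stay frozen.
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Each training sample is formatted as: input + instruction + output + tokenizer.eos_token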
Expected results
Expected: the fine-tuned model produces text as presented in the training data. What happened: the model does generate appropriate text, but it cannot stop generation.
Attempts to fix
- Checked that the generate function receives the right eos_token_id.
- Checked that the training procedure gets the right input_ids and labels, and that the eos_token_id is actually included in training (both checks are sketched below).
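These checks can be reproduced with something like the following sketch. It assumes model and tokenizer are the objects used for training, and batch is one already-tokenized training example (a hypothetical variable used only for illustration).

# Verify the eos id seen by generation and by the loss.
eos_id = tokenizer.eos_token_id
print(eos_id, tokenizer.convert_ids_to_tokens(eos_id))  # expect 151645 -> <|im_end|>
print(model.generation_config.eos_token_id)             # the id generate() will stop on

# batch: one tokenized training example (hypothetical)
print(batch["input_ids"][-1] == eos_id)  # the sequence ends with the eos token
print(batch["labels"][-1] == eos_id)     # the eos token is not masked out of the loss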
Final reason
If I change the eos_token back to <|endoftext|>, the model behaves correctly. After careful inspection, I found the reason: Qwen2.5-7b-base has identical weights in lm_head and the embedding for the additional tokens such as <|im_end|>, <|object_ref_start|> and so on, but not for <|endoftext|>.
print("lm_head")
print("151643:"+str(model.lm_head.weight[151643]))
print("151644:"+str(model.lm_head.weight[151644]))
print("151645:"+str(model.lm_head.weight[151645]))
print("151646:"+str(model.lm_head.weight[151646]))
print("151647:"+str(model.lm_head.weight[151647]))
print("151648:"+str(model.lm_head.weight[151648]))
print("151649:"+str(model.lm_head.weight[151649]))
print("151650:"+str(model.lm_head.weight[151650]))
print("142333:"+str(model.lm_head.weight[142333]))
print("embedding")
print("151643:"+str(model.get_input_embeddings().weight[151643]))
print("151644:"+str(model.get_input_embeddings().weight[151644]))
print("151645:"+str(model.get_input_embeddings().weight[151645]))
print("151646:"+str(model.get_input_embeddings().weight[151646]))
print("151647:"+str(model.get_input_embeddings().weight[151647]))
print("151648:"+str(model.get_input_embeddings().weight[151648]))
print("151649:"+str(model.get_input_embeddings().weight[151649]))
print("151650:"+str(model.get_input_embeddings().weight[151650]))
print("142333:"+str(model.get_input_embeddings().weight[142333]))
The output:
qwen2_base_7b
lm_head
151643:tensor([-0.0025, -0.0061, -0.0063, ..., -0.0042, -0.0118, 0.0019],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151644:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151645:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151646:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151647:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151648:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151649:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151650:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
142333:tensor([ 0.0005, 0.0091, 0.0034, ..., 0.0020, 0.0002, -0.0011],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
embedding
151643:tensor([-0.0186, 0.0347, 0.0092, ..., 0.0040, -0.0077, 0.0006],
dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
151644:tensor([ 1.1755e-37, -1.1755e-37, 1.1755e-37, ..., 1.1755e-37,
-1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151645:tensor([-1.1755e-37, -1.1755e-37, 1.1755e-37, ..., 1.1755e-37,
-1.1755e-37, 1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151646:tensor([-1.1755e-37, -1.1755e-37, 1.1755e-37, ..., 1.1755e-37,
1.1755e-37, 1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151647:tensor([ 1.1755e-37, 1.1755e-37, -1.1755e-37, ..., 1.1755e-37,
1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151648:tensor([ 1.1755e-37, -1.1755e-37, 1.1755e-37, ..., 1.1755e-37,
-1.1755e-37, 1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151649:tensor([ 1.1755e-37, -1.1755e-37, 1.1755e-37, ..., -1.1755e-37,
1.1755e-37, 1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151650:tensor([-1.1755e-37, -1.1755e-37, 1.1755e-37, ..., -1.1755e-37,
1.1755e-37, -1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
142333:tensor([ 1.1755e-37, -1.1755e-37, -1.1755e-37, ..., 1.1755e-37,
1.1755e-37, 1.1755e-37], dtype=torch.bfloat16,
grad_fn=<SelectBackward0>)
151643 is the id of <|endoftext|>, 142333 is a random id I picked (maybe there are more ids like this), and the other ids are defined in the tokenizer_config.json. This explains why the fine-tuned model cannot stop generation: although <|im_end|> (151645) is trained, its weight in lm_head is the same as that of many other ids, so when it should be generated at inference time its logit is equal to theirs, and any of those ids can be picked during decoding. In fact, I observe that the logit for 151645 does increase when it should be generated, but so do the logits of the ids that share the same lm_head weight. This does not happen for 151643, because it has a distinct lm_head weight, presumably trained during the pre-training stage.
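The claim that many ids share the <|im_end|> row can be checked directly with a short sketch like the one below (same model object as in the inspection code above); given the observation for 142333, the count should come out well above one.

import torch

lm_head_w = model.lm_head.weight.detach()
row_im_end = lm_head_w[151645]  # the <|im_end|> row

# Count rows of lm_head that are bit-identical to the <|im_end|> row.
num_identical = (lm_head_w == row_im_end).all(dim=-1).sum().item()
print("lm_head rows identical to <|im_end|>:", num_identical)

# <|endoftext|> (151643) has its own, distinct row.
print("<|endoftext|> identical to <|im_end|>:", torch.equal(lm_head_w[151643], row_im_end))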
I am quite confused why this happens, since all lm_head and embedding weights are supposed to be initialized from a normal distribution according to the released code.