Baichuan2
Training loss becomes 0 after expanding the vocabulary
I expanded the vocabulary following the hints in #49, but after a number of training steps the loss drops straight to 0. The exact steps I took:
- Add the new tokens with tokenizer.add_tokens, then call save_pretrained; the resulting vocabulary size is NEW_VOCAB_SIZE (a minimal sketch of this step is included at the end of this post).
- Replace vocab_size in the original model's config.json with NEW_VOCAB_SIZE.
- Replace the original model's tokenizer.model and tokenizer_config.json.
- Replace lm_head.weight:
import os
import torch

# load the checkpoint shard that contains lm_head.weight
model = torch.load(os.path.join(MODEL_DIR, "pytorch_model-00003-of-00003.bin"))
lm_head_w = model['lm_head.weight']
# HIDDEN_SIZE = 5120
new_lm_head_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_lm_head_w[:NEW_VOCAB_SIZE] = lm_head_w
model['lm_head.weight'] = new_lm_head_w
torch.save(model, os.path.join(NEW_MODEL_DIR, "pytorch_model-00003-of-00003.bin"))
- Replace model.embed_tokens.weight (without this, training fails with a shape-mismatch error):
# load the checkpoint shard that contains model.embed_tokens.weight
model = torch.load(os.path.join(MODEL_DIR, "pytorch_model-00001-of-00003.bin"))
embed_tokens_w = model['model.embed_tokens.weight']
# HIDDEN_SIZE = 5120
new_embed_tokens_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_embed_tokens_w[:NEW_VOCAB_SIZE] = embed_tokens_w
model['model.embed_tokens.weight'] = new_embed_tokens_w
torch.save(model, os.path.join(NEW_MODEL_DIR, "pytorch_model-00001-of-00003.bin"))
- Swap in the modified pytorch_model-00001-of-00003.bin and pytorch_model-00003-of-00003.bin files.
- Training then starts normally and the first few steps log a reasonable loss, but after that the loss drops to 0 and stays there.
Any advice on how to fix this would be appreciated, thanks!
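For reference, a minimal sketch of the tokenizer-expansion step above (the token strings are placeholders; assumes the Hugging Face transformers API with Baichuan2's custom tokenizer class):
from transformers import AutoTokenizer

# load the original Baichuan2 tokenizer (trust_remote_code is needed for the custom tokenizer class)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
OLD_VOCAB_SIZE = len(tokenizer)

# add the new tokens, then save the expanded tokenizer
tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])  # placeholder tokens
NEW_VOCAB_SIZE = len(tokenizer)
tokenizer.save_pretrained(NEW_MODEL_DIR)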
@mmmans The main issue is that NormHead's type doesn't match, so model.resize_token_embeddings() can't be called directly. Following the hints in #49 I edited the weight files directly; the model loads fine, but training goes wrong as described above. Any guidance would be appreciated, thanks!
new_embed_tokens_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_embed_tokens_w[:NEW_VOCAB_SIZE] = embed_tokens_w
Your code initializes the new embedding and head rows to zero, which may cause optimization problems. A more reasonable approach would be:
new_embed_tokens_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_embed_tokens_w[:OLD_VOCAB_SIZE] = embed_tokens_w
new_embed_tokens_w[OLD_VOCAB_SIZE:NEW_VOCAB_SIZE] = embed_tokens_w.mean(dim=0,keepdim=True)
or
new_embed_tokens_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_embed_tokens_w[:OLD_VOCAB_SIZE] = embed_tokens_w
new_embed_tokens_w[OLD_VOCAB_SIZE:NEW_VOCAB_SIZE] = torch.randn(....) # the default init policy
Btw, can you check whether the logits collapse to zero?
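A quick way to spot-check this (a sketch, assuming a Hugging Face-style causal LM already loaded as model and a tokenized batch from the training data):
import torch

model.eval()
with torch.no_grad():
    out = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
logits = out.logits  # shape [batch, seq_len, NEW_VOCAB_SIZE]
print("max |logit|:", logits.abs().max().item(), "mean |logit|:", logits.abs().mean().item())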
Can you provide the logits that lead to the zero loss?
😊 thx, I will try them and bring the results back here asap.
Added the following lines:
new_embed_tokens_w[OLD_VOCAB_SIZE:NEW_VOCAB_SIZE] = embed_tokens_w.mean(dim=0,keepdim=True)
new_lm_head_w[OLD_VOCAB_SIZE:NEW_VOCAB_SIZE] = lm_head_w.mean(dim=0, keepdim=True)
and now everything looks good after about 0.1 epoch. Will run some inference after training.
I am just curious how you got a 0 loss: a cross-entropy loss of -\sum y_i \log(p_i) = 0 means the prediction is one-hot and correct. So if you have time to reproduce the 0 loss, I'd be happy to see what happened behind it.
got good news and bad news. the good news: after expanding the tokenizer and altering the model bin files accordingly, training and inference both work fine. the training tool I am using is LLama Efficient Tuning; I've run the pt, sft, and dpo stages with the LoRA method and nothing abnormal happened.
the bad news: after training on a corpus that includes the new tokens, the altered model seems to know nothing about them. I tried adding background text describing the new tokens to the dialog context, but the model simply ignores them, as if they were negative prompts.
three factors I am guessing at:
- the new rows in model.embed_tokens.weight and lm_head.weight should be initialized with some other values, since the model now seems to avoid the new tokens.
- the LoRA fine-tuning method doesn't update those new token rows (see the sketch after this list). I will check out full-parameter tuning later.
- wondering how much text I should provide in the pretrain stage, and how many QA pairs in the sft stage, for the new tokens to be activated. it's just weird that the new tokens seem to have a negative effect.
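On the LoRA point, one common workaround (a sketch assuming the PEFT library; the module names follow Baichuan's naming and should be double-checked) is to list the resized embedding and head under modules_to_save so their full weights are updated alongside the LoRA adapters:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["W_pack", "o_proj"],          # attention projections (Baichuan naming, verify)
    modules_to_save=["embed_tokens", "lm_head"],  # train the resized embedding and head in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()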
sorry, the zero loss is probably my fault .... the first time around I had mistakenly written
new_lm_head_w[:NEW_VOCAB_SIZE] = lm_head_w
instead of
new_lm_head_w[:OLD_VOCAB_SIZE] = lm_head_w
When I corrected this line but still left the new rows all zeros, the loss turned out to be a non-zero constant (maybe 11.xx as far as I remember). @mmmans
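For what it's worth, a constant loss around 11.x is consistent with the model predicting a near-uniform distribution over the vocabulary, since the cross-entropy of a uniform prediction over V tokens is ln V (quick sanity check below; Baichuan2's base vocabulary is 125,696 tokens, so NEW_VOCAB_SIZE is only slightly larger):
import math

V = 125696  # Baichuan2 base vocab size; the expanded vocab is slightly larger
print(math.log(V))  # ≈ 11.74, close to the reported 11.xx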
Thanks, I solved it by following the OP's approach. Much appreciated!
Have you noticed that the newly added tokens come out with extra spaces in the inference results?
I just checked: the special tokens I predict do come with extra spaces, but it doesn't seem to affect my experimental results.
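If the extra spaces come from decoding rather than from the model itself, it may help to decode with spaces_between_special_tokens=False (a sketch; this flag exists on the slow tokenizers in transformers, though whether Baichuan's custom tokenizer honors it needs checking):
# tokenizer and output_ids are assumed to come from your existing inference code
text = tokenizer.decode(output_ids, skip_special_tokens=False, spaces_between_special_tokens=False)
print(text)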
@rogerus I'm also adding new tokens this way, but it seems that LoRA fine-tuning doesn't train the new tokens at all. Have you tried full-parameter fine-tuning?
Initializing the new rows with the mean causes a problem: when there are multiple special tokens, there is no way to tell them apart, because their embed and lm_head rows are exactly identical.
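One way to keep the mean initialization while still making the new tokens distinguishable from each other is to add a small random perturbation per row (a sketch; variable names follow the snippets above, and the noise scale is an assumption that may need tuning):
n_new = NEW_VOCAB_SIZE - OLD_VOCAB_SIZE
noise_std = 0.01  # small relative to typical embedding norms (assumption)
new_embed_tokens_w[OLD_VOCAB_SIZE:NEW_VOCAB_SIZE] = embed_tokens_w.mean(dim=0, keepdim=True) + noise_std * torch.randn(n_new, HIDDEN_SIZE)
new_lm_head_w[OLD_VOCAB_SIZE:NEW_VOCAB_SIZE] = lm_head_w.mean(dim=0, keepdim=True) + noise_std * torch.randn(n_new, HIDDEN_SIZE)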
I saved the new tokenizer and model weights with the method below, and after training normally I found that generation stops as soon as a newly added special token is produced. Has anyone else run into this?
""" Resize tokenizer and embedding. """
import torch

special_tokens_dict = {'additional_special_tokens': ['<answer>', '<rationale>']}
OLD_VOCAB_SIZE = len(tokenizer)
print("------OLD_VOCAB_SIZE------", len(tokenizer))
tokenizer.add_special_tokens(special_tokens_dict)
NEW_VOCAB_SIZE = len(tokenizer)
print("------NEW_VOCAB_SIZE------", len(tokenizer))
tokenizer.save_pretrained(save_directory)
HIDDEN_SIZE = model.config.hidden_size
lm_head_w = model.get_output_embeddings().weight.data
print(lm_head_w.shape)
new_lm_head_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_lm_head_w[:OLD_VOCAB_SIZE] = lm_head_w
new_lm_head_w[OLD_VOCAB_SIZE: NEW_VOCAB_SIZE] = lm_head_w.mean(dim=0, keepdim=True)
model.get_output_embeddings().weight.data = new_lm_head_w
embed_tokens_w = model.get_input_embeddings().weight.data
print(embed_tokens_w.shape)
new_embed_tokens_w = torch.zeros([NEW_VOCAB_SIZE, HIDDEN_SIZE])
new_embed_tokens_w[:OLD_VOCAB_SIZE] = embed_tokens_w
new_embed_tokens_w[OLD_VOCAB_SIZE: NEW_VOCAB_SIZE] = embed_tokens_w.mean(dim=0,keepdim=True)
model.get_input_embeddings().weight.data = new_embed_tokens_w
model.save_pretrained(save_directory)
Some inference engines have a stop-on-special-token option; check whether that setting is enabled.
thx, I found the cause; it was indeed an issue in the code:
gen_kwargs["eos_token_id"] = [tokenizer.eos_token_id] + tokenizer.additional_special_tokens_ids
Quick question: after adding new tokens with tokenizer.add_tokens, I noticed that tokenizer.vocab_size doesn't change. Does that mean the tokens were added successfully? And can the embedding matrix then be resized according to len(tokenizer)?
tokenizer.vocab_size still reports the value from the original config file. Just print len(tokenizer) before and after adding; if it changes, the tokens were added.
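A quick check (a sketch; the model path and token strings are placeholders):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, trust_remote_code=True)
print(tokenizer.vocab_size, len(tokenizer))  # vocab_size comes from the config
tokenizer.add_tokens(["<new_token_1>", "<new_token_2>"])
print(tokenizer.vocab_size, len(tokenizer))  # vocab_size is unchanged, len(tokenizer) has grown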