FlagEmbedding: training warns “Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.”, and inference fails because the tokenizer cannot be found

Open blue-vision0 opened this issue 1 year ago • 3 comments

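During training I already get the warning “Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained”.

At inference, loading the tokenizer with from_pretrained raises an OSError because a file cannot be found. After some digging, I found that the tokenizer_config.json saved with the fine-tuned model differs from the base model's tokenizer_config.json, and it appears to depend on the original base model's tokenizer_config.json file. I train and deploy on different machines, so that file does not exist on the deployment machine and loading fails. The model only works once I put the base model's file at the corresponding location.

This looks like a bug. When I trained with an earlier version of the bge script I did not have this problem, and the tokenizer_config file was not changed.

Library versions: sentence-transformers 2.2.2, transformers 4.34.0, FlagEmbedding 1.1.8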

Jan 23 '24 09:01 blue-vision0

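There are two ways to fix the tokenizer-not-found error at inference:

1. Replace the tokenizer_config.json in the fine-tuned model directory with the tokenizer_config.json from the base model.
2. Look up the tokenizer_file location recorded in the fine-tuned model's tokenizer_config.json, then put the base model's tokenizer_config.json at that location.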
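
A minimal sketch of the first workaround, using only the Python standard library. The function names and the directory arguments are hypothetical and not part of FlagEmbedding or transformers; the sketch assumes both directories contain a tokenizer_config.json and that the problematic entry is the `tokenizer_file` key mentioned above.

```python
import json
import shutil
from pathlib import Path


def find_missing_tokenizer_file(model_dir):
    """Return the path recorded under `tokenizer_file` if it does not exist
    on this machine, otherwise None. (Hypothetical helper.)"""
    config_path = Path(model_dir) / "tokenizer_config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    recorded = config.get("tokenizer_file")
    if recorded and not Path(recorded).exists():
        return recorded
    return None


def restore_base_tokenizer_config(base_dir, finetuned_dir):
    """Overwrite the fine-tuned tokenizer_config.json with the base model's
    copy, keeping a .bak backup of the fine-tuned one. (Hypothetical helper.)"""
    src = Path(base_dir) / "tokenizer_config.json"
    dst = Path(finetuned_dir) / "tokenizer_config.json"
    shutil.copyfile(dst, dst.with_name(dst.name + ".bak"))  # back up first
    shutil.copyfile(src, dst)
    return dst
```

Running `find_missing_tokenizer_file` on the deployment machine before loading the model shows whether the saved config points at a path that only exists on the training machine.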

Jan 23 '24 10:01 blue-vision0

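A question for the maintainers: does this count as a bug, and is there a fix? My earlier training runs never printed “Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained”, and never ran into these follow-on problems.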

Jan 23 '24 10:01 blue-vision0

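The warning is caused by the transformers version: transformers automatically adds some entries to tokenizer_config.json. The tokenizer_file entry is introduced by transformers 4.34. Upgrading to 4.35 resolves the problem, or you can downgrade to 4.33, which does not add anything.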
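
A small sketch for checking whether the training environment is on the affected release line before fine-tuning. The helper names are hypothetical, and the rule "4.34.x is affected" is taken only from the reply above, not from the transformers changelog.

```python
from importlib.metadata import PackageNotFoundError, version


def transformers_adds_tokenizer_file(version_string):
    """True if the version is in the 4.34.x line named in this thread."""
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return (major, minor) == (4, 34)


def installed_transformers_is_affected():
    """Check the installed transformers package; None if it is not installed."""
    try:
        return transformers_adds_tokenizer_file(version("transformers"))
    except PackageNotFoundError:
        return None
```

If `installed_transformers_is_affected()` returns True, upgrading to 4.35 or downgrading to 4.33 before training follows the advice given in this thread.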

Jan 23 '24 13:01 staoxiao
