Describe the Question
After continued pre-training on medical data, I used merge_peft_adapter.py to merge the trained model with llama-7b, and the following error occurred.
Describe your attempts
Traceback (most recent call last):
  File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 102, in <module>
    main()
  File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 79, in main
    tokenizer = tokenizer_class.from_pretrained(peft_model_path, trust_remote_code=True)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
    raise ValueError(
ValueError: Non-consecutive added token '' found. Should have index 32000 but has index 39408 in saved vocabulary.
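This error suggests that something went wrong when mapping vocabulary tokens to indices.
The message says that the saved vocabulary contains an added token named "" with index 39408, whereas the tokenizer expected it at index 32000, the first index after the 32000-token base LLaMA vocabulary.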
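To narrow this down, it may help to look at the added-token indices saved next to the adapter. Below is a minimal sketch of the kind of consecutiveness check the error message describes. It is not the actual transformers implementation. It assumes the added tokens are stored in an `added_tokens.json` file (a token-to-index mapping) inside the PEFT output directory and that the base vocabulary has 32000 entries. The function names and the `<tok>` token are hypothetical and used only for illustration.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Tuple


def first_non_consecutive(
    added_tokens: Dict[str, int], base_vocab_size: int
) -> Optional[Tuple[str, int, int]]:
    """Return (token, expected_index, actual_index) for the first added token
    whose index does not directly follow the previous one, or None if all
    added tokens are consecutive starting at base_vocab_size."""
    expected = base_vocab_size
    # Walk the added tokens in index order, as the error message implies.
    for token, index in sorted(added_tokens.items(), key=lambda kv: kv[1]):
        if index != expected:
            return token, expected, index
        expected += 1
    return None


def check_added_tokens_file(
    model_dir: str, base_vocab_size: int = 32000
) -> Optional[Tuple[str, int, int]]:
    """Assumption: added tokens live in <model_dir>/added_tokens.json.
    Returns None if the file is missing or the indices are consecutive."""
    path = Path(model_dir) / "added_tokens.json"
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        return first_non_consecutive(json.load(f), base_vocab_size)


if __name__ == "__main__":
    # Hypothetical token name; the indices are the ones from the error above.
    print(first_non_consecutive({"<tok>": 39408}, 32000))
```

If the check reports a mismatch, one possible explanation is that the tokenizer saved with the adapter does not match the base model being merged, for example a tokenizer with an extended vocabulary paired with the original llama-7b tokenizer files. That is only a guess from the index values and would need to be confirmed against the actual files.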