Vocabulary mapping problem when merging with merge_peft_adapter.py

Open nlper-hou opened this issue 1 year ago • 2 comments

Describe the Question

After continued pre-training on medical data, I used merge_peft_adapter.py to merge the trained model with llama-7b and ran into the problem below.

Describe your attempts

Traceback (most recent call last):
  File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 102, in <module>
    main()
  File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 79, in main
    tokenizer = tokenizer_class.from_pretrained(peft_model_path, trust_remote_code=True)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
    raise ValueError(
ValueError: Non-consecutive added token '' found. Should have index 32000 but has index 39408 in saved vocabulary.

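This error appears to come from the mapping of vocabulary tokens to indices. The message says that an added token named '' (the name is shown as empty) was assigned index 39408 in the saved vocabulary, rather than the expected index 32000.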
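To make the condition in the error message concrete, here is a minimal sketch of a consecutiveness check on added tokens. It is plain Python and is not the actual transformers implementation. The helper name `find_non_consecutive` and the token name `<hypothetical_token>` are made up for illustration; only the two indices, 32000 and 39408, come from the traceback.

```python
def find_non_consecutive(added_tokens, base_vocab_size):
    """Return (token, expected_index, actual_index) for the first added token
    whose index breaks the consecutive sequence, or None if all are consecutive.

    added_tokens: dict mapping token string -> index (hypothetical input shape)
    base_vocab_size: size of the base vocabulary; added tokens are assumed
                     to start at this index and increase by one each time
    """
    expected = base_vocab_size
    # Walk the added tokens in index order and compare with the expected index.
    for token, index in sorted(added_tokens.items(), key=lambda kv: kv[1]):
        if index != expected:
            return token, expected, index
        expected += 1
    return None


# Illustrative data only: the token name is invented, the indices are the
# ones reported in the error message.
added = {"<hypothetical_token>": 39408}
print(find_non_consecutive(added, base_vocab_size=32000))
```

A result other than `None` would suggest that the vocabulary saved alongside the adapter is larger than the 32000-entry base vocabulary the tokenizer class expects, for example because the tokenizer used during training had an extended vocabulary. That reading is an assumption and should be checked against the files in the adapter directory.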

nlper-hou, Jun 16 '23 08:06