MedicalGPT: Vocabulary mapping problem when merging with merge_peft_adapter.py

Open nlper-hou opened this issue 1 year ago • 2 comments

Describe the Question

After continued pretraining on medical data, I used merge_peft_adapter.py to merge the trained model with llama-7b and ran into the problem below.

Describe your attempts

Traceback (most recent call last):
  File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 102, in <module>
    main()
  File "/root/nas/llm-prompt/MedicalGPT-main/scripts/merge_peft_adapter.py", line 79, in main
    tokenizer = tokenizer_class.from_pretrained(peft_model_path, trust_remote_code=True)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1825, in from_pretrained
    return cls._from_pretrained(
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2044, in _from_pretrained
    raise ValueError(
ValueError: Non-consecutive added token '' found. Should have index 32000 but has index 39408 in saved vocabulary.

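This error seems to come from the token-to-index mapping in the vocabulary. The message says that an added token named "" was assigned index 39408 in the saved vocabulary, rather than the expected index 32000.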
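To make the failing check concrete, here is a minimal sketch of the kind of consecutiveness test the error message describes. It is not the transformers implementation: the helper names `load_added_tokens` and `first_non_consecutive` are hypothetical, and it assumes the adapter directory contains an `added_tokens.json` file mapping each added token to its index. The token name `<pad>` in the usage lines is only a placeholder, since the real name is blank in the traceback.

```python
import json
from pathlib import Path


def load_added_tokens(tokenizer_dir):
    """Read added_tokens.json (token -> index) from a saved tokenizer directory.

    Assumption: the file exists; not every tokenizer saves one.
    """
    path = Path(tokenizer_dir) / "added_tokens.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def first_non_consecutive(added_tokens, base_vocab_size):
    """Return (token, expected_index, actual_index) for the first added token
    whose index does not continue directly from the base vocabulary,
    or None if all added tokens are consecutive."""
    expected = base_vocab_size
    for token, index in sorted(added_tokens.items(), key=lambda kv: kv[1]):
        if index != expected:
            return token, expected, index
        expected += 1
    return None


# Reproduce the numbers from the error message with a placeholder token name.
mismatch = first_non_consecutive({"<pad>": 39408}, base_vocab_size=32000)
print(mismatch)

# Added tokens that continue directly after the base vocabulary pass the check.
ok = first_non_consecutive({"<a>": 32000, "<b>": 32001}, base_vocab_size=32000)
print(ok)
```

Running `first_non_consecutive(load_added_tokens(peft_model_path), 32000)` against the adapter directory should show which token is out of place, assuming the base llama-7b tokenizer has 32000 entries as the error message implies.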

Jun 16 '23 08:06 nlper-hou
