BioGPT
Unable to convert BioGpt slow tokenizer to fast: token out of vocabulary
Hi @themanojkumar ,
I was trying to fine-tune the BioGpt model on my QA task. I would like to use a fast tokenizer so that I can use offsets_mapping to know which words the tokens originate from. Unfortunately, when creating a BiogptTokenizerFast from PreTrainedTokenizerFast via convert_slow_tokenizer, the following error occurs: Error while initializing BPE: Token `-@</w>` out of vocabulary.
According to this issue https://github.com/huggingface/transformers/issues/9290, the problem might be caused by tokens that appear in the merges file but are missing from the vocabulary. Could you please check it? Thank you very much!
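To check whether this is indeed the missing-token case from issue #9290, the merges can be validated against the vocabulary before conversion. The helper below is a hypothetical, stdlib-only sketch of the consistency check the BPE initializer performs (every token referenced by a merge rule must exist in the vocab), not the actual tokenizers internals; the toy vocab and merges are made up for illustration:

```python
def find_missing_merge_tokens(vocab, merges):
    """Return merge-pair tokens that a BPE model would reject
    because they are absent from the vocabulary."""
    missing = []
    for left, right in merges:
        for token in (left, right):
            if token not in vocab and token not in missing:
                missing.append(token)
    return missing

# Toy example: "-@</w>" appears in a merge rule but not in the
# vocabulary, which is exactly the situation the error reports.
vocab = {"hel": 0, "lo</w>": 1, "hello</w>": 2}
merges = [("hel", "lo</w>"), ("hel", "-@</w>")]
print(find_missing_merge_tokens(vocab, merges))  # ['-@</w>']
```

Running the same check over the model's actual vocab.json and merges.txt would show which tokens (such as `-@</w>`) need to be added to the vocabulary for the conversion to succeed.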
Environment
transformers version: 4.25.0
Error trace
Traceback (most recent call last):
File "run.py", line 124, in <module>
trainer, predict_dataset = get_trainer(args)
File "***/tasks/qa/get_trainer.py", line 31, in get_trainer
tokenizer = BioGptTokenizerFast.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained
return cls._from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "***/model/biogpt/tokenization_biogpt_fast.py", line 117, in __init__
super().__init__(
File "***/model/biogpt/tokenization_utils_fast.py", line 114, in __init__
fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
File "***/model/biogpt/convert_slow_tokenizer.py", line 1198, in convert_slow_tokenizer
return converter_class(transformer_tokenizer).converted()
File "***/model/biogpt/convert_slow_tokenizer.py", line 273, in converted
BPE(
Exception: Error while initializing BPE: Token `-@</w>` out of vocabulary
Colab code for reproduction:
https://colab.research.google.com/drive/1IMhiDz45GiarBLgXG9B2rA_u0ZOmmjJS?usp=sharing
I am also facing the same problem. Do you have any update?