simpletransformers
sentencepiece tokenizer issue
Hello,
I was exploring some NLP problems with simpletransformers package. It looks like there is an issue with sentencepiece tokenizer while using T5 and ALBERT models.
Environment:
python/3.7.4
cuda/102/toolkit/10.2.89
cudnn/7.6.5/cuda102
sentencepiece==0.1.91 (issue persists for sentencepiece==0.1.94 as well)
simpletransformers==0.50.0
torch==1.7.0
torchvision==0.8.1
transformers==4.0.0
tokenizers==0.9.4
I followed the link below for a question-answering problem.
Link: https://towardsdatascience.com/question-answering-with-bert-xlnet-xlm-and-distilbert-using-simple-transformers-4d8785ee762a
The issue starts with the line below:
model = QuestionAnsweringModel('albert', '/work/user_id/Research/models/transformers/albert-large-v2', args=train_args)
Traceback (most recent call last):
File "main.py", line 45, in
I followed the link below for a question-generation problem.
https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c
The issue starts with the line below:
model = T5Model('/work/user_id/Research/models/transformers/t5-large', args=model_args)
Traceback (most recent call last):
File "
Can anyone guide me in solving this issue? Thanks.
Does this happen for newly trained models as well? Newly trained as in, if you train a model with the latest library versions and try to load it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I have the same issue when loading T5:
TypeError Traceback (most recent call last)
<ipython-input-15-3b5bde898df7> in <module>
----> 1 tokenizer = T5Tokenizer.from_pretrained('./t5-base/')
2 model = T5forDrop.from_pretrained('./t5-base')#,return_dict=True)
C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1731
1732 return cls._from_pretrained(
-> 1733 resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
1734 )
1735
C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\transformers\tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
1848 # Instantiate tokenizer.
1849 try:
-> 1850 tokenizer = cls(*init_inputs, **init_kwargs)
1851 except OSError:
1852 raise OSError(
C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\transformers\models\t5\tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
146
147 self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 148 self.sp_model.Load(vocab_file)
149
150 @property
C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\sentencepiece\__init__.py in Load(self, model_file, model_proto)
365 if model_proto:
366 return self.LoadFromSerializedProto(model_proto)
--> 367 return self.LoadFromFile(model_file)
368
369
C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\sentencepiece\__init__.py in LoadFromFile(self, arg)
169
170 def LoadFromFile(self, arg):
--> 171 return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
172
173 def DecodeIdsWithCheck(self, ids):
TypeError: not a string
Same error, with a recent model too.
Same error,
self.sp_model.Load(sp_model_path)
File "/Users/lipf/anaconda3/lib/python3.7/site-packages/sentencepiece/__init__.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/Users/lipf/anaconda3/lib/python3.7/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
Finally, I solved this case myself.
The idea is that sentencepiece validates that the parameter sp_model_path
must be a string. However, sp_model_path
is a PosixPath
here. Therefore, I changed the code from
self.sp_model.Load(sp_model_path)
to
self.sp_model.Load(sp_model_path.as_posix())
and it works.
Coming back to this issue, I advise checking the code in 'site-packages/transformers/models/t5/tokenization_t5.py' to find out whether vocab_file
is built with pathlib.Path,
which would make it a non-string.
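A user-side workaround that avoids patching the library is to coerce any pathlib.Path to a plain string before it reaches sentencepiece. A minimal sketch, assuming the failure is exactly this Path-vs-string mismatch (the helper name ensure_str_path is mine, not part of any library):

```python
from pathlib import Path

def ensure_str_path(p):
    """Return a plain string path.

    sentencepiece's Load() raises `TypeError: not a string` when
    handed a pathlib.Path, so coerce Path objects to str first.
    """
    return p.as_posix() if isinstance(p, Path) else p
```

For example, pass ensure_str_path(model_dir) to from_pretrained instead of the raw Path object.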
File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/models/t5/tokenization_t5.py", line 139, in __init__
self.sp_model.Load(vocab_file)
+1
>>> tokenizer = LlamaTokenizer.from_pretrained("/home/root1/models/llama-7b-hf/")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1805, in from_pretrained
return cls._from_pretrained(
File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 71, in __init__
self.sp_model.Load(vocab_file)
File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/sentencepiece/__init__.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
Same error when trying the LLaMA tokenizer.
By inserting some print statements into the source code, I realized that it expects a SentencePiece model file. But some models do not ship one — only tokenizer.json, tokenizer_config.json, and special_tokens_map.json ... Looking forward to suggestions on how to handle this.
I had the same error when I downloaded only part of the model; once I downloaded all the files in the directory, it was solved.
Same error here, even though I am not missing any files in the model path.
Same error when loading a T5 model.
Same error loading T5. Has anyone solved this?
Is this happening when loading a T5 model with Simple Transformers?
e.g.:
from simpletransformers.t5 import T5Model
model = T5Model("t5", "t5-base")
Hi everyone, I got the same error before, but it turned out I had forgotten to download tokenizer.model
(it is a Git LFS file, so older versions of git may skip it during git clone
), so I suppose anyone with an incomplete download may encounter the same error. Make sure the file exists and the md5 checksum matches.
To anyone who still has this issue: for LLaMA, you need to rename your vocab file to tokenizer.model,
because that is the default name the tokenizer searches for.
Solved this question: use AutoTokenizer.from_pretrained(model_id).