
sentencepiece tokenizer issue

Open mosharafhossain opened this issue 4 years ago • 17 comments

Hello,

I was exploring some NLP problems with the simpletransformers package. It looks like there is an issue with the sentencepiece tokenizer when using T5 and ALBERT models.

Environment:
python/3.7.4
cuda/102/toolkit/10.2.89
cudnn/7.6.5/cuda102
sentencepiece==0.1.91 (issue persists with sentencepiece==0.1.94 as well)
simpletransformers==0.50.0
torch==1.7.0
torchvision==0.8.1
transformers==4.0.0
tokenizers==0.9.4

I followed the link below for a question-answering problem: https://towardsdatascience.com/question-answering-with-bert-xlnet-xlm-and-distilbert-using-simple-transformers-4d8785ee762a

The issue starts at the line below:

model = QuestionAnsweringModel('albert', '/work/user_id/Research/models/transformers/albert-large-v2', args=train_args)

Traceback (most recent call last):
  File "main.py", line 45, in <module>
    model = QuestionAnsweringModel('albert', '/work/user_id/Research/models/transformers/albert-large-v2', args=train_args)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/simpletransformers/question_answering/question_answering_model.py", line 188, in __init__
    self.tokenizer = tokenizer_class.from_pretrained(model_name, do_lower_case=self.args.do_lower_case, **kwargs)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1771, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1843, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/models/albert/tokenization_albert.py", line 149, in __init__
    self.sp_model.Load(vocab_file)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/sentencepiece.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/sentencepiece.py", line 177, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

I followed the link below for a question-generation problem: https://towardsdatascience.com/asking-the-right-questions-training-a-t5-transformer-model-on-a-new-task-691ebba2d72c

The issue starts at the line below:

model = T5Model('/work/user_id/Research/models/transformers/t5-large', args=model_args)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/simpletransformers/t5/t5_model.py", line 107, in __init__
    self.tokenizer = T5Tokenizer.from_pretrained(model_name, truncate=True)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1771, in from_pretrained
    resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1843, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/models/t5/tokenization_t5.py", line 139, in __init__
    self.sp_model.Load(vocab_file)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/sentencepiece.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/sentencepiece.py", line 177, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Can anyone guide me to solve this issue? Thanks.

mosharafhossain avatar Dec 02 '20 21:12 mosharafhossain

Does this happen for newly trained models as well? Newly trained as in, if you train a model with the latest library versions and try to load it.

ThilinaRajapakse avatar Dec 09 '20 14:12 ThilinaRajapakse

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Feb 07 '21 18:02 stale[bot]

I have the same issue when loading T5:

TypeError                                 Traceback (most recent call last)
<ipython-input-15-3b5bde898df7> in <module>
----> 1 tokenizer = T5Tokenizer.from_pretrained('./t5-base/')
      2 model = T5forDrop.from_pretrained('./t5-base')#,return_dict=True)

C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1731 
   1732         return cls._from_pretrained(
-> 1733             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1734         )
   1735 

C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\transformers\tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1848         # Instantiate tokenizer.
   1849         try:
-> 1850             tokenizer = cls(*init_inputs, **init_kwargs)
   1851         except OSError:
   1852             raise OSError(

C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\transformers\models\t5\tokenization_t5.py in __init__(self, vocab_file, eos_token, unk_token, pad_token, extra_ids, additional_special_tokens, sp_model_kwargs, **kwargs)
    146 
    147         self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
--> 148         self.sp_model.Load(vocab_file)
    149 
    150     @property

C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\sentencepiece\__init__.py in Load(self, model_file, model_proto)
    365       if model_proto:
    366         return self.LoadFromSerializedProto(model_proto)
--> 367       return self.LoadFromFile(model_file)
    368 
    369 

C:\Apps\Anaconda3\v3_8_5_x64\Local\envs\protocol_prediction\lib\site-packages\sentencepiece\__init__.py in LoadFromFile(self, arg)
    169 
    170     def LoadFromFile(self, arg):
--> 171         return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
    172 
    173     def DecodeIdsWithCheck(self, ids):

TypeError: not a string

phillipshaong avatar Jul 29 '21 01:07 phillipshaong

Same error, with a recent model too.

Hugo0 avatar Aug 23 '21 07:08 Hugo0

Same error,

    self.sp_model.Load(sp_model_path)
  File "/Users/lipf/anaconda3/lib/python3.7/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/Users/lipf/anaconda3/lib/python3.7/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

l-i-p-f avatar May 02 '22 13:05 l-i-p-f

Finally, I solved this myself.

The cause is that sentencepiece requires the sp_model_path argument to be a string, but here sp_model_path is a pathlib PosixPath. So I changed the code from

self.sp_model.Load(sp_model_path)

to

self.sp_model.Load(sp_model_path.as_posix())

It works.
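A minimal standalone sketch of the same workaround, assuming your own code hands SentencePiece a pathlib.Path (the path below is illustrative):

from pathlib import Path
import sentencepiece as spm

# Illustrative location of a SentencePiece model file; adjust to your checkpoint.
sp_model_path = Path("./t5-base/spiece.model")

sp_model = spm.SentencePieceProcessor()
# Load() rejects pathlib.Path objects ("TypeError: not a string"),
# so convert the path to a plain string first.
sp_model.Load(str(sp_model_path))  # equivalent to sp_model_path.as_posix() on POSIX systems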

Coming back to this issue, I advise checking the code in 'site-packages/transformers/models/t5/tokenization_t5.py' to find out whether vocab_file is a pathlib.Path (or otherwise not a string):

File "/work/user_id/Softwares/Installed_softwares/Python/virtual_envs/python37/simpletransformers/lib/python3.7/site-packages/transformers/models/t5/tokenization_t5.py", line 139, in init
self.sp_model.Load(vocab_file)

l-i-p-f avatar May 02 '22 13:05 l-i-p-f

+1

>>> tokenizer = LlamaTokenizer.from_pretrained("/home/root1/models/llama-7b-hf/")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1805, in from_pretrained
    return cls._from_pretrained(
  File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 71, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/root1/software/miniconda3/envs/model_eval/lib/python3.9/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

liuxiaocs7 avatar Mar 27 '23 10:03 liuxiaocs7

Same error when trying the LLaMA tokenizer.

ReconIII avatar Apr 02 '23 21:04 ReconIII

By inserting some prints into the source code, I realized that it is expecting a sentencepiece model file. But some models do not ship one; they only have tokenizer.json, tokenizer_config.json and special_tokens_map.json ... Still looking into how to work around this.
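A minimal sketch of such a check, assuming a local checkpoint directory (the path here is illustrative):

import os

model_dir = "./t5-base"  # illustrative local checkpoint directory

# The slow (sentencepiece-based) tokenizers need spiece.model / tokenizer.model,
# while the fast tokenizers can load from tokenizer.json alone.
for name in ("spiece.model", "tokenizer.model", "tokenizer.json",
             "tokenizer_config.json", "special_tokens_map.json"):
    path = os.path.join(model_dir, name)
    print(name, "found" if os.path.isfile(path) else "MISSING")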

fernando-neto-ai avatar May 09 '23 21:05 fernando-neto-ai

I had the same error when I downloaded only one model file; after downloading all the files in the directory, it was solved.

tlaymedown avatar Jul 06 '23 01:07 tlaymedown

Same error, even though I am not missing any files in the model path.

Zhanghahah avatar Aug 15 '23 12:08 Zhanghahah

Same error when loading a T5 model.

mhdi707 avatar Oct 09 '23 18:10 mhdi707

Same error loading T5. Has anyone solved this?

quanmai avatar Nov 08 '23 16:11 quanmai

Is this happening when loading a T5 model with Simple Transformers?

e.g.:

from simpletransformers.t5 import T5Model


model = T5Model("t5", "t5-base")

ThilinaRajapakse avatar Nov 11 '23 00:11 ThilinaRajapakse

Hi everyone, I got the same error before. I had forgotten to download the tokenizer.model (it is a git LFS file, so older versions of git may skip it during git clone), so I suspect anyone with an incomplete download may hit the same error. Make sure the file exists and its md5 checksum matches.
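A minimal sketch of such a check, with an illustrative path: an un-downloaded LFS file is just a small text pointer rather than the real binary model, so a size/header test catches it:

import os

path = "./llama-7b-hf/tokenizer.model"  # illustrative path to the tokenizer file

size = os.path.getsize(path)
with open(path, "rb") as f:
    head = f.read(64)

# A git-LFS pointer stub is a small text file beginning with "version https://git-lfs".
if size < 1024 or head.startswith(b"version https://git-lfs"):
    print("Looks like an LFS pointer stub; fetch the real file with `git lfs pull`.")
else:
    print(f"tokenizer.model looks complete ({size} bytes).")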

john-theo avatar Nov 11 '23 12:11 john-theo

To anyone still having this issue: for LLaMA you need to rename your vocab file to tokenizer.model, because that is the default filename the tokenizer looks for.
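A minimal sketch of that rename (the original filename and directory here are purely illustrative; copying keeps the old file in place in case something else references it):

import shutil

model_dir = "/path/to/llama-checkpoint"      # illustrative local checkpoint directory
shutil.copy(f"{model_dir}/my_vocab.model",   # illustrative original vocab filename
            f"{model_dir}/tokenizer.model")  # name the slow LLaMA tokenizer expects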

meyvan avatar Mar 19 '24 10:03 meyvan

Solved this: use AutoTokenizer.from_pretrained(model_id).
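For example (model_id here is illustrative; it can be a Hub model name or a local directory). By default AutoTokenizer prefers the fast tokenizer, which can load from tokenizer.json without the sentencepiece vocab file:

from transformers import AutoTokenizer

model_id = "t5-base"  # illustrative; use your Hub model name or local directory
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(tokenizer("Hello world").input_ids)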

ghLcd9dG avatar Apr 24 '24 09:04 ghLcd9dG