
Can't load tokenizer for 'xlm-roberta-base'.

cmgchess opened this issue 3 years ago • 10 comments

This is what I get when trying to load `xlm-roberta-base`:

from openprompt.plms import load_plm
plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "xlm-roberta-base")
OSError                                   Traceback (most recent call last)
<ipython-input-3-bc593607bff3> in <module>
      1 from openprompt.plms import load_plm
----> 2 plm, tokenizer, model_config, WrapperClass = load_plm("roberta", "xlm-roberta-base")

1 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1758         if all(full_file_name is None for full_file_name in resolved_vocab_files.values()):
   1759             raise EnvironmentError(
-> 1760                 f"Can't load tokenizer for '{pretrained_model_name_or_path}'. If you were trying to load it from "
   1761                 "'https://huggingface.co/models', make sure you don't have a local directory with the same name. "
   1762                 f"Otherwise, make sure '{pretrained_model_name_or_path}' is the correct path to a directory "

OSError: Can't load tokenizer for 'xlm-roberta-base'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xlm-roberta-base' is the correct path to a directory containing all relevant files for a RobertaTokenizer tokenizer.


Help is much appreciated, thanks!

cmgchess avatar Oct 14 '22 05:10 cmgchess

Look into https://github.com/thunlp/OpenPrompt/blob/main/openprompt/plms/__init__.py#L87. 'xlm-roberta-base' is not the same as "roberta": it uses XLMRobertaConfig rather than RobertaConfig, and XLMRobertaTokenizer instead of RobertaTokenizer.

It would be possible to add an "xlm-roberta-base" entry to _MODEL_CLASSES there. Alternatively, you can copy the code inside load_plm into your Jupyter notebook and change model_class.config, model_class.tokenizer, etc. to the XLM-R equivalents, as in the sketch below.
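For instance, a minimal sketch of the second route, reproducing load_plm's steps in the notebook (the transformers class names are the standard ones; MLMTokenizerWrapper is an assumption, chosen because XLM-R, like RoBERTa, is a masked language model):

from transformers import XLMRobertaConfig, XLMRobertaTokenizer, XLMRobertaForMaskedLM
from openprompt.plms.mlm import MLMTokenizerWrapper

model_path = "xlm-roberta-base"

# Mirror what load_plm returns: (plm, tokenizer, model_config, WrapperClass)
model_config = XLMRobertaConfig.from_pretrained(model_path)
plm = XLMRobertaForMaskedLM.from_pretrained(model_path, config=model_config)
tokenizer = XLMRobertaTokenizer.from_pretrained(model_path)
WrapperClass = MLMTokenizerWrapper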

Achazwl avatar Oct 14 '22 07:10 Achazwl

@Achazwl thank you! Any plans to extend the framework to XLM-R in the future?

cmgchess avatar Oct 14 '22 13:10 cmgchess

Dear all, I have a question about modifying __init__.py; please guide me. I want to use the SciBERT model from Hugging Face, so I tried adding the model and tokenizer to __init__.py in Colab, but I don't know what the config or wrapper should be. After that, I closed __init__.py and ran again, but SciBERT is not recognized. How can I test other models from Hugging Face?

HodaMemar avatar Jan 13 '23 08:01 HodaMemar

> Dear all, I have a question about modifying __init__.py; please guide me. I want to use the SciBERT model from Hugging Face, so I tried adding the model and tokenizer to __init__.py in Colab, but I don't know what the config or wrapper should be. After that, I closed __init__.py and ran again, but SciBERT is not recognized. How can I test other models from Hugging Face?

After you modify the code, you should reload it in your Python working space, e.g.:

from imp import reload
openprompt = reload(openprompt)
load_plm = openprompt.plms.load_plm

And to make Python pick up your modified copy in the first place, put its location at the front of the path before importing:

import sys
sys.path.insert(0, '/location_path/OpenPrompt')
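Putting both steps together, a minimal sketch for a Colab session (/content/OpenPrompt is an assumed clone location; importlib.reload is the non-deprecated form of imp.reload, and note that the submodule openprompt.plms itself must be reloaded for an edited _MODEL_CLASSES to take effect):

import sys
sys.path.insert(0, '/content/OpenPrompt')  # do this BEFORE the first openprompt import

import importlib
import openprompt.plms

# Re-execute openprompt/plms/__init__.py from the location it was first
# imported from, so in-place edits to _MODEL_CLASSES are picked up
importlib.reload(openprompt.plms)
load_plm = openprompt.plms.load_plm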

kinghmy avatar Mar 29 '23 12:03 kinghmy

Thank you for your reply.

I changed the code in Colab as shown in the attached figure. Adding the model results in an error; I probably didn't reload the module correctly. Your guidance in this regard will be very valuable.


HodaMemar avatar Mar 29 '23 14:03 HodaMemar

Thank you for your reply.

I changed the code in Colab as below:

1- Add this model to __init__.py:

'PubMedBERT': ModelClass(**{
    'config': BertConfig,
    'tokenizer': AutoTokenizer.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'),
    'model': AutoModel.from_pretrained('microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext'),
    'wrapper': MLMTokenizerWrapper,
}),

2- Reload the module:

import sys
import importlib
sys.path.insert(0, '/content/OpenPrompt')
importlib.reload(sys)

3- Run the cell:

plm, tokenizer, model_config, WrapperClass = load_plm("PubMedBERT", 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

4- I get this error:

KeyError                                  Traceback (most recent call last)
in <cell line: 1>()
----> 1 plm, tokenizer, model_config, WrapperClass = load_plm("PubMedBERT", 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext')

1 frames
/content/OpenPrompt/OpenPrompt/openprompt/plms/__init__.py in get_model_class(plm_type)
     89         "tokenizer": GPT2Tokenizer,
     90         "model": GPTJForCausalLM,
---> 91         "wrapper": LMTokenizerWrapper
     92     }),
     93 }

KeyError: 'PubMedBERT'

Adding the model results in an error. I probably didn't reload the module correctly. Your guidance in this regard will be very valuable.

HodaMemar avatar Apr 09 '23 07:04 HodaMemar


Hi,

If you want a potential fix that goes around OpenPrompt's load_plm function, you can load each component in separately and then piece them together. For instance, the SciBERT model should still work with OpenPrompt's MLM tokenizer wrapper.

Imports

from openprompt.plms.seq2seq import T5TokenizerWrapper, T5LMTokenizerWrapper
from openprompt.plms.lm import LMTokenizerWrapper
from openprompt.plms.mlm import MLMTokenizerWrapper
from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoModelForMaskedLM, AutoTokenizer

Load components separately

model_name = "your_model_name_here"
plm = AutoModelForMaskedLM.from_pretrained(model_name)
WrapperClass = MLMTokenizerWrapper
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

Then you pass these to the prompt dataloader as you normally would. I do not have time right now to test this for the models outlined in this issue, but it has worked for me with custom models, and SciBERT should, in principle, work directly with the OpenPrompt MLMTokenizerWrapper under the hood.
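For completeness, a minimal sketch of that last step, assuming a classification-style setup (the dataset variable, a list of InputExample objects, and the template text are hypothetical placeholders):

from openprompt import PromptDataLoader
from openprompt.prompts import ManualTemplate

# A simple manual template; text_a comes from each InputExample
template = ManualTemplate(tokenizer=tokenizer, text='{"placeholder":"text_a"} It was {"mask"}.')

# Pass the separately loaded tokenizer and wrapper class, exactly as if they
# had come from load_plm
data_loader = PromptDataLoader(
    dataset=dataset,  # list of openprompt.data_utils.InputExample
    template=template,
    tokenizer=tokenizer,
    tokenizer_wrapper_class=WrapperClass,
    batch_size=4,
)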

NtaylorOX avatar May 09 '23 10:05 NtaylorOX

Hello, I have received your email and will handle it as soon as possible. Thank you!

kinghmy avatar May 09 '23 10:05 kinghmy

Hi

Thank you very much for your time and explanation.


HodaMemar avatar May 09 '23 10:05 HodaMemar