
cache_dir does not work with AutoTokenizer.from_pretrained('gpt2')

Open irene622 opened this issue 1 year ago • 1 comment

System Info

I am using transformers 4.11.3, Python 3.8.5, and Ubuntu 20.04.1.

I want to know the cache directory used when downloading with AutoTokenizer.from_pretrained('gpt2'). I run the code below:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.cache_dir

and the result is an AttributeError:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GPT2TokenizerFast' object has no attribute 'cache_dir'

When running tokenizer.cache_dir() instead, the result is the same AttributeError.

The tokenizer is downloaded by CodeParrot. CodeParrot lives in transformers/examples/research_projects/codeparrot/, and codeparrot/scripts/bpe_training.py downloads the tokenizer via AutoTokenizer.from_pretrained('gpt2').

How can I get the cache directory path of the tokenizer? What is my problem?

I want to get it from a method or attribute of the tokenizer, not as a hard-coded path. (I already found that ~/.cache/huggingface/transformers contains the cache files.) If possible, I would also like to know how the tokenizer uses the three files: the .json file, the .lock file, and the last file with no extension.

Who can help?

No response

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

[Screenshot 2023-04-18 at 6 37 54 PM: the tokenizer code in CodeParrot]

My running code is shown in [Screenshot 2023-04-18 at 6 38 32 PM] and the scripts/bpe_training.py code in [Screenshot 2023-04-18 at 6 39 41 PM].

Expected behavior

I want to get the cache directory path of the downloaded tokenizer, from a method or attribute of the tokenizer rather than a hard-coded path. (I already found that ~/.cache/huggingface/transformers contains the cache files.)

Moreover, if possible, I would like to know how the three cache files are used: the .json file, the .lock file, and the file with no extension.

irene622 avatar Apr 18 '23 09:04 irene622

Hi @irene622, thanks for raising this issue!

cache_dir isn't an attribute of the class, and so calling tokenizer.cache_dir will raise an error.

You can find the cache directory by importing it from utils:

from transformers.utils import TRANSFORMERS_CACHE
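For context, here is a minimal sketch of how that default path is typically derived when no cache_dir argument or environment override is given (an approximation for illustration; the authoritative value is the TRANSFORMERS_CACHE constant itself):

```python
import os

# Approximation of the default cache location used by transformers v4.x:
# HF_HOME (if set) takes precedence, otherwise XDG_CACHE_HOME/huggingface,
# otherwise ~/.cache/huggingface; "transformers" is appended for this library.
xdg_cache = os.getenv("XDG_CACHE_HOME", os.path.join("~", ".cache"))
hf_home = os.path.expanduser(os.getenv("HF_HOME", os.path.join(xdg_cache, "huggingface")))
default_transformers_cache = os.path.join(hf_home, "transformers")
print(default_transformers_cache)
```

Passing cache_dir=... to from_pretrained overrides this default for that one call.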

When a tokenizer is created, it should have the name_or_path attribute set, which tells you which model repo or local path it was loaded from.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-mlm-en-2048")
>>> tokenizer.name_or_path
'xlm-mlm-en-2048'
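Regarding the three files in the cache: in this version of transformers, each cached file is stored under a hashed name with no extension, alongside a .json sidecar recording the source url and etag, and a .lock file used only to guard concurrent downloads. A hedged sketch for inspecting that layout (the describe_cache helper below is hypothetical, written for illustration against the default cache directory):

```python
import json
import os

# Default cache directory for this transformers version; adjust if you
# set cache_dir or HF_HOME yourself.
cache_dir = os.path.expanduser("~/.cache/huggingface/transformers")

def describe_cache(cache_dir):
    """Pair each extensionless cached blob with the source URL recorded
    in its .json metadata sidecar. Returns a list of (blob_name, url)."""
    entries = []
    if not os.path.isdir(cache_dir):
        return entries
    for name in sorted(os.listdir(cache_dir)):
        if name.endswith(".json"):
            with open(os.path.join(cache_dir, name)) as f:
                meta = json.load(f)  # sidecar holds {"url": ..., "etag": ...}
            blob = name[: -len(".json")]  # matching extensionless data file
            entries.append((blob, meta.get("url")))
    return entries

for blob, url in describe_cache(cache_dir):
    print(blob, "<-", url)
```

The extensionless file holds the actual content (e.g. the tokenizer's vocab or merges), the .json sidecar maps it back to its URL, and the .lock file can be ignored once a download has finished.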

amyeroberts avatar Apr 24 '23 11:04 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 18 '23 15:05 github-actions[bot]