cache_dir of AutoTokenizer.from_pretrained('gpt2') does not work
System Info
My transformers version is 4.11.3, my Python version is 3.8.5, and my OS is Ubuntu 20.04.1.
I want to know the cache directory used when AutoTokenizer.from_pretrained('gpt2') downloads files. I run the code below:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
tokenizer.cache_dir
```
Then the result is an AttributeError:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'GPT2TokenizerFast' object has no attribute 'cache_dir'
```
Running `tokenizer.cache_dir()` raises the same AttributeError.
The downloaded tokenizer is from CodeParrot, which lives in transformers/examples/research_projects/codeparrot/; its codeparrot/scripts/bpe_training.py downloads the tokenizer with AutoTokenizer.from_pretrained('gpt2').
How can I get the cache directory path of the tokenizer? What is my problem?
I want to get it from a method or attribute of the tokenizer, not as a hard-coded path. (I have already found the cache files in ~/.cache/huggingface/transformers.) If possible, I would also like to know how the tokenizer uses the three kinds of cache files: the .json file, the .lock file, and the file with no extension.
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction

The tokenizer code is in CodeParrot. My running code is the snippet shown under System Info above, and scripts/bpe_training.py downloads the same tokenizer with AutoTokenizer.from_pretrained('gpt2').
Expected behavior
I want to get the cache directory path of the downloaded tokenizer from a method or attribute of the tokenizer, not as a hard-coded path. (I have already found the cache files in ~/.cache/huggingface/transformers.)
Moreover, if possible, I would like to know how the tokenizer uses the three kinds of cache files: .json, .lock, and the file with no extension.
Hi @irene622, thanks for raising this issue!
`cache_dir` isn't an attribute of the class, so calling `tokenizer.cache_dir` will raise an error.
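For completeness (this is a documented `from_pretrained` option, not something shown in the original thread): `cache_dir` is a keyword argument at load time, so if you want the files in a specific directory you can pass it there. A minimal sketch, with a made-up directory name:

```python
from transformers import AutoTokenizer

# cache_dir is an argument to from_pretrained(), not a tokenizer attribute;
# the downloaded files are stored under the given directory.
tokenizer = AutoTokenizer.from_pretrained("gpt2", cache_dir="/tmp/my_hf_cache")
```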
You can find the default cache directory by importing it from utils:
```python
from transformers.utils import TRANSFORMERS_CACHE
```
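On the question about the three kinds of files, here is a sketch of inspecting the cache, assuming the flat cache layout that transformers 4.11 uses: files are stored under hashed names, where the extensionless file is the downloaded blob, the `.json` file with the same stem records the source URL and etag, and the `.lock` file is a filelock held while downloading.

```python
import json
import os

from transformers.utils import TRANSFORMERS_CACHE

print(TRANSFORMERS_CACHE)  # typically ~/.cache/huggingface/transformers

# Each .json file is metadata for the extensionless blob with the same stem.
for name in sorted(os.listdir(TRANSFORMERS_CACHE)):
    if name.endswith(".json"):
        with open(os.path.join(TRANSFORMERS_CACHE, name)) as fh:
            meta = json.load(fh)
        print(name, "->", meta.get("url"))
```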
When a tokenizer is created, it should have the `name_or_path` attribute set, which tells you which model repo or local path it was loaded from.
```python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("xlm-mlm-en-2048")
>>> tokenizer.name_or_path
'xlm-mlm-en-2048'
```
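If the goal is the concrete cached path of a downloaded tokenizer file, here is a sketch using helpers from this transformers era (`hf_bucket_url` and `cached_path` live in `transformers.file_utils` in 4.x releases); the choice of `tokenizer.json` as the filename is an assumption, and you can substitute whichever tokenizer file you need:

```python
from transformers.file_utils import cached_path, hf_bucket_url

# Build the hub URL for one of gpt2's tokenizer files, then resolve it to
# its location inside the local cache (downloading only if it is missing).
url = hf_bucket_url("gpt2", filename="tokenizer.json")
local_path = cached_path(url)
print(local_path)  # a hashed filename inside TRANSFORMERS_CACHE
```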
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.