AutoTokenizer.from_pretrained reads the wrong config file: "config.json" instead of "tokenizer_config.json"
System Info
- `transformers` version: 4.40.0
- Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.0
- Huggingface_hub version: 0.20.1
- Safetensors version: 0.4.1
- Accelerate version: 0.30.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.3.0+cu121 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@younesbelkada @ArthurZucker
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Hi, I found an interesting bug (though I could be wrong) in from_pretrained. Below is the code that reproduces my bug.
```python
model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    model,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)
```
The model directory contains the fine-tuned T5 tensors and the other necessary files from training. The exact tree is as follows:
```
model/
├── config.json            // T5 configuration starting with architectures: "T5ForConditionalGeneration"
├── generation_config.json
├── model.safetensors
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
├── tokenizer_config.json
└── ...(other files)
```
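As a sanity check of the layout above (this is not the transformers internals, just a standalone sketch), the tree can be recreated and probed to see which config file a tokenizer-first lookup would resolve; the lookup order used here is my assumption about the expected behavior:

```python
import json
import tempfile
from pathlib import Path

# Recreate the reported directory layout (file contents are placeholders).
model_dir = Path(tempfile.mkdtemp()) / "model"
model_dir.mkdir()
(model_dir / "config.json").write_text(
    json.dumps({"architectures": ["T5ForConditionalGeneration"]})
)
(model_dir / "tokenizer_config.json").write_text(
    json.dumps({"tokenizer_class": "T5Tokenizer"})
)

# Expected behavior (my assumption): a tokenizer loader should prefer
# tokenizer_config.json over config.json whenever both files exist.
candidates = ["tokenizer_config.json", "config.json"]
resolved = next(name for name in candidates if (model_dir / name).is_file())
print(resolved)  # tokenizer_config.json
```

With both files present, the sketch resolves `tokenizer_config.json` first, which is what the issue argues should also happen in `from_pretrained`.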
Whenever I run the code above, I get an error like the one below:
```
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "inference_script.py", line 33, in <module>
    tokenizer = T5Tokenizer.from_pretrained(
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2010, in from_pretrained
    resolved_config_file = cached_file(
  File "/python3.9/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'T5ForConditionalGeneration(
```
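For what it's worth, the quoted OSError shows the repr of the model object itself being passed as `path_or_model_id`. Whether or not the config-file precedence described below is also at play, a minimal sketch (with a hypothetical `FakeModel` class standing in for the transformers classes) shows how rebinding the `model` variable can produce exactly this kind of message:

```python
# Sketch of the rebinding in the reproduction: `model` starts as a path
# string, then is rebound to the loaded object, so the later tokenizer-style
# call receives the object rather than a path. FakeModel is illustrative only.
class FakeModel:
    @classmethod
    def from_pretrained(cls, path_or_model_id):
        # Real loaders expect a string path or repo id here.
        if not isinstance(path_or_model_id, str):
            raise OSError(f"Incorrect path_or_model_id: '{path_or_model_id!r}'")
        return cls()

    def __repr__(self):
        return "T5ForConditionalGeneration("

model = "model/"                          # a path string at first
model = FakeModel.from_pretrained(model)  # now rebound to a model object

try:
    FakeModel.from_pretrained(model)      # tokenizer-style call gets the object
except OSError as err:
    message = str(err)
print(message)  # Incorrect path_or_model_id: 'T5ForConditionalGeneration('
```

Keeping the checkpoint path in its own variable avoids this particular failure mode regardless of how the config files are resolved.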
However, after moving the tokenizer-related files and fixing some code, I get no errors. Below are the fixed code and the changed repo tree.
```python
model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    tokenizer_path,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)
```
In `tokenizer_path`:

```
tokenizer_path/
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
└── tokenizer_config.json
```
Expected behavior
Therefore, I guess the tokenizer's from_pretrained() method is reading config.json instead of tokenizer_config.json.
If I am right, could you fix this in an upcoming release?
(It seems that if both "config.json" and "tokenizer_config.json" exist at the same time, "config.json" always wins.)
Thanks for reading my issue!
cc @Rocketknight1, it would be nice if you could have a look! (trust remote code!)
Just noticed with a quick look: is it expected that you have "toEKnizer.json" instead of "toKEnizer.json"? Might that lead to an error, or is it just a typo in the description?
@qubvel That's a typo in the description! :) It's because I typed the filenames myself (I've fixed it in the description).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.