"from_pretrained" reads the wrong config file: "config.json" instead of "tokenizer_config.json"
Hi, I found an interesting bug (though I could be wrong) in from_pretrained. Below is the code that reproduces it.
model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    model,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)
The model directory contains the fine-tuned T5 tensors and the other necessary files with the training results. The directory tree is as follows:
model/
ㄴ config.json // T5 configuration, starting with "architectures": ["T5ForConditionalGeneration"]
ㄴ generation_config.json
ㄴ model.safetensors
ㄴ special_tokens_map.json
ㄴ spiece.model
ㄴ tokenizer.json
ㄴ tokenizer_config.json
...(other files)
Whenever I run the code above, I get the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "inference_script.py", line 33, in <module>
    tokenizer = T5Tokenizer.from_pretrained(
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2010, in from_pretrained
    resolved_config_file = cached_file(
  File "/python3.9/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'T5ForConditionalGeneration(
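The message does show the value that was passed as path_or_model_id, and it starts with 'T5ForConditionalGeneration('. A small check like the one below, run just before the tokenizer call, could confirm whether the argument is still a path string at that point. This is only a sketch: check_path_argument is a hypothetical helper written for illustration, not part of transformers.

```python
import os


def check_path_argument(value):
    """Hypothetical helper: describe the value about to be passed as a path."""
    if isinstance(value, (str, os.PathLike)):
        return f"path-like: {os.fspath(value)}"
    # Anything else (for example an already loaded model object) is not a path.
    return f"not a path: {type(value).__name__}"


# Example with a plain string path and with a non-path object.
print(check_path_argument("model/"))
print(check_path_argument(object()))
```

In the script above, this would be called as check_path_argument(model) right before T5Tokenizer.from_pretrained.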
However, after moving the tokenizer-related files into a separate directory and changing the code accordingly, the error goes away. The fixed code and the new directory layout are below.
model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    tokenizer_path,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)
Contents of tokenizer_path:
tokenizer_path/
ㄴ special_tokens_map.json
ㄴ spiece.model
ㄴ tokenizer.json
ㄴ tokenizer_config.json
Therefore, I guess that the tokenizer's from_pretrained() method is reading config.json rather than tokenizer_config.json.
If I am right, could you fix this in an upcoming release?
(It seems that if "config.json" and "tokenizer_config.json" exist in the same directory, "config.json" always wins.)
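To help test this guess, a small sketch like the following could report which of the relevant files are present in a given directory, so the two layouts can be compared side by side. list_config_files is a hypothetical helper written for illustration, not part of transformers, and the file names are simply the ones from the directory trees in this report.

```python
from pathlib import Path


def list_config_files(directory):
    """Hypothetical helper: report which config-related files exist in a directory."""
    names = [
        "config.json",
        "tokenizer_config.json",
        "special_tokens_map.json",
        "spiece.model",
    ]
    base = Path(directory)
    # Map each expected file name to whether it exists as a regular file.
    return {name: (base / name).is_file() for name in names}


# Example: inspect a directory named "model" (all False if it does not exist).
print(list_config_files("model"))
```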
Thanks for reading my issue!