
"from_pretrained" read wrong config file. not "tokenizer_config.json", but "config.json"

Open daehuikim opened this issue 1 year ago • 0 comments

Hi, I found an interesting bug (though I could be wrong) in `from_pretrained`. Below is the code that reproduces it.

model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    model,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)

The model directory contains the fine-tuned T5 tensors and the other files produced by training. The specific tree looks like this:

model/
├── config.json  // T5 configuration, starts with architecture: "T5ForConditionalGeneration"
├── generation_config.json
├── model.safetensors
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
├── tokenizer_config.json
...(other files)

Whenever I run the code above, I get errors like the following:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "inference_script.py", line 33, in <module>
    tokenizer = T5Tokenizer.from_pretrained(
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2010, in from_pretrained
    resolved_config_file = cached_file(
  File "/python3.9/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'T5ForConditionalGeneration(
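As an aside (a stdlib-only sketch, not code from transformers, using a hypothetical stand-in class): the quoted value in the `OSError` is simply the string form of whatever object was passed as `path_or_model_id`, so one way a message beginning with `T5ForConditionalGeneration(` can arise is when the formatted argument is a model object rather than a path string:

```python
# Hypothetical stand-in for a transformers model class; its repr begins
# with the class name followed by "(", like PyTorch module reprs do.
class T5ForConditionalGeneration:
    def __repr__(self):
        return "T5ForConditionalGeneration(\n  (shared): Embedding(...)\n)"

model = "model/"                      # initially a local path string
model = T5ForConditionalGeneration()  # later reassigned to the loaded model

# Interpolating the argument into an error message calls str(), which
# falls back to __repr__ here, so a non-string argument shows up as
# the object's multi-line repr, truncated at the first line break.
message = f"Incorrect path_or_model_id: '{model}'"
print(message.splitlines()[0])
# → Incorrect path_or_model_id: 'T5ForConditionalGeneration(
```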

However, after moving the tokenizer-related files into a separate directory and adjusting the code, I get no errors. Below are the fixed code and the changed repo tree.

model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    tokenizer_path,
    trust_remote_code=True,
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)

The tokenizer files now live in `tokenizer_path`:

tokenizer_path/
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
└── tokenizer_config.json

Therefore, I guess the tokenizer's `from_pretrained()` method is reading `config.json` instead of `tokenizer_config.json`. If I am right, can you fix this in an upcoming release? (It seems that when both "config.json" and "tokenizer_config.json" exist in the same directory, "config.json" always wins.) Thanks for reading my issue!

daehuikim · May 23 '24 02:05