
AutoTokenizer.from_pretrained reads the wrong config file: "config.json" instead of "tokenizer_config.json"

Open daehuikim opened this issue 1 year ago • 3 comments

System Info

  • transformers version: 4.40.0
  • Platform: Linux-4.18.0-425.3.1.el8.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.0
  • Huggingface_hub version: 0.20.1
  • Safetensors version: 0.4.1
  • Accelerate version: 0.30.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.3.0+cu121 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@younesbelkada @ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Hi, I found an interesting bug (though I could be wrong) in from_pretrained. Below is the code that reproduces it:

model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    model, 
    trust_remote_code=True, 
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)

The model directory contains the fine-tuned T5 tensors and the other files produced by training. The exact tree looks like this:

model/
├── config.json  // T5 configuration; starts with architecture: "T5ForConditionalGeneration"
├── generation_config.json
├── model.safetensors
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
├── tokenizer_config.json
└── ...(other files)
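As a quick sanity check before debugging (my own sketch, not part of transformers), one can verify that a local checkpoint directory actually contains the files each from_pretrained call is expected to read; the file names below come from the tree above:

```python
import os
import tempfile

# Hypothetical helper: verify a local checkpoint directory contains the
# files listed in the tree above before calling from_pretrained on it.
MODEL_FILES = ["config.json", "model.safetensors"]
TOKENIZER_FILES = [
    "tokenizer_config.json",
    "tokenizer.json",
    "spiece.model",
    "special_tokens_map.json",
]

def missing_files(directory, names):
    """Return the expected file names that are absent from `directory`."""
    return [n for n in names if not os.path.isfile(os.path.join(directory, n))]

# Demo against a throwaway directory that mimics the tree above.
with tempfile.TemporaryDirectory() as d:
    for name in MODEL_FILES + TOKENIZER_FILES:
        open(os.path.join(d, name), "w").close()
    print(missing_files(d, MODEL_FILES + TOKENIZER_FILES))  # []
```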

Whenever I run the code above, I get the following error:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "inference_script.py", line 33, in <module>
    tokenizer = T5Tokenizer.from_pretrained(
  File "/python3.9/site-packages/transformers/tokenization_utils_base.py", line 2010, in from_pretrained
    resolved_config_file = cached_file(
  File "/python3.9/site-packages/transformers/utils/hub.py", line 462, in cached_file
    raise EnvironmentError(
OSError: Incorrect path_or_model_id: 'T5ForConditionalGeneration(
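One possible reading of that error message (my own interpretation, not confirmed by the maintainers): the reproduction code reassigns the variable `model` from the checkpoint path to the loaded model object, so the second from_pretrained call receives the object itself, whose string representation begins with "T5ForConditionalGeneration(" and then gets treated as a path. A minimal sketch of that shadowing pattern, using a stand-in class:

```python
# Minimal sketch of the variable shadowing in the reproduction above.
# FakeModel stands in for the real T5ForConditionalGeneration object.
class FakeModel:
    def __repr__(self):
        return "T5ForConditionalGeneration(\n  (shared): Embedding(...)\n)"

model = "model/"      # initially the checkpoint directory path
model = FakeModel()   # reassigned to the loaded model object

# What a path-expecting API would see if handed `model` afterwards:
path_or_model_id = str(model)
print(path_or_model_id.startswith("T5ForConditionalGeneration("))  # True
```

If this is what happened, the fix on the caller's side is simply to keep the path in its own variable and pass that to both from_pretrained calls.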

However, after moving the tokenizer-related files into a separate directory and fixing some code, I get no errors. Below are the fixed code and the changed repo tree.

model = T5ForConditionalGeneration.from_pretrained(
    model,
    local_files_only=True
)
model.to(device)
tokenizer = T5Tokenizer.from_pretrained(
    tokenizer_path, 
    trust_remote_code=True, 
    TOKENIZERS_PARALLELISM=True,
    local_files_only=True,
    skip_special_tokens=True
)

Contents of tokenizer_path:

tokenizer_path/
├── special_tokens_map.json
├── spiece.model
├── tokenizer.json
└── tokenizer_config.json

Expected behavior

Therefore, I guess the tokenizer's from_pretrained() method is reading config.json rather than tokenizer_config.json. If I am right, can you fix this in an upcoming release? (It seems that when both "config.json" and "tokenizer_config.json" exist in the same directory, "config.json" always wins.) Thanks for reading my issue!

daehuikim avatar May 23 '24 02:05 daehuikim

cc @Rocketknight1, would be nice if you could have a look! (trust remote code!)

ArthurZucker avatar May 23 '24 14:05 ArthurZucker

Just noticed with a quick look: is it expected that you have "toEKnizer.json" instead of "toKEnizer.json"? Might that lead to an error, or is it just a typo in the description?

qubvel avatar May 23 '24 18:05 qubvel

@qubvel That's a typo in the description! :) It's because I typed the filenames by myself. (I fixed it in the description.)

daehuikim avatar May 24 '24 00:05 daehuikim

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 22 '24 08:06 github-actions[bot]