
Using `auto_map` in `tokenizer_config.json` gives `TypeError: argument of type 'NoneType' is not iterable`


System Info

certifi==2022.12.7
charset-normalizer==3.1.0
cmake==3.26.3
filelock==3.12.0
fsspec==2023.4.0
huggingface-hub==0.14.0
idna==3.4
Jinja2==3.1.2
lit==16.0.2
MarkupSafe==2.1.2
mpmath==1.3.0
networkx==3.1
numpy==1.24.3
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
packaging==23.1
PyYAML==6.0
regex==2023.3.23
requests==2.28.2
sentencepiece==0.1.98
sympy==1.11.1
tokenizers==0.13.3
torch==2.0.0
tqdm==4.65.0
-e git+https://github.com/huggingface/transformers.git@073baf7f2289dbbf99e29f375e40c3e270ba6e85#egg=transformers
triton==2.0.0
typing-extensions==4.5.0
urllib3==1.26.15

Who can help?

@ArthurZucker

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Running the following...

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-10b-chinese", trust_remote_code=True)

Gave the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jovyan/transformers/src/transformers/models/auto/tokenization_auto.py", line 692, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/jovyan/transformers/src/transformers/tokenization_utils_base.py", line 1812, in from_pretrained
    return cls._from_pretrained(
  File "/home/jovyan/transformers/src/transformers/tokenization_utils_base.py", line 1878, in _from_pretrained
    init_kwargs["auto_map"] = add_model_info_to_auto_map(
  File "/home/jovyan/transformers/src/transformers/utils/generic.py", line 563, in add_model_info_to_auto_map
    auto_map[key] = [f"{repo_id}--{v}" if "--" not in v else v for v in value]
  File "/home/jovyan/transformers/src/transformers/utils/generic.py", line 563, in <listcomp>
    auto_map[key] = [f"{repo_id}--{v}" if "--" not in v else v for v in value]
TypeError: argument of type 'NoneType' is not iterable

Expected behavior

The tokenizer should load without errors.

Analysis

  • I suspect it has to do with the auto_map entry in tokenizer_config.json here (see the sketch below)
  • The tokenizer loads fine with transformers version 4.27.0
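
To illustrate the suspicion, here is a standalone sketch of the failing path. The auto_map contents below are an assumption about what the remote tokenizer_config.json provides (a custom tokenizer class plus a null slot for the missing fast tokenizer); the class name is illustrative.

# Minimal sketch of the comprehension at utils/generic.py line 563 (from the traceback above).
# Only the None entry matters here.
repo_id = "THUDM/glm-10b-chinese"
auto_map = {"AutoTokenizer": ["tokenization_glm.GLMChineseTokenizer", None]}

for key, value in auto_map.items():
    # `"--" not in v` is evaluated for every entry, so a None value raises:
    # TypeError: argument of type 'NoneType' is not iterable
    auto_map[key] = [f"{repo_id}--{v}" if "--" not in v else v for v in value]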

larrylawl avatar Apr 25 '23 08:04 larrylawl

cc @sgugger, seems like #22814 added the following:

        if "auto_map" in init_kwargs and not _is_local:
            # For backward compatibility with old format.
            if isinstance(init_kwargs["auto_map"], (tuple, list)):
                init_kwargs["auto_map"] = {"AutoTokenizer": init_kwargs["auto_map"]}
            init_kwargs["auto_map"] = add_model_info_to_auto_map(
                init_kwargs["auto_map"], pretrained_model_name_or_path
            )

I can take this on, but you are more familiar with the changes.
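
For reference, a None guard along these lines would avoid the crash. This is a minimal sketch of the idea only, not the actual patch in the linked PR; the helper name and signature follow the traceback above.

def add_model_info_to_auto_map(auto_map, repo_id):
    # Prefix each custom class reference with the repo it came from,
    # skipping None entries (e.g. the missing fast-tokenizer slot).
    for key, value in auto_map.items():
        if isinstance(value, (tuple, list)):
            auto_map[key] = [
                f"{repo_id}--{v}" if v is not None and "--" not in v else v
                for v in value
            ]
        elif value is not None and "--" not in value:
            auto_map[key] = f"{repo_id}--{value}"
    return auto_map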

ArthurZucker avatar Apr 25 '23 09:04 ArthurZucker

Thanks for flagging! The PR linked above should fix this.

sgugger avatar Apr 25 '23 13:04 sgugger