
ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.

Open · zhashen opened this issue 2 years ago · 7 comments

When I tried

!python qlora.py --learning_rate 0.0001 --model_name_or_path EleutherAI/gpt-neox-20b --trust_remote_code

in Colab, I got the following error:

2023-06-03 13:54:17.113623: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
loading base model EleutherAI/gpt-neox-20b...
Loading checkpoint shards: 100% 46/46 [04:20<00:00,  5.66s/it]
adding LoRA modules...
trainable params: 138412032.0 || all params: 10865725440 || trainable: 1.2738406907509712
loaded model
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/qlora/qlora.py:790 in <module>                                      │
│                                                                              │
│   787 │   │   │   fout.write(json.dumps(all_metrics))                        │
│   788                                                                        │
│   789 if __name__ == "__main__":                                             │
│ ❱ 790 │   train()                                                            │
│   791                                                                        │
│                                                                              │
│ /content/qlora/qlora.py:635 in train                                         │
│                                                                              │
│   632 │   set_seed(args.seed)                                                │
│   633 │                                                                      │
│   634 │   # Tokenizer                                                        │
│ ❱ 635 │   tokenizer = AutoTokenizer.from_pretrained(                         │
│   636 │   │   args.model_name_or_path,                                       │
│   637 │   │   cache_dir=args.cache_dir,                                      │
│   638 │   │   padding_side="right",                                          │
│                                                                              │
│ /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenizatio │
│ n_auto.py:691 in from_pretrained                                             │
│                                                                              │
│   688 │   │   │   │   tokenizer_class = tokenizer_class_from_name(tokenizer_ │
│   689 │   │   │                                                              │
│   690 │   │   │   if tokenizer_class is None:                                │
│ ❱ 691 │   │   │   │   raise ValueError(                                      │
│   692 │   │   │   │   │   f"Tokenizer class {tokenizer_class_candidate} does │
│   693 │   │   │   │   )                                                      │
│   694 │   │   │   return tokenizer_class.from_pretrained(pretrained_model_na │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently 
imported.

zhashen avatar Jun 03 '23 14:06 zhashen

Check your config.json (the one that ships with the model weights) and see whether the tokenizer class name is misspelled. This happens often with mixed-case names.
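
For example, a quick way to inspect what the config actually says (a minimal sketch; `model_dir` is a placeholder for wherever the weights were downloaded):

```python
# Hypothetical sketch: print the tokenizer class recorded alongside the
# model weights. transformers reads "tokenizer_class" from
# tokenizer_config.json (and may fall back to config.json).
import json
from pathlib import Path

model_dir = Path("/path/to/downloaded/model")  # hypothetical path
for name in ("tokenizer_config.json", "config.json"):
    cfg = model_dir / name
    if cfg.exists():
        print(name, "->", json.loads(cfg.read_text()).get("tokenizer_class"))
```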

phalexo avatar Jun 03 '23 15:06 phalexo

```python
tokenizer = AutoTokenizer.from_pretrained(
    args.model_name_or_path,
    cache_dir=args.cache_dir,
    padding_side="right",
    use_fast=True,  # Fast tokenizer giving issues.
    tokenizer_type='llama' if 'llama' in args.model_name_or_path else None,  # Needed for HF name change
)
```

T-Atlas avatar Jun 05 '23 06:06 T-Atlas

I had this issue when I ran python3 qlora.py, and I second @T-Atlas's solution.

The reason is that the default model in qlora.py is EleutherAI/pythia-12b

https://github.com/artidoro/qlora/blob/3da535abdfaa29a2d0757eab0971664ed2cd97e8/qlora.py#L53-L55

which depends on GPTNeoXTokenizer.

https://huggingface.co/EleutherAI/pythia-12b/blob/main/tokenizer_config.json#L7

GPTNeoXTokenizer exists only as a fast tokenizer (GPTNeoXTokenizerFast).

https://github.com/huggingface/transformers/issues/17756#issuecomment-1534219526

But qlora.py disables the use of fast tokenizers.
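
To make the mismatch concrete, here is a minimal repro sketch of the two code paths (assuming a recent transformers install):

```python
# Minimal repro sketch: GPT-NeoX models ship only GPTNeoXTokenizerFast,
# so requesting the slow tokenizer fails with the exact ValueError from
# this issue, while the fast one loads fine.
from transformers import AutoTokenizer

model_id = "EleutherAI/pythia-12b"  # qlora.py's default model

try:
    AutoTokenizer.from_pretrained(model_id, use_fast=False)
except ValueError as e:
    print("use_fast=False:", e)  # "Tokenizer class GPTNeoXTokenizer does not exist..."

tok = AutoTokenizer.from_pretrained(model_id, use_fast=True)
print("use_fast=True:", type(tok).__name__)  # GPTNeoXTokenizerFast
```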

wangkuiyi avatar Jun 08 '23 17:06 wangkuiyi

it works

SeekPoint avatar Jun 09 '23 13:06 SeekPoint

> it works

What works? Can you elaborate?

pzdkn avatar Jul 13 '23 09:07 pzdkn

I had to change "tokenizer_class": "GPTNeoXTokenizer" to "tokenizer_class":"GPTNeoXTokenizerFast" in tokenizer_config.json.
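
A sketch of that manual fix (`cfg_path` is a hypothetical local path to the downloaded model files):

```python
# Rewrite tokenizer_config.json so it names the fast class, which is the
# one that actually exists in transformers.
import json
from pathlib import Path

cfg_path = Path("/path/to/model/tokenizer_config.json")  # hypothetical
cfg = json.loads(cfg_path.read_text())
if cfg.get("tokenizer_class") == "GPTNeoXTokenizer":
    cfg["tokenizer_class"] = "GPTNeoXTokenizerFast"
    cfg_path.write_text(json.dumps(cfg, indent=2))
```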

WillsonAmalrajA avatar Jul 14 '23 02:07 WillsonAmalrajA

> (quoting @wangkuiyi's explanation above)

Enabling fast tokenizers in qlora.py fixed this for me. Although the script's comment says the fast tokenizer was giving issues, setting use_fast to False is exactly what produces the error the OP describes.

olympus-terminal avatar May 23 '24 05:05 olympus-terminal