
Add default tokenizer for gpt_neox (the same as gpt_neo)

aalok-sathe opened this issue 3 years ago · 1 comment

`tokenization_auto.py` was missing a mapping for `gpt_neox`, so `AutoTokenizer` initialization for GPT-NeoX failed at runtime:

File ..., in load_tokenizer(model_name_or_path='./gpt-neox-20b', **kwargs={'cache_dir': '.cache/'})
     16 def load_tokenizer(model_name_or_path: str = None, **kwargs) -> AutoTokenizer:
---> 17     return AutoTokenizer.from_pretrained(model_name_or_path, **kwargs)
        model_name_or_path = './gpt-neox-20b'
        kwargs = {'cache_dir': '.cache/'}

File lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py:525, in AutoTokenizer.from_pretrained(cls=<class 'transformers.models.auto.tokenization_auto.AutoTokenizer'>, pretrained_model_name_or_path='./gpt-neox-20b', *inputs=(), **kwargs={'_from_auto': True, 'cache_dir': '.cache/'})
    522         tokenizer_class = tokenizer_class_from_name(tokenizer_class_candidate)
    524     if tokenizer_class is None:
--> 525         raise ValueError(
    526             f"Tokenizer class {tokenizer_class_candidate} does not exist or is not currently imported."
    527         )
    528     return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    530 # Otherwise we have to be creative.
    531 # if model is an encoder decoder, the encoder tokenizer class is used by default

ValueError: Tokenizer class GPTNeoXTokenizer does not exist or is not currently imported.
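
For context, the fix the issue title suggests is a one-line addition to `TOKENIZER_MAPPING_NAMES` in `tokenization_auto.py`. A minimal sketch of that entry, reusing the `gpt_neo` classes; the neighboring lines are illustrative rather than a verbatim excerpt, and whether `GPT2Tokenizer` is actually the right class for NeoX is exactly what the reply below addresses:

```python
# transformers/models/auto/tokenization_auto.py (sketch; surrounding entries
# abbreviated). OrderedDict and is_tokenizers_available are already imported
# in that module.
TOKENIZER_MAPPING_NAMES = OrderedDict(
    [
        # ... existing entries ...
        ("gpt_neo", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        # Proposed: give gpt_neox the same default tokenizer classes as gpt_neo.
        ("gpt_neox", ("GPT2Tokenizer", "GPT2TokenizerFast" if is_tokenizers_available() else None)),
        # ... existing entries ...
    ]
)
```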

aalok-sathe · Apr 21 '22 17:04

Thanks! Actually, NeoX doesn't use the GPT2Tokenizer. I'll fix the current PR based on this, though.

zphang · Apr 21 '22 19:04
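
Since NeoX doesn't reuse the GPT-2 tokenizer classes, the library later shipped a dedicated `GPTNeoXTokenizerFast` for this mapping instead. Until the mapping landed, one workaround was to bypass `AutoTokenizer` entirely and load the checkpoint's `tokenizer.json` through the fast base class; a sketch, assuming the local `./gpt-neox-20b` directory from the traceback above ships a `tokenizer.json` (the 20B release does):

```python
from transformers import PreTrainedTokenizerFast

# Load the tokenizers-library file directly, skipping the missing auto mapping.
tokenizer = PreTrainedTokenizerFast.from_pretrained("./gpt-neox-20b")
print(tokenizer("hello world").input_ids)
```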