
GPT-J Preprocessing Incorrectly Tokenizes `<|endoftext|>`

Open mitchellgordon95 opened this issue 3 years ago • 8 comments

Description

Expected behavior:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
>>> tokenizer.encode('<|endoftext|>')
[50256]

Reproduced Steps

Actual behavior:

$ cd all_models/gptj/preprocessing/1
$ python
>>> from word_list import to_word_list_format
>>> import numpy as np
>>> to_word_list_format(np.array([['<|endoftext|>']]))
array([[[  27,   91,  437, 1659, 5239,   91,   29],
        [   7,   -1,   -1,   -1,   -1,   -1,   -1]]], dtype=int32)

BPE merges seem to be working correctly. However, during pre-tokenization, <|endoftext|> is broken up into <|, endoftext, and |>, and merges are then applied to each part separately. This seems incorrect if we take the Hugging Face implementation as the reference.

I came across this bug trying to ban <|endoftext|> using the bad_words parameter.

mitchellgordon95 avatar Sep 14 '22 19:09 mitchellgordon95

Can you try putting <|endoftext|> into bad_words_list and printing bad_words_ids in end_to_end_test.py? You may not be passing the correct input to the function.

byshiue avatar Sep 15 '22 13:09 byshiue

$ cp tools/end_to_end_test.py tools/end_to_end_test_2.py
$ vim tools/end_to_end_test_2.py
$ diff tools/end_to_end_test.py tools/end_to_end_test_2.py
138c138
<             ["Hawks, Hawks"],
---
>             ["<|endoftext|>"],
145a146
>         print(f'bad_words_list: {bad_words_list}')
171c172
<             print(output0, output1, output2)
---
>             print(f'BAD_WORDS_IDS: {output3}')
$ python tools/end_to_end_test_2.py
bad_words_list: [['<|endoftext|>']
 ['']
 ['']
 ['']
 ['']
 ['']
 ['']
 ['']]
============After preprocessing============
BAD_WORDS_IDS: [[[  27   91  437 1659 5239   91   29]
  [   7   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]

 [[   0    0    0    0    0    0    0]
  [  -1   -1   -1   -1   -1   -1   -1]]]
===========================================

mitchellgordon95 avatar Sep 15 '22 17:09 mitchellgordon95
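
For reference, the array printed above has shape (batch, 2, max_len): row 0 holds the flattened token ids of the banned words and row 1 holds the cumulative offset at which each word ends, padded with -1 (so [7, -1, ...] above means a single banned word that consumed 7 tokens). If <|endoftext|> were tokenized as one special token, the first entry should instead look roughly like the hand-written illustration below (same padded width assumed):

import numpy as np

# Hypothetical expected first batch entry: one banned word made of one token.
expected_first_entry = np.array(
    [[50256,  0,  0,  0,  0,  0,  0],   # flattened token ids
     [    1, -1, -1, -1, -1, -1, -1]],  # cumulative word offsets, -1 padded
    dtype=np.int32)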

The offending line is probably this https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/gptj/preprocessing/1/utils/gpt_token_encoder.py#L91

But I don't understand what it does.

mitchellgordon95 avatar Sep 15 '22 17:09 mitchellgordon95

$ python
Python 3.6.9 (default, Dec  8 2021, 21:08:43)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex as re
>>> pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> re.findall(pat, "<|endoftext|>")
['<|', 'endoftext', '|>']

mitchellgordon95 avatar Sep 15 '22 17:09 mitchellgordon95

Because the regex splits <|endoftext|> into three parts, the whole string can never be BPE-encoded as a single unit: https://github.com/triton-inference-server/fastertransformer_backend/blob/main/all_models/gptj/preprocessing/1/utils/gpt_token_encoder.py#L138

mitchellgordon95 avatar Sep 15 '22 17:09 mitchellgordon95
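
One way to avoid the split (a sketch only, not the backend's actual code) is to pull known special tokens out of the text before the regex pre-tokenization runs, and only feed the ordinary pieces through the regex + BPE path. Here encoder stands for the object returned by get_encoder() in gpt_token_encoder.py, and its encode() method is assumed to be the existing regex + BPE implementation:

import regex as re

SPECIAL_TOKENS = {"<|endoftext|>": 50256}  # GPT-2/GPT-J end-of-text id

def encode_with_specials(encoder, text):
    # Split on the special-token strings first; re.split with a capturing
    # group keeps the matched special tokens in the result.
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in SPECIAL_TOKENS:
            ids.append(SPECIAL_TOKENS[piece])
        elif piece:
            ids.extend(encoder.encode(piece))  # existing regex + BPE path
    return ids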

Hi, @mitchellgordon95. Thank you for the feedback. As you say, there are some issues in the current converter. A simple workaround is to replace the current tokenizer with the Hugging Face tokenizer directly. For example, replace

def to_word_list_format(word_dict):
    tokenizer = get_tokenizer()

with

from pathlib import Path

from transformers import AutoTokenizer

def to_word_list_format(word_dict):
    cache_dir = Path(__file__).parent / ".cache"
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", cache_dir=cache_dir)

We are also considering a better solution for tokenization.

byshiue avatar Sep 23 '22 07:09 byshiue
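
Since the diff above only shows the first lines of to_word_list_format, here is a minimal sketch of how the whole function could be written on top of the Hugging Face tokenizer. The body is reconstructed for illustration (it is not the backend's original code); the output layout follows the [ids, offsets] format seen in BAD_WORDS_IDS above, and the entries of word_dict are assumed to be Python strings (the Triton backend may pass bytes):

from pathlib import Path

import numpy as np
from transformers import AutoTokenizer

def to_word_list_format(word_dict):
    # word_dict has shape (batch, 1); each entry is a comma-separated list
    # of words/phrases to ban or stop on.
    cache_dir = Path(__file__).parent / ".cache"
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", cache_dir=cache_dir)

    flat_ids, offsets = [], []
    for row in word_dict:
        words = [w for w in row[0].split(",") if w]
        ids_per_word = [tokenizer.encode(w, add_special_tokens=False) for w in words]
        flat_ids.append([t for ids in ids_per_word for t in ids])
        offsets.append(list(np.cumsum([len(ids) for ids in ids_per_word])))

    # Pad ids with 0 and offsets with -1 to a common width, giving the
    # (batch, 2, max_len) tensor shown in the output above.
    max_len = max([len(ids) for ids in flat_ids] + [1])
    result = np.zeros((len(word_dict), 2, max_len), dtype=np.int32)
    result[:, 1, :] = -1
    for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
        result[i, 0, :len(ids)] = ids
        result[i, 1, :len(offs)] = offs
    return result

With this in place, to_word_list_format(np.array([['<|endoftext|>']])) should yield the single id 50256 with an offset of 1 rather than the seven-token split shown at the top of the issue.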

Is there a solution to this? I am unable to prevent GPT-J from generating <|endoftext|> tokens right now.

152334H avatar Dec 30 '22 02:12 152334H

We had a workaround for this, but I no longer have access to the codebase where I was working on it.

I believe what we ended up doing was editing the to_word_list_format function to have a special case for the <|endoftext|> string, manually adding 50256 to the list of banned tokens if EOT is present in the list.

mitchellgordon95 avatar Dec 30 '22 02:12 mitchellgordon95
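
A rough sketch of that special case, reconstructed from the description above rather than from the original patch (tokenizer_encode stands in for whatever per-word encode call to_word_list_format already uses):

EOT_STRING = "<|endoftext|>"
EOT_ID = 50256

def encode_banned_word(tokenizer_encode, word):
    # The regex-based pre-tokenizer splits <|endoftext|> into <| / endoftext / |>,
    # so map the EOT string to its single id by hand instead of encoding it.
    if word == EOT_STRING:
        return [EOT_ID]
    return tokenizer_encode(word)

Calling such a helper for each banned word inside to_word_list_format puts 50256 into the bad_words tensor, which should let the backend actually ban the EOT token.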