Encoding a new dataset, confused about <|endoftext|> encoding
When encoding a new dataset and using <|endoftext|> as a delimiter, for example:
message <|endoftext|> message
The encode function in "src/encoder.py" transforms "<|endoftext|>" into [27, 91, 437, 1659, 5239, 91, 29] instead of [50256] (50256 is the index of <|endoftext|> in the vocabulary dict).
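For reference, this is roughly how I reproduce it (assuming the 117M model has been downloaded under models/117M, that src/ is on the Python path, and that get_encoder takes just the model name, as in src/encoder.py):

import encoder

enc = encoder.get_encoder('117M')
print(enc.encode('message <|endoftext|> message'))
# 50256 never shows up in the output; the delimiter is broken into plain BPE pieces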
So I went to check "src/encoder.py" and found that
import regex as re

# the tokenization pattern from src/encoder.py
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
text = "<|endoftext|> hi."
for token in re.findall(pat, text):
    print(token)
I get:
<| endoftext |> hi .
Why does it split <|endoftext|> into three parts (which I think leads to the wrong encoding of <|endoftext|>)? Shouldn't it rather be:
<|endoftext|> hi .
@AliceZhang2016, I don't know if you have solved this issue already; the enc.encode('some text goes here') function from openai/gpt-2 assumes the input is text content only and is not designed to detect special tokens.
I assume @nshepperd (thanks for releasing this repository! 😊) thought the encoder would be able to detect the <|endoftext|> token and assign it the value 50256.
I have made a modification inside the load_dataset() function in load_dataset.py to handle it appropriately here.
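Roughly, the change amounts to something like this (a sketch rather than the exact diff; split_encode is just an illustrative name):

def split_encode(enc, raw_text, delimiter='<|endoftext|>'):
    # encode the plain-text chunks separately, then join them with the
    # delimiter's single token id (50256 in encoder.json) instead of
    # letting the delimiter string go through BPE
    tokens = []
    for i, chunk in enumerate(raw_text.split(delimiter)):
        if i > 0:
            tokens.append(enc.encoder[delimiter])  # 50256
        tokens.extend(enc.encode(chunk))
    return tokens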
That being said, I don't think it makes too large a difference in the fine-tuning process. Running the fine-tuning training script, the network seems to learn to associate the sequence of <|endoftext|> broken down as text with the <|endoftext|> token itself.
I may be wrong here but let me know your thoughts! Thanks :)
@farrell236 Thanks for sharing your idea : )
Notice that on line 39 of load_dataset.py, the author directly appends <|endoftext|> to raw_text, which is then encoded with enc.encode() on line 35. I don't understand your assumption that the encoder is able to detect the <|endoftext|> token, because I can't find any detection code in encoder.py.
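In code, the flow those two lines describe is roughly this (a paraphrase, not the exact source; assuming enc is the Encoder loaded from encoder.py):

raw_text = 'first document'
raw_text += '<|endoftext|>'    # what line 39 does: the delimiter is appended as plain text
tokens = enc.encode(raw_text)  # what line 35 does: the whole string goes through BPE
print(tokens[-7:])             # [27, 91, 437, 1659, 5239, 91, 29], not [50256]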
I made similar modifications to yours, that is, encoding the plain text with enc.encode() and manually appending the encoding of <|endoftext|>. I also agree that it won't make a large difference in the fine-tuning process. But I think encoding <|endoftext|> as a whole token is more reasonable, just like what you did in your code.
@AliceZhang2016, I may have worded it badly. I do agree that enc.encode('block before <|endoftext|> block after') does not detect <|endoftext|> as a token, and instead breaks it down into chunks.
In[2]: enc.encode('<|endoftext|>')
Out[2]: [27, 91, 437, 1659, 5239, 91, 29]
which corresponds to the BPE tokens ['<', '|', 'end', 'of', 'text', '|', '>']
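For what it's worth, decoding each id on its own should show the same breakdown (enc.decode() takes a list of token ids):

In[3]: [enc.decode([i]) for i in [27, 91, 437, 1659, 5239, 91, 29]]
Out[3]: ['<', '|', 'end', 'of', 'text', '|', '>']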
The changes were to circumvent this :)
@farrell236 Then I totally agree with you. 😄