Encoding a new dataset, confused about <|endoftext|> encoding
When encoding a new dataset and using <|endoftext|> as a delimiter, for example:
message <|endoftext|> message
The encode function in "src/encoder.py" transforms "<|endoftext|>" into [27, 91, 437, 1659, 5239, 91, 29] instead of [50256] (50256 is the index of <|endoftext|> in the vocabulary dict).
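For reference, this is roughly how I reproduce it (assuming the 117M model has been downloaded under models/117M, that src/ is on the Python path, and that get_encoder takes just the model name, as in src/encoder.py):

import encoder

enc = encoder.get_encoder('117M')
print(enc.encode('message <|endoftext|> message'))
# 50256 never shows up in the output; the delimiter is broken into plain BPE pieces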
So I went to check "src/encoder.py" and found that
import regex as re

# the tokenization pattern from src/encoder.py
pat = re.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
text = "<|endoftext|> hi."
for token in re.findall(pat, text):
    print(token)
I get:
<| endoftext |> hi .
Why does it split <|endoftext|> into three parts (which I think leads to the wrong encoding of <|endoftext|>)? Shouldn't it rather be:
<|endoftext|> hi .
@AliceZhang2016, I don't know if you have solved this issue already; the enc.encode('some text goes here') function from openai/gpt-2 assumes the input is text content only and is not designed to detect special tokens.
I assume @nshepperd (thanks for releasing this repository! 😊) thought the encoder would be able to detect the <|endoftext|> token and assign it the value 50256.
I have made a modification inside the load_dataset() function in load_dataset.py to handle it appropriately here.
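Roughly, the change amounts to something like this (a sketch rather than the exact diff; split_encode is just an illustrative name):

def split_encode(enc, raw_text, delimiter='<|endoftext|>'):
    # encode the plain-text chunks separately, then join them with the
    # delimiter's single token id (50256 in encoder.json) instead of
    # letting the delimiter string go through BPE
    tokens = []
    for i, chunk in enumerate(raw_text.split(delimiter)):
        if i > 0:
            tokens.append(enc.encoder[delimiter])  # 50256
        tokens.extend(enc.encode(chunk))
    return tokens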
That being said, I don't think it makes too large a difference in the fine-tuning process. Running the fine-tuning training script, the network seems to learn to associate the sequence of <|endoftext|> broken down as text with the <|endoftext|> token itself.
I may be wrong here but let me know your thoughts! Thanks :)
@farrell236 Thanks for sharing your idea : )
Notice that on line 39 of load_dataset.py, the author directly appends <|endoftext|> to raw_text, which is then encoded with enc.encode() on line 35. I don't understand your assumption that the encoder is able to detect the <|endoftext|> token, because I can't find any detection code in encoder.py.
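In code, the flow those two lines describe is roughly this (a paraphrase, not the exact source; assuming enc is the Encoder loaded from encoder.py):

raw_text = 'first document'
raw_text += '<|endoftext|>'    # what line 39 does: the delimiter is appended as plain text
tokens = enc.encode(raw_text)  # what line 35 does: the whole string goes through BPE
print(tokens[-7:])             # [27, 91, 437, 1659, 5239, 91, 29], not [50256]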
I made similar modifications to yours, that is, encoding the plain text with enc.encode() and manually appending the encoding of <|endoftext|>. I also agree that it won't make a large difference in the fine-tuning process. But I think encoding <|endoftext|> as a whole token is more reasonable, just like what you did in your code.
@AliceZhang2016, I may have worded it badly. I do agree that enc.encode('block before <|endoftext|> block after') does not detect <|endoftext|> as a token, and instead breaks it down into chunks.
In[2]: enc.encode('<|endoftext|>')
Out[2]: [27, 91, 437, 1659, 5239, 91, 29]
which corresponds to the BPE tokens ['<', '|', 'end', 'of', 'text', '|', '>']
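For what it's worth, decoding each id on its own should show the same breakdown (enc.decode() takes a list of token ids):

In[3]: [enc.decode([i]) for i in [27, 91, 437, 1659, 5239, 91, 29]]
Out[3]: ['<', '|', 'end', 'of', 'text', '|', '>']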
The changes were to circumvent this :)
@farrell236 Then I totally agree with you. 😄