
enc.encoder["<|endoftext|>"] is wrong and nobody realizes it.

Open shawwn opened this issue 5 years ago • 8 comments

Relevant tweet chain:

https://twitter.com/theshawwn/status/1208169319223480322

https://twitter.com/theshawwn/status/1208171700057186304

Basically, you're prompting the model with <|endoftext|> (a single token, BPE index 50256), but the BPE encoder encodes the literal string <|endoftext|> as <| end of text |>, five separate tokens. It's completely different.

shawwn avatar Dec 21 '19 00:12 shawwn

My main question is, were the OpenAI models trained with <|endoftext|> (a single token separating each document), or <| end of text |>, which is how the BPE encoder generates it?

shawwn avatar Dec 21 '19 00:12 shawwn

Update: It turns out the answer is that OpenAI trained their models by separating texts using the single-token <|endoftext|>, whereas most fine-tuning code is based on nshepperd's repo (https://github.com/nshepperd/gpt-2) which usually just uses the BPE encoder, and the BPE encoder generates <| end of text |> as 5 tokens.
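One way to patch fine-tuning code for this is to pre-split the training text on the literal string "<|endoftext|>" and splice in the single token 50256 at each boundary, encoding only the surrounding text with the BPE encoder. This is a minimal sketch, not code from nshepperd's repo; `encode_with_eot` is a hypothetical helper, and `enc` is assumed to expose an `.encode(str) -> list[int]` method like the `Encoder` class in OpenAI's encoder.py.

```python
# Sketch: encode text so each literal "<|endoftext|>" becomes the single
# token 50256, instead of letting the BPE encoder split it into several
# ordinary tokens.

EOT_TEXT = "<|endoftext|>"
EOT_TOKEN = 50256  # enc.encoder["<|endoftext|>"] in the released GPT-2 vocab


def encode_with_eot(enc, text):
    """Encode `text`, mapping each literal <|endoftext|> to token 50256."""
    tokens = []
    for i, chunk in enumerate(text.split(EOT_TEXT)):
        if i > 0:
            tokens.append(EOT_TOKEN)  # one token per document boundary
        if chunk:
            tokens.extend(enc.encode(chunk))
    return tokens
```

With this, each document boundary in the training data matches what the pretrained model actually saw during training.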

shawwn avatar Dec 21 '19 03:12 shawwn

Thanks! Any suggestions on how to hack/patch the encoder to properly deal with this? Or if fine-tuning is sufficient, could we just use END or something as a separator? Would that even be a single token? Or is the use of multiple tokens even really that bad? It does work to use this token as a truncate stop, so at least generate is returning it properly...

inspire22 avatar Mar 05 '20 18:03 inspire22

Whoa I think I just ran into this issue. Would really appreciate any help!!

maxiedaniels avatar Mar 17 '20 07:03 maxiedaniels

So how does one stop the <|endoftext|> token from being randomly generated after just one sentence? Surely this can't be intended, as it makes the "length" variable meaningless: truncating the text at the first <|endoftext|> returns randomized and mostly too-short lengths.

ErikUden avatar Feb 06 '21 20:02 ErikUden

I wrote a patch such that if the output contains "<|endoftext|>" I just rerun the whole batch. Reason being that when <|endoftext|> shows up, everything following has no relation (usually) to what the input prompt was.

For my conversational robots, I have it truncate everything before <|endoftext|>, and state "I feel I should have something more to say here, but I'm not sure how to proceed." Conversationally, it works most of the time. Still an issue though, but not for 99% of the people interacting with my robots, so... :)

This is less of an issue with the 1558M model, I've found. What model are you using? I got this a LOT on the 345M model.

DaveXanatos avatar Feb 06 '21 23:02 DaveXanatos
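The truncate-and-fall-back workaround described above can be sketched as follows. This is an illustrative interpretation, not DaveXanatos's actual code; `truncate_at_eot` and `FALLBACK` are hypothetical names.

```python
EOT_TEXT = "<|endoftext|>"
FALLBACK = ("I feel I should have something more to say here, "
            "but I'm not sure how to proceed.")


def truncate_at_eot(generated):
    """Keep only the text before the first <|endoftext|>.

    If the model emitted <|endoftext|> before producing any usable text,
    return a canned conversational reply instead.
    """
    head, sep, _tail = generated.partition(EOT_TEXT)
    if sep and not head.strip():
        return FALLBACK
    return head
```

Everything after the first <|endoftext|> is dropped, since it usually has no relation to the input prompt.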

Yes, I've been using the 355M parameter model. I got this issue with every generated text!

ErikUden avatar Feb 07 '21 10:02 ErikUden

I got my 345M model to a pretty good spot with the following parameters:

def interact_model(
    model_name='345M', 
    seed=None,
    nsamples=1,
    batch_size=1,
    length=140,
    temperature=1.2,
    top_k=48,
    top_p=0.7,
    models_dir='models',
):

I still get those EOT things occasionally, but usually only one out of 7 or 8 prompts.

DaveXanatos avatar Feb 07 '21 16:02 DaveXanatos