
GPT-3 support (?)

Open minimaxir opened this issue 5 years ago • 5 comments

https://github.com/openai/gpt-3

Still very unclear how OpenAI is treating this, but if they do release the small model, then I'll def add support.

Implementation depends on:

  • base Huggingface integration
  • new training loop (might be the same as the old training loop)
  • Reduce GPT-2-specific hardcoding where possible (see the sketch after this list)
  • Casting from FP16 to FP32 (or more native FP16 support)
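A rough idea of what the Hugging Face integration and reduced hardcoding could look like, as a minimal sketch rather than the actual aitextgen implementation; the "gpt2" checkpoint name is a stand-in, since no GPT-3 weights are available:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Auto-classes resolve the architecture from the checkpoint's config,
# instead of hardcoding GPT2LMHeadModel / GPT2Tokenizer everywhere.
model_name = "gpt2"  # stand-in until other checkpoints (e.g. GPT-3) exist
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if torch.cuda.is_available():
    # Casting to FP16 halves memory for inference; for training, mixed
    # precision with loss scaling is safer than a blanket .half() cast.
    model = model.half().to("cuda")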

minimaxir avatar May 29 '20 02:05 minimaxir

Any chance of implementing the following for GPT-3? Or was it done somewhere else to enhance GPT-2?

Appendix B.

.... During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking. ....

leejason avatar May 29 '20 02:05 leejason

That's done implicitly when passing in multiple texts or a line-by-line file (it was also done the same way in gpt-2-simple).
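For concreteness, the packing described in Appendix B looks roughly like the sketch below; this is a simplified illustration, not the actual aitextgen or gpt-2-simple code, and the documents list and context_length value are placeholders:

from transformers import GPT2TokenizerFast

# Placeholder documents; in practice these come from your texts or line-by-line file.
documents = ["First short document.", "Second short document."]

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
eot_id = tokenizer.eos_token_id  # id of "<|endoftext|>"

# Concatenate everything into one token stream, delimiting documents with the
# end-of-text token instead of applying any sequence-specific masking.
token_stream = []
for doc in documents:
    token_stream.extend(tokenizer.encode(doc))
    token_stream.append(eot_id)

# Slice the stream into fixed-length training sequences.
context_length = 1024  # GPT-2's window; GPT-3 uses 2048
sequences = [
    token_stream[i : i + context_length]
    for i in range(0, len(token_stream), context_length)
]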

minimaxir avatar May 29 '20 03:05 minimaxir

Thanks for the update. Does "done implicitly" mean the following?

raw_text += start_token + row[0] + end_token + "\n"

If so, how would the attention mechanism take start_token and end_token into consideration and bypass tokens before start_token or after end_token? I searched for "end_token" throughout the gpt-2-simple source code but could not find a clue. My understanding is probably on the wrong track. Could you shed some light?

leejason avatar May 29 '20 04:05 leejason

Wouldn't it be too demanding on hardware? We still can't run even the 1.5B model properly on most machines.

fen0s avatar May 29 '20 10:05 fen0s

If so, how would the attention mechanism take start_token and end_token into consideration and bypass tokens before start_token or after end_token? I searched for "end_token" throughout the gpt-2-simple source code but could not find a clue. My understanding is probably on the wrong track. Could you shed some light?

The network learns how to deal with that (eventually).
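In other words, there is no sequence-specific attention mask around the end-of-text token: packed sequences use the ordinary all-ones attention mask plus the usual causal mask, and the model learns from data that text on either side of <|endoftext|> is unrelated. A minimal sketch using the Hugging Face GPT-2 classes (an illustration, not the exact aitextgen code):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Two unrelated "documents" packed into one sequence, delimited only by <|endoftext|>.
packed = "The cat sat on the mat.<|endoftext|>Quarterly revenue rose sharply."
inputs = tokenizer(packed, return_tensors="pt")

# The attention mask is all ones; nothing special marks the document boundary.
print(inputs["attention_mask"])

# Standard language-modeling loss over the packed sequence, no extra masking.
outputs = model(**inputs, labels=inputs["input_ids"])
print(float(outputs.loss))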

Wouldn't it be too demanding on hardware? We still can't run even the 1.5B model properly on most machines.

There are smaller GPT-3 models, roughly the same size as the current GPT-2 models, which I expect OpenAI would release first.

There may be issues finetuning those, since the context window is doubled and the training cost of attention scales roughly quadratically with context window size, but we will see.
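Back-of-the-envelope arithmetic for that point, assuming plain self-attention whose per-sequence compute and memory grow roughly with the square of the context length:

# Doubling the context window from GPT-2's 1024 tokens to GPT-3's 2048 tokens
# roughly quadruples the per-sequence attention cost.
gpt2_ctx, gpt3_ctx = 1024, 2048
print((gpt3_ctx ** 2) / (gpt2_ctx ** 2))  # 4.0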

minimaxir avatar May 31 '20 15:05 minimaxir