aitextgen
GPT-3 support (?)
https://github.com/openai/gpt-3
It's still very unclear how OpenAI is treating this, but if they do release the small model, then I'll definitely add support.
Implementation depends on:
- base Huggingface integration
- new training loop (might be the same as the old training loop)
- Reducing GPT-2 hardcoding where possible
- Casting from FP16 to FP32 (or more native FP16 support); see the mixed-precision sketch below this list
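For the last item, here is a minimal sketch of what more native FP16 support could look like using PyTorch's automatic mixed precision (`torch.cuda.amp`). This is not aitextgen's implementation; `model`, `optimizer`, and `loader` are placeholders, and the forward pass is assumed to return a Huggingface-style output with a `.loss` attribute.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()  # rescales the loss so FP16 gradients do not underflow

for batch in loader:  # `loader`, `model`, `optimizer` are assumed placeholders
    optimizer.zero_grad()
    with autocast():              # runs ops in FP16 where safe, FP32 elsewhere
        loss = model(**batch).loss
    scaler.scale(loss).backward() # backward on the scaled loss
    scaler.step(optimizer)        # unscales gradients, then steps in FP32
    scaler.update()
```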
Any chance of implementing the following in GPT-3 support? Or was it done somewhere else to enhance GPT-2?
> [...] During training we always train on sequences of the full n_ctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking. [...]
That's done implicitly when passing in multiple texts or a line-by-line file (it was also done the same way in gpt-2-simple).
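As a rough, hypothetical sketch of the packing idea quoted above (not aitextgen's actual implementation), using the Huggingface GPT-2 tokenizer; `pack_documents` and its `block_size` parameter are made-up names for illustration:

```python
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
eos_id = tokenizer.eos_token_id  # id of <|endoftext|>

def pack_documents(docs, block_size=1024):
    """Concatenate tokenized docs, delimited only by <|endoftext|>,
    then slice the token stream into fixed-length training chunks."""
    stream = []
    for doc in docs:
        stream.extend(tokenizer.encode(doc))
        stream.append(eos_id)  # boundary marker; no special attention masking
    # full chunks only; any leftover tail is dropped
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]

chunks = pack_documents(["first short document", "second short document"], block_size=8)
```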
Thanks for the update. Does "done implicitly" mean the following?
`raw_text += start_token + row[0] + end_token + "\n"`
If so, how would the attention mechanism take start_token and end_token into consideration and bypass tokens before start_token or after end_token? I searched for "end_token" in the gpt-2-simple source code but could not find a clue. My understanding is probably on the wrong track. Could you shed some light?
Wouldn't it be too demanding on hardware? We still can't run even the 1.5B model properly on most machines.
> If so, how would the attention mechanism take start_token and end_token into consideration and bypass tokens before start_token or after end_token?
The network learns how to deal with that (eventually).
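To illustrate the point (a hedged example, assuming `transformers` and `torch` are installed): the end-of-text delimiter is just an ordinary token in the input, and nothing is masked or bypassed around it.

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
enc = tok("first document<|endoftext|>second document", return_tensors="pt")

print(enc["input_ids"])       # <|endoftext|> appears as an ordinary id (50256) mid-sequence
print(enc["attention_mask"])  # all ones: no token is bypassed or specially masked
```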
> Wouldn't it be too demanding on hardware? We still can't run even the 1.5B model properly on most machines.
There are smaller submodels, roughly the same size as the current GPT-2 models, which I expect OpenAI will release first.
There may be issues fine-tuning those, since the context window is doubled and attention cost grows quadratically with context window size, but we will see.
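Back-of-the-envelope arithmetic for that scaling claim, assuming the quadratic self-attention cost dominates:

```python
gpt2_ctx, gpt3_ctx = 1024, 2048
# attention compute/memory per sequence grows roughly with the square of the context length
print((gpt3_ctx / gpt2_ctx) ** 2)  # 4.0 -> doubling the context ~4x the attention cost
```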