
Model release

Open JulianSlzr opened this issue 5 years ago • 16 comments

Great work by the OpenAI team! The paper does not discuss it, so I'll be the first to ask:

What's the release plan for the model definition & weights? Will it be tiered by size, like GPT-2?

JulianSlzr avatar May 29 '20 01:05 JulianSlzr

Yep! Please respond!

Devetec avatar May 29 '20 01:05 Devetec

...I'm not sure if it's even possible for the 175B model to be distributed in a reasonable manner.

The size of the 1.5B GPT-2 model was about 6GB on disk, which would imply that the 175B model is at least 700GB!
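For reference, a rough sketch of that arithmetic in Python, assuming 4 bytes per parameter (FP32 weights) and ignoring any checkpoint/serialization overhead:

```python
# Back-of-the-envelope weight sizes: parameters x bytes per parameter.
def model_size_gb(n_params: float, bytes_per_param: int = 4) -> float:
    return n_params * bytes_per_param / 1e9

print(model_size_gb(1.5e9))   # GPT-2 1.5B  -> ~6 GB
print(model_size_gb(175e9))   # GPT-3 175B  -> ~700 GB
```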

minimaxir avatar May 29 '20 02:05 minimaxir

I think it’s safe to say I won’t be replicating this one anytime soon

vanyacohen avatar May 29 '20 02:05 vanyacohen

...I'm not sure if it's even possible for the 175B model to be distributed in a reasonable manner.

Sure it is. Artifacts larger than 700GB are distributed all the time. I distribute Danbooru2019 via BitTorrent & rsync and that's like 3300GB! I would not advise distributing GPT-3 via GCP/AWS buckets, to say the least, but it would be easy and cheap ($30/month) to use a dedicated server to seed a GPT-3 torrent, for example.

gwern avatar May 29 '20 02:05 gwern

Not to detract from the difficulties of distributing the model, but the paper notes that training is performed in half precision, which would put the weights at around 350GB on disk.
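The same back-of-the-envelope estimate at half precision (2 bytes per parameter):

```python
# 175B parameters at 2 bytes each (FP16), ignoring serialization overhead.
params = 175e9
print(f"{params * 2 / 1e9:.0f} GB")   # -> ~350 GB
```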

parasj avatar May 29 '20 04:05 parasj

We need distilGPT-3!

Grandiferr avatar May 29 '20 06:05 Grandiferr

By comparison, the Megatron-11B model, trained by Facebook AI in fairseq, is provided as a 19GB tar.gz file hosted on their servers:

https://dl.fbaipublicfiles.com/fairseq/models/model_parallel/megatron_11b.tar.gz

loretoparisi avatar May 29 '20 06:05 loretoparisi

Dang it. It's finally here.

theneuronprogrammer avatar May 29 '20 08:05 theneuronprogrammer

We need distilGPT-3!

maybe we need evaporation-GPT-3

nlp4whp avatar May 29 '20 14:05 nlp4whp

Most of us can hardly dream of using the full model. You'd need to partition it across more than (350 GB) / (16 GB) ~ 22 GPUs just to run it! Training with the Adam optimizer (as they mention) would require at least 3 times as many (~66 GPUs), plus extra space for the activations. There are more memory-efficient optimizers though.

But there are 8 models in the paper, 4 of which are smaller than GPT-2, so some of those will probably be useful if OpenAI chooses to release them. 🙂

[image: table of model sizes from the paper]
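A rough sketch of the GPU-count arithmetic above, assuming 16 GB cards, FP16 weights, and Adam roughly tripling the weight memory with its two moment buffers:

```python
import math

params = 175e9
bytes_per_param = 2      # FP16 weights
gpu_mem_gb = 16

weights_gb = params * bytes_per_param / 1e9   # ~350 GB
training_gb = 3 * weights_gb                  # weights + Adam's first and second moments

print(math.ceil(weights_gb / gpu_mem_gb))     # ~22 GPUs just to hold the weights
print(math.ceil(training_gb / gpu_mem_gb))    # ~66 GPUs, before counting activations
```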

AdamDanielKing avatar May 29 '20 14:05 AdamDanielKing

The FP16 point is good; that would mean the smaller models noted above would be even smaller than usual, which is good for everyone!

That may limit the supported hardware unless a way to cast the weights up to FP32 is added (likely something PyTorch can handle).
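For what it's worth, the upcast itself is straightforward in PyTorch; a minimal sketch, assuming the released weights load as an ordinary state dict (the filename here is hypothetical):

```python
import torch

# Hypothetical checkpoint path, for illustration only.
state_dict = torch.load("gpt3_fp16.pt", map_location="cpu")

# Cast floating-point tensors up to FP32 so hardware without fast FP16 can run them.
state_dict = {k: (v.float() if v.is_floating_point() else v)
              for k, v in state_dict.items()}
```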

minimaxir avatar May 29 '20 16:05 minimaxir

Fine-tuning for normal people is out of the question due to model size. Shouldn't inference still be possible if weights are loaded and applied incrementally? Especially if system rather than GPU memory is used for intermediate computations.

poset avatar May 29 '20 19:05 poset

Big gap between 13B and 175B; there are probably some sweet spots in there for a few folks if something could be made available.

fredbuhl avatar May 29 '20 20:05 fredbuhl

Fine-tuning for normal people is out of the question due to model size. Shouldn't inference still be possible if weights are loaded and applied incrementally? Especially if system rather than GPU memory is used for intermediate computations.

Technically you could do that, but it would be impractically slow. You'd still need at least 350 GB of RAM (some cloud instances have this) or you'd be waiting for disk -> RAM transfers of 350 GB for each token generated. For a 600 MB/s SSD that would take 10 minutes and cap the output speed at 6 tokens per hour.

With at least 350 GB of RAM the bottleneck would be RAM -> GPU transfers. If the speed is 2.3 GB/s that would take 2.5 minutes. So that caps the possible inference speed at 24 tokens per hour, or somewhere around 50 characters.

Edit: It might be faster to run fully on CPUs using >350 GB RAM than to transfer to the GPU for every token.
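The same throughput arithmetic, written out (a rough model that ignores any overlap of transfer and compute):

```python
WEIGHTS_GB = 350  # FP16 weights for the 175B model

def tokens_per_hour(bandwidth_gb_per_s: float) -> float:
    # Assume the full set of weights must be streamed once per generated token.
    seconds_per_token = WEIGHTS_GB / bandwidth_gb_per_s
    return 3600 / seconds_per_token

print(tokens_per_hour(0.6))   # SSD -> RAM at 600 MB/s: ~6 tokens/hour
print(tokens_per_hour(2.3))   # RAM -> GPU at 2.3 GB/s:  ~24 tokens/hour
```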

AdamDanielKing avatar May 29 '20 20:05 AdamDanielKing

...I'm not sure if it's even possible for the 175B model to be distributed in a reasonable manner.

The size of the 1.5B GPT-2 model was about 6GB on disk, which would imply that the 175B model is at least 700GB!

Still lower than recent Call of Duty games, so there's that.

ugurkanates avatar May 30 '20 10:05 ugurkanates

Gosh, I would really like to see something put together here to give people more access to this and tool around with it like GPT-2.

If OpenAI released a cloud platform, I would gladly pay to play, even though I have disagreed with the devs in the past about the GPT release format. I think a hosted container system for language models could be the key to OpenAI making money it can put back into research while also being fair to developers.

I really don’t think there is any danger in language models

4R7I5T avatar Jun 02 '20 03:06 4R7I5T