
13B Model Out of Memory with Single Node 8 A100 GPUs

Open benathi opened this issue 3 years ago • 13 comments

Hi!

Thanks for your contribution and for making this repo available :)

I tried to train the 13B model with micro batch size 1 and model parallelism degree 8, but I'm unable to get it to work (I always get OOM). The library advertises being able to scale up to 100B parameters. What is required for this? I also tried DeepSpeed stage 3 with offload, without pipeline parallelism, but that doesn't seem to work either. Please let me know what I'm missing. Thanks!

benathi avatar Sep 16 '21 20:09 benathi

Can you post the exact config file you are using?

StellaAthena avatar Sep 16 '21 21:09 StellaAthena

Can you also provide details of your hardware?

EricHallahan avatar Sep 16 '21 22:09 EricHallahan

I’m using an AWS p4 node with 8 A100 GPUs :)

benathi avatar Sep 16 '21 22:09 benathi

I adapted the provided config 13B.yml, changing the model parallelism degree to 8 and setting the micro batch size to 1.

# GPT-2 pretraining setup
{
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe-parallel-size": 1,
   "model-parallel-size": 8,

   # model settings
   "num-layers": 40,
   "hidden-size": 5120,
   "num-attention-heads": 40,
   "seq-length": 2048,
   "max-position-embeddings": 2048,
   "norm": "layernorm",
   "pos-emb": "rotary",
   "no-weight-tying": true,

   # these should provide some speedup but takes a while to build, set to true if desired
   "scaled-upper-triang-masked-softmax-fusion": false,
   "bias-gelu-fusion": false,

   # optimizer settings
   "optimizer": {
     "type": "Adam",
     "params": {
       "lr": 0.0001,
       "betas": [0.9, 0.999],
       "eps": 1.0e-8,
     }
   },
   "zero_optimization": {
    "stage": 1,
    "allgather_partitions": True,
    "allgather_bucket_size": 500000000,
    "overlap_comm": True,
    "reduce_scatter": True,
    "reduce_bucket_size": 500000000,
    "contiguous_gradients": True,
    "cpu_offload": False
  },

   # batch / data settings
   "train_micro_batch_size_per_gpu": 1,
   "data-impl": "mmap",
   "split": "949,50,1",

   # activation checkpointing
   "checkpoint-activations": true,
   "checkpoint-num-layers": 1,
   "partition-activations": true,
   "synchronize-each-layer": true,

   # regularization
   "gradient_clipping": 1.0,
   "weight-decay": 0,
   "hidden-dropout": 0,
   "attention-dropout": 0,

   # precision settings
   "fp16": { 
     "fp16": true,
     "enabled": true,
     "loss_scale": 0,
     "loss_scale_window": 1000,
     "hysteresis": 2,
     "min_loss_scale": 1
   },

   # misc. training settings
   "train-iters": 320000,
   "lr-decay-iters": 320000,
   "distributed-backend": "nccl",
   "lr-decay-style": "cosine",
   "warmup": 0.01,
   "save-interval": 10000,
   "eval-interval": 1000,
   "eval-iters": 10,

   # logging
   "log-interval": 100,
   "steps_per_print": 10,
   "keep-last-n-checkpoints": 4,
   "wall_clock_breakdown": true,
}

benathi avatar Sep 16 '21 23:09 benathi

The above config should run. Try setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to true; this also saves memory.

Also note that the above config has a train_micro_batch_size_per_gpu of 1. On 8 GPUs that results in a data-parallel degree of 1 (8 / pipe-parallel-size / model-parallel-size) and hence a global batch size of 1. I suggest finding a good combination of micro batch size and gradient-accumulation steps to get a decent batch size. See here for the calculation.
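
To make that concrete, here is a minimal sketch of the calculation (the values below are the ones from this config plus an illustrative gradient-accumulation setting, so plug in your own numbers):

num_gpus = 8
pipe_parallel_size = 1
model_parallel_size = 8
micro_batch_per_gpu = 1    # train_micro_batch_size_per_gpu in the config
grad_accum_steps = 1       # raise this to grow the batch without using more memory

# Each model replica spans pipe_parallel_size * model_parallel_size GPUs,
# so whatever is left over becomes the data-parallel degree.
data_parallel = num_gpus // (pipe_parallel_size * model_parallel_size)   # -> 1

# Effective (global) batch size per optimizer step.
global_batch = micro_batch_per_gpu * grad_accum_steps * data_parallel    # -> 1

print(data_parallel, global_batch)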

Having said that, a 13B model will take a long time to train on only 8 GPUs.

sweinbach avatar Sep 17 '21 06:09 sweinbach

Thank you! I’ll try that and let you know. Does the original 13B.yml config also run on a single node with 8 GPUs? What hardware setup was it tested on? Thanks again for the fast reply :)

benathi avatar Sep 17 '21 14:09 benathi

I have not tested the 13B config on a single node with 8 A100s. It is also somewhat tricky to balance: the relatively large embedding and lm-head layers take a lot of memory.
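
As a rough back-of-the-envelope sketch (assuming the common estimate of ~16 bytes per parameter for fp16 weights, fp16 gradients, and fp32 Adam states, and the 40 GB A100s in a p4d node; activations come on top of this):

params = 13e9
bytes_per_param = 16   # 2 (fp16 weight) + 2 (fp16 grad) + 4 (fp32 master) + 4 + 4 (Adam moments)

state_gb = params * bytes_per_param / 1e9   # ~208 GB of weights + grads + optimizer state
per_gpu_gb = state_gb / 8                   # ~26 GB per GPU *if* it partitioned perfectly

print(round(state_gb), round(per_gpu_gb))
# On 40 GB A100s that leaves limited headroom for activations, and the large
# embedding / lm-head layers make an even split across GPUs hard to achieve.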

sweinbach avatar Sep 17 '21 16:09 sweinbach

I tried setting scaled-upper-triang-masked-softmax-fusion and bias-gelu-fusion to true, but it doesn't seem to work, either with the provided 13B config or with my modified config with the batch size lowered to 1.

What hardware setup was the provided 13B config tested on, by the way? If it's not too many nodes I can try to replicate it. Does the memory reduction come mostly from ZeRO stage 1, since it splits the optimizer states across nodes?

Thanks :)

benathi avatar Sep 17 '21 21:09 benathi

I don't know the smallest hardware people have tried it on. On 8 GPUs I would estimate a training time of ~2 years, though (~1 year on 16 GPUs). That does not seem feasible.
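
For context, an estimate of that order can be sketched from the standard FLOPs ~ 6 * parameters * tokens approximation; the token budget and sustained throughput below are assumptions, not measurements:

params = 13e9
tokens = 300e9                     # assumed GPT-3-style token budget, not from this thread
total_flops = 6 * params * tokens  # ~2.3e22 FLOPs

achieved_per_gpu = 45e12           # assumed ~45 TFLOP/s sustained per A100 after overheads
for n_gpus in (8, 16):
    days = total_flops / (n_gpus * achieved_per_gpu) / 86400
    print(n_gpus, "GPUs:", round(days), "days")
# -> roughly 2 years on 8 GPUs and about 1 year on 16, under these assumptions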

sweinbach avatar Sep 21 '21 05:09 sweinbach

Hi!

I would like to ask if you managed to get it to work eventually. Thanks.

seeEssex avatar Apr 14 '22 04:04 seeEssex

@seeEssex Why do you want to do this? It would take years to train the model even if it were to be made to fit.

StellaAthena avatar Apr 14 '22 04:04 StellaAthena

@StellaAthena I was trying to fit the model for finetuning, as opposed to training the whole thing.

Would that still take a very significant amount of time? Thanks

seeEssex avatar Apr 14 '22 05:04 seeEssex

@seeEssex there does not currently exist a public 13B model to finetune. The only model we have released so far that is larger than GPT-J is a 20B parameter model. I do know someone who is finetuning it, and can inquire about their hardware and performance.

StellaAthena avatar Apr 14 '22 13:04 StellaAthena

What hardware is the person fine-tuning the 20B parameter model above using, and what speed are they getting?

I am interested in using GPT-NeoX-20B for fine-tuning. Would 2 servers with 8 A100 GPUs each be sufficient? From your repo, the model weights and optimizer states total 268 GB. My intuition is that since 2 servers with 8 A100 GPUs have a total of 1280 GB of GPU memory, it should be more than enough. However, given the relatively large embedding and lm-head layers, I wonder if it will be sufficient?
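
As a rough sanity check of that arithmetic (using the 268 GB figure quoted above and the 80 GB per A100 implied by the 1280 GB total; activations and partitioning inefficiency are not included):

state_gb = 268        # weights + optimizer states figure quoted above
num_gpus = 16         # 2 nodes x 8 A100s
gpu_mem_gb = 80       # implied by the 1280 GB total above

total_mem_gb = num_gpus * gpu_mem_gb      # 1280 GB
per_gpu_state = state_gb / num_gpus       # ~17 GB per GPU if partitioned evenly
headroom = gpu_mem_gb - per_gpu_state     # ~63 GB per GPU left over

print(total_mem_gb, round(per_gpu_state), round(headroom))
# The headroom still has to cover activations, the replicated / unevenly balanced
# embedding and lm-head layers, temporary buffers, and framework overhead.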

jennyzzt avatar Oct 25 '22 00:10 jennyzzt