
Big GPU footprint while training

Open johmathe opened this issue 3 years ago • 8 comments

Describe the bug: Not really a bug per se, more of a question/clarification request. If there are better avenues (a Discord server?) for discussing these issues, I apologize for using the wrong channel.

To Reproduce: Feel free to run this test: https://github.com/openclimatefix/skillful_nowcasting/blob/main/tests/test_model.py#L305

Training takes about 40 GB of GPU memory with a batch size of 1. I had to upgrade to an A100 to be able to run it decently.
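For reference, roughly how I checked the peak usage (a sketch only; the `DGMR` import path, default config, and input shape are assumptions about this repo's API and may not match exactly):

```python
import torch
from dgmr import DGMR  # model class from this repo; import path is an assumption

torch.cuda.reset_peak_memory_stats()

model = DGMR().cuda()  # defaults roughly matching the linked test
x = torch.randn(1, 4, 1, 256, 256, device="cuda")  # (batch, context frames, channels, H, W) -- assumed shape
out = model(x)        # generator forward pass
out.sum().backward()  # include gradient/activation memory in the peak

print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```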

Just curious if this is expected behavior or if you recommend another approach.

johmathe · Feb 07 '22 08:02

No, this is a great place to talk about it! That test does use a lot of GPU memory, and I think that is just expected: the model is almost a one-to-one copy of the pseudocode and training code DeepMind released, and they trained it on 16 TPUs. I would try running it with reduced parameters or a smaller input size; unfortunately it's just a large model. That is also why I skip that test in the CI actions.
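For example, something along these lines as a starting point (a sketch only; the keyword names are my recollection of the `DGMR` constructor and may not match the current release, and the smaller values are arbitrary):

```python
from dgmr import DGMR  # import path is an assumption

# Smaller-than-paper configuration as a starting point for a single GPU.
model = DGMR(
    forecast_steps=6,      # instead of 18 future frames
    output_shape=128,      # train on 128x128 crops instead of 256x256
    latent_channels=384,   # half the default latent width
    context_channels=192,  # half the default context width
    generation_steps=2,    # fewer samples per input in the generator step
)
```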

jacobbieker · Feb 07 '22 09:02

OK, this is what I thought.

I might play around with multi-GPU or TPU training with XLA to see if I can crank the batch size up. I am also curious whether switching to 16-bit (half-precision) training could help; the input data is 16-bit anyway.
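Concretely, since the model is a LightningModule, I was thinking of something like this (a sketch; `model` and `datamodule` are placeholders, and the Trainer arguments follow the PyTorch Lightning 1.x API):

```python
import pytorch_lightning as pl

# Sketch of mixed-precision, multi-device training; `model` and `datamodule`
# are placeholders for the DGMR LightningModule and a radar DataModule.
trainer = pl.Trainer(
    gpus=2,                     # or tpu_cores=8 for TPU/XLA training
    precision=16,               # native AMP: fp16 activations and gradients
    accumulate_grad_batches=4,  # emulate a larger effective batch size
)
trainer.fit(model, datamodule)
```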

If you're interested, I also have some code to feed the TFRecords from the original dataset into the PyTorch DataLoader (keeping the TF parallelism); happy to send a PR. I'm trying to reproduce the results from the paper.
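The bridge looks roughly like this (a sketch only; the feature key and tensor encoding are placeholders, not the actual schema of the DeepMind TFRecords):

```python
import tensorflow as tf
import torch
from torch.utils.data import DataLoader, IterableDataset


class TFRecordRadarDataset(IterableDataset):
    """Streams examples through tf.data and hands them to PyTorch.
    The feature key and tensor encoding below are placeholders, not the
    real schema of the DeepMind dataset."""

    def __init__(self, file_pattern: str):
        self.file_pattern = file_pattern

    def _parse(self, example_proto):
        features = {"radar_frames": tf.io.FixedLenFeature([], tf.string)}  # placeholder key
        parsed = tf.io.parse_single_example(example_proto, features)
        return tf.io.parse_tensor(parsed["radar_frames"], out_type=tf.float32)

    def __iter__(self):
        ds = tf.data.TFRecordDataset(tf.io.gfile.glob(self.file_pattern))
        ds = ds.map(self._parse, num_parallel_calls=tf.data.AUTOTUNE)  # keep tf's parallelism
        ds = ds.prefetch(tf.data.AUTOTUNE)
        for frames in ds.as_numpy_iterator():
            yield torch.from_numpy(frames)


# loader = DataLoader(TFRecordRadarDataset("path/to/*.tfrecord"), batch_size=1)
```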

johmathe · Feb 07 '22 10:02

Yeah, a PR would be great! I'm currently working on mirroring the dataset on HuggingFace to make it easier for anyone else to reproduce the paper, but for now that code requires TF.

jacobbieker · Feb 07 '22 10:02

The TF/Sonnet implementation I have trains decently when distributed across 8 TPU v2 cores with parameters matching the paper (except a global batch size of 8 per step, and one sample per input during the generator step). I assume DeepMind trained theirs on v3 cores. The full model also seems to fit in RAM on a GCP n1-highmem-64 (416 GB), albeit too slowly to be useful.

l4fl4m3 · Mar 23 '22 04:03

Thanks for the insights!

johmathe · Apr 22 '22 21:04

Thanks for your questions. I have run this code on Tesla V100s (32 GB), but unfortunately it raises a CUDA out-of-memory error. Do you have any suggestions or configurations for training this model with 32 GB of GPU RAM? @johmathe @jacobbieker

Best regards.

GreenLimeSia · Jul 08 '22 14:07

I'm working on adding a training script that uses DeepSpeed, which should help reduce the GPU memory requirements, albeit with somewhat reduced training speed. Other than that, my only other suggestion is to use a smaller model or half-precision training.
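Until that lands, something like the following may already help on a 32 GB card (a sketch; `model` and `datamodule` are placeholders, and the strategy string depends on the PyTorch Lightning and DeepSpeed versions installed):

```python
import pytorch_lightning as pl

# Sketch: ZeRO stage 2 with CPU offload plus fp16 via Lightning's DeepSpeed
# integration. `model` and `datamodule` are placeholders for the repo's
# LightningModule and DataModule.
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    strategy="deepspeed_stage_2_offload",  # shard optimizer state and offload it to CPU
)
trainer.fit(model, datamodule)
```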

jacobbieker · Jul 08 '22 14:07

Thanks for your quick reply. I will try that and look forward to your training script. @jacobbieker Best regards.

GreenLimeSia · Jul 09 '22 09:07