Minimum GPU memory size for training RQ-Transformer

Baekpica opened this issue 2 years ago

First of all, thank you to all the authors for releasing this remarkable research and these models!

I tried to fine-tune the RQ-Transformer model (3.9B) on a specific domain. (I'm already aware that the official training code cannot be released.) In my training code, a 'CUDA out of memory' error occurs during the training phase (at the optimizer step) on 8 NVIDIA RTX A6000 GPUs (48GB each), with a batch size of 1 per device. I'm trying to find the cause of the error and possible workarounds.

So I have a question about the minimum GPU memory size for this training task. I saw that NVIDIA A100 GPUs were used in your research paper. Were those the 80GB version? (I ask because the A100 comes in two versions, 40GB and 80GB.)

Also, should I implement model parallelism for this task with these resources? If you think training is possible with 48GB, I will look for the bug in my code. A rough memory estimate is sketched below.
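
As a sanity check, here is a back-of-envelope sketch (my own assumptions, not from the paper: plain fp32 Adam with its two state tensors per parameter, activations excluded) suggesting that the parameter, gradient, and optimizer-state memory alone may already exceed 48GB per device:

```python
# Back-of-envelope estimate of per-device memory for full fine-tuning.
# Assumptions (mine, not from the paper): fp32 weights and gradients,
# plus Adam's exp_avg and exp_avg_sq states; activations are excluded.
params = 3.9e9                   # RQ-Transformer, 3.9B parameters
bytes_per_param = 4 + 4 + 4 + 4  # weights + grads + two Adam state tensors
total_gib = params * bytes_per_param / 2**30
print(f"~{total_gib:.0f} GiB")   # ~58 GiB, above a single 48GB A6000
```

If that estimate is in the right ballpark, plain data parallelism would replicate all of it on every GPU, which is why I suspect some form of model parallelism or optimizer-state sharding may be needed.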

Baekpica avatar Apr 27 '22 08:04 Baekpica

I was able to make some tweaks to the configuration in their notebook and get it running on a single 3090 (24 GB of memory). Please see my PR: https://github.com/kakaobrain/rq-vae-transformer/pull/3

The memory requirement seemed to be dramatically lowered by disabling mixed precision.
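
In case it helps, here is a minimal sketch of the toggle I mean (assuming a typical `torch.cuda.amp` training loop, not the repo's actual code). Both `autocast` and `GradScaler` accept an `enabled` flag, so setting it to `False` runs the same loop in full fp32:

```python
import torch
from torch import nn

# Minimal sketch (not the repo's actual training loop): toggling AMP via
# torch.cuda.amp. With use_amp = False, autocast and GradScaler become
# no-ops and the loop runs entirely in full precision.
use_amp = False  # disabling mixed precision lowered peak memory for me

model = nn.Linear(1024, 1024).cuda()  # stand-in for the RQ-Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(1, 1024, device="cuda")
optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_amp):
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()  # scaling is a no-op when AMP is disabled
scaler.step(optimizer)
scaler.update()
```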

ttt733 avatar May 09 '22 17:05 ttt733

Thanks to @ttt733 for the pull request. @Baekpica, you can reduce the required memory by disabling mixed precision.

We will update the example notebook soon.

LeeDoYup avatar Sep 06 '22 17:09 LeeDoYup