rq-vae-transformer
Minimum GPU memory size for training RQ-Transformer
First of all, thank you to all the authors for releasing this remarkable research and these models!
I tried to fine-tune the RQ-Transformer model (3.9B) on a specific domain. (I'm already aware that it is impossible to release the official training code.) In my training code, a 'CUDA out of memory' error occurred on 8 NVIDIA RTX A6000 GPUs (48 GB each) during the training phase (optimizer step), with a batch size of 1 per device. I'm trying to find the cause of the error and possible alternative solutions.
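For reference, here is my rough back-of-the-envelope estimate of the per-GPU memory for plain data-parallel fine-tuning (assuming fp32 AdamW with no optimizer-state sharding; these are my own assumed numbers, not measured values). It already exceeds 48 GB before activations are counted, which would explain an OOM at the optimizer step:

```python
# Rough per-GPU memory estimate for plain fp32 AdamW fine-tuning (no sharding).
# Back-of-the-envelope sketch only; actual usage depends on the training code.
params = 3.9e9       # RQ-Transformer parameter count
bytes_fp32 = 4

weights    = params * bytes_fp32       # fp32 weights
grads      = params * bytes_fp32       # fp32 gradients
adam_state = params * bytes_fp32 * 2   # Adam exp_avg + exp_avg_sq

total_gb = (weights + grads + adam_state) / 1024**3
print(f"~{total_gb:.0f} GB before activations")  # ~58 GB, already above 48 GB
```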
So I have a question about the minimum GPU memory size for this training task. I saw that NVIDIA A100 GPUs were used in your research paper. Were those the 80 GB version? (I ask because the A100 comes in two versions, 40 GB and 80 GB.)
Also, should I implement model-parallel code for this task with these resources? If, in your opinion, training is possible with 48 GB, I will look for the bug in my own code.
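In case sharding turns out to be necessary, this is the kind of approach I would try first (a minimal sketch using PyTorch's FullyShardedDataParallel to spread parameters, gradients, and optimizer state across the 8 GPUs; `build_model` and `config` are placeholders for my own code, not this repo's API):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model(config)  # hypothetical model builder, not this repo's API
# FSDP shards weights, gradients, and Adam state across all ranks,
# so the optimizer state no longer has to fit on a single 48 GB device.
model = FSDP(model.cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```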
I was able to make some tweaks to the configuration in their notebook and get it running on a single 3090 (24 GB memory). Please see my PR: https://github.com/kakaobrain/rq-vae-transformer/pull/3
The memory requirement seemed to be dramatically lowered by disabling mixed precision.
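If it helps, this is roughly the toggle I mean (a minimal sketch of a generic PyTorch AMP training loop; `model`, `optimizer`, and `loader` stand in for whatever your script defines, and the repo's actual loop may differ):

```python
import torch

USE_AMP = False  # disabling autocast/GradScaler lowered memory in my runs

scaler = torch.cuda.amp.GradScaler(enabled=USE_AMP)
for images in loader:  # placeholder data loader
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=USE_AMP):
        loss = model(images.cuda())  # assumes the model returns a scalar loss
    scaler.scale(loss).backward()  # scaling is a no-op when the scaler is disabled
    scaler.step(optimizer)
    scaler.update()
```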
Thanks to @ttt733 for the pull request. @Baekpica, you can reduce the required memory by disabling mixed precision.
We will update the example notebook soon.