dalle-playground
Clarify what constitutes a "strong local machine"
I've got an RTX 2070 SUPER 8GB and 32GB of system RAM.
When running the backend, I run into
2022-06-04 13:02:17.271909: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 768.00MiB (rounded to 805306368)requested by op
2022-06-04 13:02:17.272293: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:491] *********************************************************************_****************************__
2022-06-04 13:02:17.273116: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2141] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 805306368 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 16.4KiB
constant allocation: 64B
maybe_live_out allocation: 5.25GiB
preallocated temp allocation: 288.04MiB
preallocated temp fragmentation: 496B (0.00%)
total allocation: 5.53GiB
total fragmentation: 5.3KiB (0.00%)
But I'm not sure if it is due to a configuration error, or if my system simply cannot handle the Mega model. Does anyone know the minimum requirement for GPU RAM?
This sort of information should be added to the readme, if it is available.
I'm using an RTX 2060, and I'm also getting out-of-memory errors.
Are you trying to run the environment with DALL-E Mini or DALL-E Mega?
Hi Sahar, and thanks so much for this repo! I am wondering how much improvement in the images the Pro+ or a local GPU can make. Do you have a sense of that? I started a question for Boris that references EleutherAI's "plans" to create a larger pretrained model (https://github.com/borisdayma/dalle-mini/issues/208). I wish this were something that distributed computing (like BOINC) could help with, but I've heard it said that the large memory requirements make DC solutions difficult.
mega-1 uses (together with the few hundred MiB of the VQGAN) a little under 12 GiB of VRAM. You can use mega-1-fp16 instead; it's half as big and almost as good. Note the parameter type for this model is jnp.float16, not bfloat16 (which is a leftover from an earlier attempt, I believe).
That's strange, as mega-1 consumes 24GB on my RTX 3090. I built the Docker image, and during the build of dalle-backend I got a lot of messages saying it ran out of memory.
It turns out that JAX tries to allocate 90% of the GPU VRAM to itself by default (https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html), which can be a problem if your GPU is also used for display.
I got dalle-mini working by setting XLA_PYTHON_CLIENT_ALLOCATOR=platform (sketch below).
On my RTX 3070 Ti 8GB, mini-1 + VQGAN use 5GB of VRAM. I was not able to run mega-1-fp16; it tries to allocate f32 buffers, so I'm not sure the fp16 is doing anything.
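For reference, here's a minimal sketch of setting that up; the variable names come from the JAX memory-allocation page linked above, and only one of the two options is needed:

```python
# Minimal sketch: these must be set before importing jax (or anything that
# imports it), otherwise the default allocator has already claimed ~90% of VRAM.
import os

# Option 1: don't preallocate a big block of VRAM up front.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option 2: allocate exactly what's needed, when it's needed (slower, but
# plays more nicely with a GPU that also drives the display).
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

import jax
print(jax.devices())
```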
Thanks @Hugi-R. Unfortunately, even with that fix my RTX 2060 cannot run mini-1, so I had to switch JAX to the CPU instead.
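In case it helps anyone else, here's a minimal sketch of one way to force the CPU backend (assuming a reasonably recent JAX); it's slow, but it sidesteps VRAM limits entirely:

```python
# Minimal sketch: force JAX onto the CPU backend. Must be set before jax
# (or the backend code) is imported.
import os
os.environ["JAX_PLATFORM_NAME"] = "cpu"

import jax
print(jax.devices())  # should list only CPU devices
```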
Upstream's example notebook (current HEAD master@db1ed25) makes sure to only load the parameters onto your GPU(s) once, in float16, and with XLA's allocator adjusted, total VRAM use (including display) stays under 8GiB on my Windows system. Take a look at what's changed there vs. backend/app.py here (which is based on an older version of that code); a rough sketch of the loading pattern is below.
You should be able to run mini-1 with a 4GB GPU, mega-1-fp16 with an 8GB GPU and mega-1 with a 12GB GPU with the current reference notebook (but not what's in this repository right now).
Mini-1 likely even fits in 3GB, mega-1-fp16 in 6GB, if you move VQGAN to CPU and/or the card's not responsible for display.
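For anyone adapting backend/app.py, the loading pattern in the upstream notebook is roughly the following; this is only a sketch, and the checkpoint name here is a placeholder for whichever wandb artifact or weights you actually use:

```python
import jax.numpy as jnp
from flax.jax_utils import replicate
from dalle_mini import DalleBart

# Load the fp16 checkpoint once on the host, skipping random re-initialization
# of the weights (placeholder model name; substitute the artifact you use).
model, params = DalleBart.from_pretrained(
    "dalle-mini/dalle-mega",
    dtype=jnp.float16,
    _do_init=False,
)

# Push the parameters to the device(s) a single time and reuse them for every
# request, instead of reloading them per call.
params = replicate(params)
```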
@mikaczma As mentioned above, XLA by default assumes it's running on dedicated hardware and reserves 90% of VRAM as soon as you request any (which can already happen from import statements). One easy way to avoid this is to put import os; os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform" (or whatever behavior you want, see link above) at the top of your code.
Also, not all out-of-memory errors are fatal. JAX/XLA tries a couple times. I get OOMs every time I load the fp32 model, but also succeed every time, with plenty of space left.
@drdaxxy thank you for the explanation, I will try it ASAP. It makes sense that JAX reserving most of the available resources would cause this. I should check the docs on it.
Hi all, thanks a lot for these insights. Adding this environment variable to the docker-compose file did the trick (see the sketch below).
This is the memory usage for my RTX 2060:
I hope this environment variable gets added by default in this project.
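For anyone else doing the same, this is roughly what the change looks like in the compose file; it's only a sketch, and the service name below is a placeholder (check your own docker-compose.yml):

```yaml
services:
  dalle-backend:   # placeholder: use the actual backend service name
    build: ./backend
    environment:
      - XLA_PYTHON_CLIENT_ALLOCATOR=platform
```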
Hi, there is also a hint if you have two different cards in the machine ;)
Set CUDA_VISIBLE_DEVICES=(number of the card(s) that are of one type),
e.g. CUDA_VISIBLE_DEVICES=0 (sketch below),
since from what I've seen JAX has issues when code is compiled on one card and then run on a different one.
dalle-mini consumes ~6.5GB on my 3090; mega takes all I have :) 23GB, but it works, and a 3-image query takes ~20s.
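If you'd rather pin the card from inside the code than in the shell, here's a minimal sketch (this must run before JAX is imported):

```python
# Minimal sketch: make only GPU 0 visible to JAX/XLA so it never compiles on
# one card and runs on the other. Must be set before jax is imported.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import jax
print(jax.devices())  # should show only the selected GPU
```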
Hey there, how is the image quality using your 3090? Is it much different than the Spaces demo online or the mega1-fp16? Thanks
No, image quality will not change; sometimes I even get worse results. A 9-image query takes ~120-140s with ~70% GPU load.
Thanks for the images, appreciate that. I find the smaller model struggles most with human faces. Could you share a couple of those if possible?
I'm assuming the bottleneck is RAM size, not processing time. Either that, or the gains from shuffling data in and out of training RAM somehow aren't enough to warrant it. I wish someone like EleutherAI would share a pre-trained model 👍
I assume this issue is not related to the quality of the model, so that discussion should be moved to another topic. But in fact even the DALL-E Mega images have issues with faces (and e.g. cat and dog heads); they come out blurred or deformed. According to this article, as of April the model was only 18% trained.
I've just pushed some updates as part of a PR that should accelerate the inference using DALL-E Mega.
To run the playground with DALL-E Mega, make sure to pass "mega" as an additional command line arg, i.e. !python dalle-playground/backend/app.py 8000 mega.
I can run the dockerized version of mega but not mega_full. I can't run the non-dockerized version because it can't find a TPU for some reason.
I have 32GB of RAM and this:
Graphics:
Device-1: NVIDIA GA104M [GeForce RTX 3070 Mobile / Max-Q] driver: nvidia
v: 515.48.07
CPU:
Info: 8-core model: AMD Ryzen 9 5900HX with Radeon Graphics bits: 64
5.17.14-1-MANJARO x86_64 GNU/Linux
9 images get generated in 70.67 sec
I also get RESOURCE_EXHAUSTED errors constantly, but I think that's because I try to run too much at once, and it eventually succeeds.