
Clarify what constitutes a "strong local machine"

lbenedetto opened this issue 2 years ago · 18 comments

I've got an RTX 2070 SUPER 8GB and 32GB of system RAM.

When running the backend, I run into

2022-06-04 13:02:17.271909: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:479] Allocator (GPU_0_bfc) ran out of memory trying to allocate 768.00MiB (rounded to 805306368)requested by op 
2022-06-04 13:02:17.272293: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:491] *********************************************************************_****************************__
2022-06-04 13:02:17.273116: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2141] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 805306368 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:    16.4KiB
              constant allocation:        64B
        maybe_live_out allocation:    5.25GiB
     preallocated temp allocation:  288.04MiB
  preallocated temp fragmentation:       496B (0.00%)
                 total allocation:    5.53GiB
              total fragmentation:     5.3KiB (0.00%)

But I'm not sure if it is due to a configuration error, or if my system simply cannot handle the Mega model. Does anyone know the minimum requirement for GPU RAM?

This sort of information should be added to the readme, if it is available.

lbenedetto avatar Jun 04 '22 20:06 lbenedetto

I'm using an RTX 2060, and I'm also receiving out-of-memory errors.

Matheus-Garbelini avatar Jun 05 '22 14:06 Matheus-Garbelini

Are you trying to run the environment with DALL-E Mini or DALL-E Mega?

saharmor avatar Jun 06 '22 03:06 saharmor

Hi Sahar, and thanks so much for this repo! I am wondering how much improvement in the images the Pro+ or a local GPU can make. Do you have a sense of that? I started a question for Boris that references EleutherAI's "plans" to create a larger pretrained model (https://github.com/borisdayma/dalle-mini/issues/208). I wish this were something that distributed computing (like BOINC) could help with, but I've heard that the large memory requirements make DC solutions difficult.

auwsom avatar Jun 06 '22 03:06 auwsom

mega-1 uses (together with the few hundred MiB of the VQGAN) a little under 12 GiB of VRAM.

You can use mega-1-fp16, it's half as big and almost as good. Note the parameter type for this model is jnp.float16, not bfloat16 (which is a leftover from an earlier attempt, I believe).
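
A quick back-of-envelope check of why fp16 roughly halves the footprint (the ~2.6B parameter count used here is an approximation, not a figure from this thread; activations and the VQGAN need extra VRAM on top):

```python
# Rough parameter-memory estimate for DALL-E Mega.
# ~2.6B params is an approximation; real VRAM use is higher because of
# activations, the VQGAN, and allocator overhead.
PARAMS = 2.6e9

fp32_gib = PARAMS * 4 / 2**30  # float32: 4 bytes per parameter
fp16_gib = PARAMS * 2 / 2**30  # float16: 2 bytes per parameter
print(f"fp32 params ≈ {fp32_gib:.1f} GiB, fp16 params ≈ {fp16_gib:.1f} GiB")
```

This lines up with the "a little under 12 GiB" figure above once activations and the VQGAN are added on top of the fp32 parameters.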

drdaxxy avatar Jun 06 '22 04:06 drdaxxy

That's strange, as mega-1 consumes 24GB on my RTX 3090. I built the Docker image, and during the build of dalle-backend I get a lot of messages saying it ran out of memory.

mikaczma avatar Jun 06 '22 15:06 mikaczma

It turns out that JAX tries to allocate 90% of the GPU VRAM to itself by default (https://jax.readthedocs.io/en/latest/gpu_memory_allocation.html), which can be a problem if your GPU is also used for the display. I got dalle-mini working by setting XLA_PYTHON_CLIENT_ALLOCATOR=platform.
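
As a minimal sketch, the key point is that the variable must be set before JAX is first imported:

```python
import os

# Must run before the first `import jax`; once XLA initializes, it has
# already preallocated ~90% of VRAM. "platform" allocates and frees
# on demand, which is slower but leaves room for the display and other apps.
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"
```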

On my RTX 3070 Ti 8GB, mini-1 + VQGAN use 5GB of VRAM. I was not able to run mega-1-fp16; it tries to allocate f32 buffers, so I'm not sure the fp16 is doing anything.

Hugi-R avatar Jun 06 '22 17:06 Hugi-R

Thanks @Hugi-R. Unfortunately, even with that fix my RTX 2060 cannot run mini-1, so I had to switch JAX to CPU instead.
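
For reference, one common way to force JAX onto the CPU is an environment variable set before import; the exact variable name used here may vary across JAX versions:

```python
import os

# Set before the first `import jax` so device discovery skips the GPU.
# Older JAX versions read JAX_PLATFORM_NAME; newer ones also accept
# JAX_PLATFORMS. Check the JAX docs for the version you have installed.
os.environ["JAX_PLATFORM_NAME"] = "cpu"
```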

Matheus-Garbelini avatar Jun 06 '22 19:06 Matheus-Garbelini

Upstream's example notebook (current HEAD master@db1ed25) makes sure to only load the parameters onto your GPU(s) once, in float16, and with XLA's allocator adjusted, total VRAM use (including display) stays under 8GiB on my Windows system. Take a look at what's changed there vs. backend/app.py here (which is based on an older version of that code).

You should be able to run mini-1 with a 4GB GPU, mega-1-fp16 with an 8GB GPU and mega-1 with a 12GB GPU with the current reference notebook (but not what's in this repository right now).

Mini-1 likely even fits in 3GB, mega-1-fp16 in 6GB, if you move VQGAN to CPU and/or the card's not responsible for display.

@mikaczma As mentioned above, XLA by default assumes it's running on dedicated hardware and reserves 90% of VRAM as soon as you request any (which can already happen from import statements). One easy way to avoid this is to put import os; os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform" (or whatever behavior you want, see link above) at the top of your code.

Also, not all out-of-memory errors are fatal. JAX/XLA tries a couple times. I get OOMs every time I load the fp32 model, but also succeed every time, with plenty of space left.

drdaxxy avatar Jun 06 '22 19:06 drdaxxy

@drdaxxy thank you for your explanation, I will try it ASAP. It makes sense that JAX reserves most of the available resources by default. I should check the docs for it.

mikaczma avatar Jun 06 '22 20:06 mikaczma

Hi all, thanks a lot for these insights. Adding this environment variable to the docker-compose file did the trick. Here is the memory usage on the RTX 2060: [screenshot: Screenshot_20220607_051731]
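
A sketch of what that compose change might look like (the service name here is illustrative; the actual name in this repo's compose file may differ):

```yaml
# docker-compose fragment — service name is an assumption
services:
  dalle-backend:
    environment:
      - XLA_PYTHON_CLIENT_ALLOCATOR=platform
```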

I hope this environment variable gets set by default in this project.

Matheus-Garbelini avatar Jun 06 '22 21:06 Matheus-Garbelini

Hi, also a hint if you have two different cards in the machine ;) Set CUDA_VISIBLE_DEVICES=(index of the card(s) of one type), e.g. CUDA_VISIBLE_DEVICES=0, as JAX seems to have issues when code is compiled on one card and then run on a different one. dalle-mini consumes ~6.5GB on the 3090; mega takes all I have :) 23GB, but it works, and a 3-image query takes ~20s.
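
A minimal sketch of applying that hint from Python rather than the shell (the index "0" is illustrative):

```python
import os

# Hide all but one GPU from CUDA, and therefore from JAX/XLA.
# Must be set before the first `import jax` (or any CUDA initialization).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```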

mikaczma avatar Jun 06 '22 21:06 mikaczma

> Hi, also there is a hint if you have two different cards in the machine ;) Set CUDA_VISIBLE_DEVICES=(number of card/s which are one type) eg. CUDA_VISIBLE_DEVICES=0 as I see JAX has issues when compiled and then run on different cards. dalle-mini consumes ~6.5GB on 3090, mega takes all I have :) 23GB but works and 3 images query time is ~20s

Hey there, how is the image quality using your 3090? Is it much different than the Spaces demo online or the mega1-fp16? Thanks

auwsom avatar Jun 06 '22 21:06 auwsom

No, image quality will not change; sometimes I even get worse results. A 9-image query takes ~120-140s at ~70% GPU load.

mikaczma avatar Jun 06 '22 21:06 mikaczma

[image] vs [image]

mikaczma avatar Jun 06 '22 22:06 mikaczma

> No image quality will not change - sometimes I have even worse results. 9 images query takes ~120-140s with 70ish% GPU load

Thanks for the images, appreciate that. I find the smaller model struggles most with human faces. Could you share a couple of those if possible?

I'm assuming the bottleneck is RAM size, not processing time. Either that, or the gains from somehow shuffling data in and out of training RAM are not enough to warrant it. I wish someone like EleutherAI would share a pre-trained model 👍

auwsom avatar Jun 06 '22 22:06 auwsom

I assume that this issue is not related to the quality of the model, so this should be moved to another topic. But in fact even dalle-mega images have issues with faces (and e.g. cat and dog heads); they come out blurred or deformed. According to this article, as of April the model was only about 18% trained.

mikaczma avatar Jun 06 '22 22:06 mikaczma

I've just pushed some updates as part of a PR that should accelerate the inference using DALL-E Mega. To run the playground with DALL-E Mega make sure to pass "mega" as another command line arg, i.e. !python dalle-playground/backend/app.py 8000 mega.
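
A minimal sketch of how such a command-line switch might map onto model names (this mapping and the artifact names are illustrative assumptions, not the repo's actual backend/app.py code):

```python
# Hypothetical model registry; the actual names/paths in backend/app.py
# may differ. Keys mirror the CLI values discussed in this thread.
MODELS = {
    "mini": "dalle-mini/dalle-mini/mini-1:v0",
    "mega": "dalle-mini/dalle-mini/mega-1-fp16:latest",
    "mega_full": "dalle-mini/dalle-mini/mega-1:latest",
}

def pick_model(args):
    # args mirrors `app.py 8000 mega`: [port, optional model version].
    # Defaults to "mini" when no version is given.
    version = args[1] if len(args) > 1 else "mini"
    if version not in MODELS:
        raise SystemExit(f"unknown model version: {version!r}")
    return MODELS[version]

print(pick_model(["8000", "mega"]))
```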

saharmor avatar Jun 09 '22 07:06 saharmor

I can run the dockerized version of mega but not mega_full. I can't run the non-dockerized version because it can't find a TPU for some reason.

I have 32GB of RAM and this:

Graphics:
  Device-1: NVIDIA GA104M [GeForce RTX 3070 Mobile / Max-Q] driver: nvidia
    v: 515.48.07
CPU:
  Info: 8-core model: AMD Ryzen 9 5900HX with Radeon Graphics bits: 64
5.17.14-1-MANJARO x86_64 GNU/Linux

9 images get generated in 70.67 sec

I also get RESOURCE_EXHAUSTED errors constantly, but I think these happen because I try to run too much at once; eventually it succeeds.

Invertisment avatar Jun 14 '22 11:06 Invertisment