
Out of GPU memory using rewritten backend

Open ethanproia opened this issue 3 years ago • 8 comments

  • Using the rewritten backend (consts.py, dalle_model.py, etc.) with a Docker build on a 3060: out of memory loading Mini even with 12 GB of VRAM?
  • The README also needs updated instructions for using Mega.

dalle-backend | 2022-06-09 14:00:57.289965: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 198967552 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
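For scale, the failed allocation in that log line is fairly modest, and the "on host" wording suggests it is pinned host memory rather than VRAM. A quick conversion of the byte count from the log:

```python
# Byte count taken from the log line above; "on host" indicates
# pinned host memory, not GPU memory.
failed_alloc_bytes = 198_967_552
failed_alloc_mib = failed_alloc_bytes / 2**20
print(f"{failed_alloc_mib:.1f} MiB")  # prints "189.8 MiB"
```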

ethanproia avatar Jun 09 '22 14:06 ethanproia

I am having the same problem, Windows 11 WSL2 docker build on a 3080.

sparklerfish avatar Jun 10 '22 06:06 sparklerfish

I'm not using Docker, running directly on Win 10 WSL2; same out-of-memory issue with a 3090.

raylin01 avatar Jun 10 '22 22:06 raylin01

Same here locally on a 3070 Ti.

I tried running it on an EC2 g4dn.2xlarge but hit the same OOM. Honestly, I'm not sure what the best instance type for this is either, so I can't say whether EC2 would have worked anyway.

loofou avatar Jun 11 '22 07:06 loofou

Just an update: it prints the out-of-memory errors, but in some cases it will actually run once everything finishes loading. I'm not sure what the root cause of the error is, but I can run the server if I just let it churn for a while. I would start with dalle-mini (not Mega); that should produce roughly a terminal's worth of errors, then after a bit longer it should load. I am unable to run both Mini and Mega at the same time, though.

raylin01 avatar Jun 11 '22 17:06 raylin01

You are right, I can run the mini model locally even with the OOM errors. Mega doesn't work, but I also wasn't expecting it to work on my machine.

loofou avatar Jun 12 '22 08:06 loofou

I use WSL2 on Windows 10 with an RTX 3080 16GB model; neither Mini nor Mega works for me using the manual setup.

The JAX environment variables (the first of which was mentioned in another issue), XLA_PYTHON_CLIENT_ALLOCATOR=platform and XLA_PYTHON_CLIENT_PREALLOCATE=false, don't really seem to help.
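For anyone trying these, a minimal sketch of how the two variables would typically be applied from Python rather than the shell (they must be set before jax is imported to take effect; the .75 fraction in the commented alternative is just an illustrative value):

```python
import os

# Stop JAX from preallocating the bulk of GPU memory up front.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"
# Allocate and free on demand instead of using the BFC allocator (slower,
# but makes actual usage visible and can reduce fragmentation-related OOMs).
os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = "platform"

# Alternative: keep preallocation but cap the fraction of GPU memory used.
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = ".75"

# import jax  # only after the environment is configured
```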

I'm not sure of a better way to do it, but I watch my GPU VRAM usage with Task Manager. Starting up Mini, once the allocation errors begin (totaling roughly 9 GB), I end up with only about 2 GB actually allocated on the card after the fact, once the web server starts.

The local web page also only shows "{success:true}" when I open it up in a browser, and I'm not sure where to go from here.

SC-004096 avatar Jun 16 '22 02:06 SC-004096

I was having the exact same error and I followed @raylin01's advice. I just waited it out and eventually it worked. I'm running the Mega Full model on WSL2, Windows 10 on an RTX 3090.

I was not having this error yesterday, but I installed dalle-flow in the meantime, and it was throwing this exact same error. So the error started after I came back from installing dalle-flow. Maybe someone else is in the same situation?

ghost avatar Jun 20 '22 19:06 ghost

Is there anyone who can help solve this problem?

HelloException avatar Sep 18 '22 10:09 HelloException