Got OOM message with RTX 3060
I've been trying to run Stable Diffusion on the GPU, but it failed with the OOM message below.
Is this error due to insufficient GPU memory? Is it possible to make it work by adjusting some parameters? Stable Diffusion 1.4 runs on this GPU in a TensorFlow environment, so it would be nice if it worked with Bumblebee too.
It's working fine with :host. It's amazing how easy it is to use neural networks with Livebook!
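For reference, here is a minimal sketch of how the EXLA client can be selected (standard EXLA options, not code from this thread; :cuda matches the XLA_TARGET=cuda111 build listed below):

# Sketch only: :host runs on the CPU, :cuda targets the GPU.
Nx.global_default_backend({EXLA.Backend, client: :cuda})

# The client can also be chosen per computation via compiler options:
# defn_options: [compiler: EXLA, client: :cuda]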
- OS: Ubuntu 22.04 on WSL2
- GPU: RTX 3060 (12GB)
- Livebook v0.8.0
- Elixir v1.14.2
- XLA_TARGET=cuda111
- CUDA Version: 11.7
05:32:56.019 [info] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
05:32:56.023 [info] XLA service 0x7fb39437dac0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
05:32:56.023 [info] StreamExecutor device (0): NVIDIA GeForce RTX 3060, Compute Capability 8.6
05:32:56.023 [info] Using BFC allocator.
05:32:56.023 [info] XLA backend allocating 10641368678 bytes on device 0 for BFCAllocator.
05:32:58.662 [info] Start cannot spawn child process: No such file or directory
05:34:00.234 [info] total_region_allocated_bytes_: 10641368576 memory_limit_: 10641368678 available bytes: 102 curr_region_allocation_bytes_: 21282737664
05:34:00.234 [info] Stats:
Limit: 10641368678
InUse: 5530766592
MaxInUse: 7566778624
NumAllocs: 3199
MaxAllocSize: 399769600
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
05:34:00.234 [warn] **********___***********************************************************____________________________
05:34:00.234 [error] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 3546709984 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
parameter allocation: 3.84GiB
constant allocation: 144B
maybe_live_out allocation: 768.0KiB
preallocated temp allocation: 3.30GiB
preallocated temp fragmentation: 304B (0.00%)
total allocation: 7.15GiB
total fragmentation: 821.0KiB (0.01%)
The whole log is in oommessage.log.
We are likely less efficient than TensorFlow somewhere. This might be related: https://github.com/elixir-nx/nx/issues/1003
One thing you can try is mixed precision in all of the models; computing in half precision roughly halves the memory needed for intermediate activations:
policy = Axon.MixedPrecision.create_policy(compute: :f16)

# do this for every model
{:ok, %{model: clip_model} = clip} = Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"})
clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip_model, policy)}
Note that I haven't tested whether this affects the image outputs.
I tried code like this, but it didn't help; I got the same OOM message.
policy = Axon.MixedPrecision.create_policy(compute: :f16)

{:ok, clip} =
  Bumblebee.load_model({:hf, repository_id, subdir: "text_encoder"},
    log_params_diff: false
  )

clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip.model, policy)}

{:ok, unet} =
  Bumblebee.load_model({:hf, repository_id, subdir: "unet"},
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

unet = %{unet | model: Axon.MixedPrecision.apply_policy(unet.model, policy)}

{:ok, vae} =
  Bumblebee.load_model({:hf, repository_id, subdir: "vae"},
    architecture: :decoder,
    params_filename: "diffusion_pytorch_model.bin",
    log_params_diff: false
  )

vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

{:ok, safety_checker} =
  Bumblebee.load_model({:hf, repository_id, subdir: "safety_checker"},
    log_params_diff: false
  )

safety_checker = %{
  safety_checker
  | model: Axon.MixedPrecision.apply_policy(safety_checker.model, policy)
}
I see this as well, which is probably expected given that I have only 6 GB.
I will note that I can run things like InvokeAI and do text2img with only 6 GB (and I believe InvokeAI uses the same kind of lowered precision to achieve that).
My specs:
nvidia-smi
Sat Dec 10 00:00:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.56.06 Driver Version: 520.56.06 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 38C P8 6W / 120W | 15MiB / 6144MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 14189 G ...xorg-server-1.20.14/bin/X 9MiB |
| 0 N/A N/A 14217 G ...hell-43.1/bin/gnome-shell 2MiB |
+-----------------------------------------------------------------------------+
I set the following policy and confirmed that images can be generated with the host client (not CUDA).
policy =
  Axon.MixedPrecision.create_policy(
    params: {:f, 16},
    compute: {:f, 32},
    output: {:f, 16}
  )

clip = %{clip | model: Axon.MixedPrecision.apply_policy(clip.model, policy)}
unet = %{unet | model: Axon.MixedPrecision.apply_policy(unet.model, policy)}
vae = %{vae | model: Axon.MixedPrecision.apply_policy(vae.model, policy)}

safety_checker = %{
  safety_checker
  | model: Axon.MixedPrecision.apply_policy(safety_checker.model, policy)
}

serving =
  Bumblebee.Diffusion.StableDiffusion.text_to_image(clip, unet, vae, tokenizer, scheduler,
    num_steps: 10,
    num_images_per_prompt: 1,
    safety_checker: safety_checker,
    safety_checker_featurizer: featurizer,
    compile: [batch_size: 1, sequence_length: 50],
    defn_options: [compiler: EXLA]
  )
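The serving is then run with the usual Nx.Serving API (a hedged sketch; the exact result structure may vary by Bumblebee version):

# Standard Nx.Serving usage; prompt text is just an example.
output = Nx.Serving.run(serving, "a photo of an astronaut riding a horse")
# output.results is a list of maps, each containing a generated :image tensor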
OOM still occurs when running on CUDA.
Looking at the peak buffers included in the OOM message, the shapes are f32. Is the policy having no effect, or is this a memory problem unrelated to the policy?
Peak buffers:
Buffer 1:
Size: 1.00GiB
XLA Label: custom-call
Shape: f32[2,8,4096,4096]
==========================
Buffer 2:
Size: 144.75MiB
Entry Parameter Subshape: f32[49408,768]
==========================
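For what it's worth, one possibility (an assumption on my part, not verified in this thread): apply_policy changes the model's layer policies, but parameters that were already loaded may stay f32, which would match the f32 entry parameter above. A minimal sketch of casting loaded params directly with Nx:

# Hypothetical workaround: cast already-loaded parameters to f16.
# Assumes a flat two-level params map (layer name => param name => tensor).
cast_params = fn params ->
  Map.new(params, fn {layer, layer_params} ->
    {layer, Map.new(layer_params, fn {name, tensor} -> {name, Nx.as_type(tensor, :f16)} end)}
  end)
end

clip = %{clip | params: cast_params.(clip.params)}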
Yes, it can also be that there are places where we could improve the model efficiency. There are some PRs in the diffusers repo and some Twitter threads about memory-efficient attention (see the sketch after this list):
- https://mobile.twitter.com/Nouamanetazi/status/1576959648912973826
- https://mobile.twitter.com/pcuenq/status/1590665645233881089
- https://mobile.twitter.com/realDanFu/status/1580641495991754752
- https://github.com/huggingface/diffusers/pull/366
- https://github.com/huggingface/diffusers/pull/532
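To illustrate the idea behind those PRs: attention slicing trades a little speed for peak memory by computing attention for a few heads at a time instead of all at once; the f32[2,8,4096,4096] buffer above is exactly such a score tensor. This is a minimal sketch in plain Nx under my own assumptions, not Bumblebee's actual implementation:

defmodule SlicedAttention do
  # q, k, v have shape {batch, heads, seq_len, head_dim}.
  # slice_size must divide heads; peak memory for the score tensor
  # drops roughly by a factor of heads / slice_size.
  def attend(q, k, v, slice_size) do
    heads = Nx.axis_size(q, 1)
    scale = :math.sqrt(Nx.axis_size(q, 3))

    0..(heads - 1)//slice_size
    |> Enum.map(fn start ->
      qs = Nx.slice_along_axis(q, start, slice_size, axis: 1)
      ks = Nx.slice_along_axis(k, start, slice_size, axis: 1)
      vs = Nx.slice_along_axis(v, start, slice_size, axis: 1)

      # scores: {batch, slice_size, seq_len, seq_len}
      scores = Nx.divide(Nx.dot(qs, [3], [0, 1], ks, [3], [0, 1]), scale)

      # numerically stable softmax over the last axis
      maxes = Nx.reduce_max(scores, axes: [-1], keep_axes: true)
      exps = Nx.exp(Nx.subtract(scores, maxes))
      probs = Nx.divide(exps, Nx.sum(exps, axes: [-1], keep_axes: true))

      Nx.dot(probs, [3], [0, 1], vs, [2], [0, 1])
    end)
    |> Nx.concatenate(axis: 1)
  end
end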
@seanmor5, do you know what we need to do to generate graphs such as this one? https://github.com/huggingface/diffusers/pull/371
Forwarded here from the above issue. Is there any way for me to give Bumblebee more of my memory? Do I need to simply increase the amount of memory I have?
You have 4GB, right? That's currently not enough for SD.
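For reference, EXLA exposes allocator knobs on its clients; a hedged sketch (the option names are my recollection of EXLA's client configuration, so check the docs for your version):

# Sketch only: tune how much GPU memory XLA claims upfront.
config :exla, :clients,
  cuda: [
    platform: :cuda,
    # fraction of GPU memory the BFC allocator may claim
    memory_fraction: 0.95,
    # set to false to allocate on demand instead of preallocating
    preallocate: true
  ]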
No, the VM I run this on has 8GB, and the GPU has 6GB.
@krainboltgreene we have some experiments that have brought it down to 5GB for a single image. We will be publishing them in the coming weeks.
That is incredible. I have been wanting to dive much deeper into how bumblebee/nx work because I would love to contribute even more to the various APIs. Excited to see the source and learn more.
Opened #147 with a more principled approach.