
CUDA out of memory PROBLEM SOLUTION

Open VadimPoliakov opened this issue 1 year ago • 14 comments

The cause of this issue is very large models, more than 60 GB in total, so diffusers tries to load all of them into GPU VRAM. There are a couple of ways to fix it.

The first one is to add this line of code to your script:

pipe.enable_sequential_cpu_offload()

You will then be able to start your script, but it will be quite slow.
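
For reference, a minimal sketch of where that call goes (the model ID and prompt are only illustrative):

import torch
from diffusers import FluxPipeline

# Load on the CPU first; do not call pipe.to("cuda") yourself
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)

# Streams submodules to the GPU one at a time: minimal VRAM usage, but much slower inference
pipe.enable_sequential_cpu_offload()

image = pipe("A mystic cat", num_inference_steps=50).images[0]
image.save("flux-offloaded.png")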

The second way is to quantize your models. Here are code examples for different use cases with different models:

# This one is for generating images with Flux.1-dev
import torch
from diffusers import FluxTransformer2DModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
# Pre-quantized NF4 checkpoint of the Flux transformer
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)

# Build the pipeline around the quantized transformer
pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
# Keep components on the CPU and move each one to the GPU only while it runs
pipe.enable_model_cpu_offload()

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")

# This one is for upscaling images with jasperai/Flux.1-dev-Controlnet-Upscaler
import torch
from diffusers.utils import load_image
from diffusers import FluxControlNetModel, BitsAndBytesConfig, FluxTransformer2DModel
from diffusers.pipelines import FluxControlNetPipeline


# Quantize the ControlNet weights to NF4 while loading
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

controlnet = FluxControlNetModel.from_pretrained(
    "jasperai/Flux.1-dev-Controlnet-Upscaler",
    quantization_config=nf4_config,
)

model_id = "black-forest-labs/FLUX.1-dev"
# Pre-quantized NF4 checkpoint of the Flux transformer
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.float16)

pipe = FluxControlNetPipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.float16,
    controlnet=controlnet
)
# Keep components on the CPU and move each one to the GPU only while it runs
pipe.enable_model_cpu_offload()

control_image = load_image(
    "image.jpg"
)

# An empty prompt works for pure upscaling; keep the output size equal to the input
image = pipe(
    prompt="",
    control_image=control_image,
    controlnet_conditioning_scale=0.6,
    num_inference_steps=28,
    guidance_scale=3.5,
    height=control_image.size[1],
    width=control_image.size[0]
).images[0]
image.save("upscaled_img_quanted.png")

For these solutions we should thank @sayakpaul.

VadimPoliakov · Nov 01 '24 10:11

Hi, @VadimPoliakov
I am using an A10 GPU with 48 GB VRAM on RunPod, which is ample for the Flux model, and it runs smoothly in a Jupyter notebook. But when deploying with FastAPI I get a CUDA out of memory error. The issue also occurs with the quantized model.
Any help would be appreciated. Thanks! cc @sayakpaul

Jaid844 · Nov 01 '24 14:11

Hi, @VadimPoliakov I am using an A10 GPU with 48 GB VRAM on RunPod, which is ample for the Flux model, and it runs smoothly in a Jupyter notebook. But when deploying with FastAPI I get a CUDA out of memory error. The issue also occurs with the quantized model. Any help would be appreciated. Thanks! cc @sayakpaul

Hi. I'm not sure, but it seems like a problem with processing more than one image at the same time. Try using a queue for that.
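
Roughly like this (a minimal sketch with made-up names, not code from this thread): one worker thread owns the pipeline, so only one image is generated at a time no matter how many requests arrive.

import queue
import threading

import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

jobs = queue.Queue()

def worker():
    # The single worker is the only code that touches the GPU
    while True:
        prompt, done, result = jobs.get()
        result["image"] = pipe(prompt).images[0]
        done.set()  # tell the caller the image is ready
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def generate(prompt):
    # Called from request handlers; blocks until the worker finishes this job
    done, result = threading.Event(), {}
    jobs.put((prompt, done, result))
    done.wait()
    return result["image"]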

VadimPoliakov · Nov 01 '24 14:11

No, the problem is that when you stage the deployment, instead of starting the API it throws a CUDA out of memory error.

Jaid844 · Nov 01 '24 14:11

No, the problem is that when you stage the deployment, instead of starting the API it throws a CUDA out of memory error.

If you start with several workers, diffusers tries to load all the models into GPU VRAM once per worker. Make a separate service (not FastAPI) with a queue and no extra workers, and have your FastAPI service just call that service.
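
For example (the file name, port, and endpoint below are made up for illustration): each uvicorn worker is a separate process that loads its own copy of the pipeline, so run the GPU-facing service with exactly one worker, e.g. uvicorn inference_service:app --workers 1 --port 8001, and let the public FastAPI app only forward requests:

import httpx
from fastapi import FastAPI

app = FastAPI()

@app.post("/generate")
async def generate(prompt: str):
    # Forward the request to the single-worker GPU service
    # instead of loading the model in this process
    async with httpx.AsyncClient(timeout=None) as client:
        r = await client.post("http://localhost:8001/generate", json={"prompt": prompt})
    return r.json()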

VadimPoliakov · Nov 01 '24 15:11

Thanks for the help, bro!

Jaid844 · Nov 02 '24 10:11

It does not run on a Colab T4.

werruww · Nov 17 '24 21:11

import torch
from diffusers import FluxTransformer2DModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)

pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
#pipe.enable_model_cpu_offload()
#pipe.enable_sequential_cpu_offload()

prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=3, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: The secret HF_TOKEN does not exist in your Colab secrets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session. You will be able to reuse this secret in all of your notebooks. Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'diffusers.quantizers.quantization_config.BitsAndBytesConfig'>.
torch.uint8
BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

Loading pipeline components...: 100% 7/7 [00:02<00:00, 3.02it/s]
Loading checkpoint shards: 100% 2/2 [00:01<00:00, 1.69it/s]
You set add_prefix_space. The tokenizer needs t

Executing (22m 1s)
Still working

werruww · Nov 17 '24 22:11

@werruww just create your token on Hugging Face

VadimPoliakov · Nov 18 '24 08:11

just create your token on Hugging Face

how?????

werruww · Nov 20 '24 23:11

my Access Tokens

??????

werruww · Nov 20 '24 23:11

how?????

https://huggingface.co/settings/tokens
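
Once you have created a token there, the standard huggingface_hub way to use it looks like this (the hf_... value is a placeholder for your own token):

from huggingface_hub import login

# Authenticate this session so gated repos like FLUX.1-dev can be downloaded
login(token="hf_...")

# Or pass the token directly when loading:
# pipe = FluxPipeline.from_pretrained(model_id, token="hf_...")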

geronimi73 · Nov 20 '24 23:11

The cause of this issue is very large models, more than 60 GB in total, so diffusers tries to load all of them into GPU VRAM. [...] For these solutions we should thank @sayakpaul.

This solution does not fit in 24 GB VRAM in my case (the ControlNet version). What hardware do you use for that?

Oguzhanercan · Nov 21 '24 10:11

@Oguzhanercan My hardware is an NVIDIA 3090 with 24 GB VRAM. When you use a ControlNet, that model has to be quantized too, as described in the solution.

VadimPoliakov · Nov 21 '24 14:11

@VadimPoliakov I could not quantize the ControlNet for reasons I cannot remember right now, so I used sequential offload to reduce memory usage. Thanks for the reply.

Oguzhanercan · Nov 22 '24 07:11