CUDA out of memory: problem and solution
The reason for this issue is that these models are really big, more than 60 GB in total, so diffusers tries to put all of them into GPU VRAM. There are a couple of ways to fix it.
The first one is to add this line of code to your script:
pipe.enable_sequential_cpu_offload()
You will now be able to start your script, but it will be rather slow.
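For context, here is a minimal sketch of how that call fits into a full script (assuming the standard FluxPipeline; swap in whichever model you are actually loading):
import torch
from diffusers import FluxPipeline

# Loading happens on the CPU; sequential offloading then moves each submodule
# to the GPU only while it is needed, so peak VRAM stays low at the cost of speed.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()

image = pipe("A mystic cat with a sign that says hello world!", num_inference_steps=50).images[0]
image.save("flux-offloaded.png")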
The second way is to quantize your models. Here are code examples for different use cases with different models:
# This example uses Flux.1-dev to generate images
import torch
from diffusers import FluxTransformer2DModel, FluxPipeline
model_id = "black-forest-labs/FLUX.1-dev"
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)
pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=50, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")
# This example upscales images with jasperai/Flux.1-dev-Controlnet-Upscaler
import torch
from diffusers.utils import load_image
from diffusers import FluxControlNetModel, BitsAndBytesConfig, FluxTransformer2DModel
from diffusers.pipelines import FluxControlNetPipeline
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
controlnet = FluxControlNetModel.from_pretrained(
    "jasperai/Flux.1-dev-Controlnet-Upscaler",
    quantization_config=nf4_config,
)
model_id = "black-forest-labs/FLUX.1-dev"
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.float16)
pipe = FluxControlNetPipeline.from_pretrained(
    model_id,
    transformer=model_nf4,
    torch_dtype=torch.float16,
    controlnet=controlnet
)
pipe.enable_model_cpu_offload()
control_image = load_image("image.jpg")
image = pipe(
    prompt="",
    control_image=control_image,
    controlnet_conditioning_scale=0.6,
    num_inference_steps=28,
    guidance_scale=3.5,
    height=control_image.size[1],
    width=control_image.size[0]
).images[0]
image.save("upscaled_img_quanted.png")
For these solutions we have to say thank you to @sayakpaul.
Hi, @VadimPoliakov
I am using an A10 GPU with 48 GB VRAM on RunPod, which is ample for the Flux model, and it runs smoothly in a Jupyter notebook. But when deploying with FastAPI I get a CUDA out of memory error.
This issue also occurs with the quantized model.
Any help would be appreciated.
Thanks!
cc @sayakpaul
Hi. I'm not sure, but it seems like a problem with processing more than one image simultaneously. Try using queues for that.
No, the problem is that when you stage your deployment, instead of starting, the API gives a CUDA out of memory error.
If you start with several workers, diffusers tries to put all the models into GPU VRAM once per worker. Make a separate service (not FastAPI) with a queue and no extra workers, and have your FastAPI service just call that service.
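Here is a minimal sketch of that pattern, assuming a single worker process that owns the pipeline and pulls jobs from a queue (the names job_queue and generate_worker are illustrative, not from any library):
import queue
import threading

import torch
from diffusers import FluxPipeline

# One process, one pipeline instance: the model is loaded into VRAM exactly once.
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

job_queue = queue.Queue()

def generate_worker():
    # Jobs are handled strictly one at a time, so only one generation
    # occupies the GPU at any moment.
    while True:
        prompt, out_path = job_queue.get()
        image = pipe(prompt, num_inference_steps=28).images[0]
        image.save(out_path)
        job_queue.task_done()

threading.Thread(target=generate_worker, daemon=True).start()

# The API layer (FastAPI or anything else) only enqueues work instead of
# touching the pipeline directly:
job_queue.put(("A mystic cat with a sign that says hello world!", "out.png"))
job_queue.join()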
Thanks for the help, bro!
It does not run on a Colab T4.
import torch
from diffusers import FluxTransformer2DModel, FluxPipeline

model_id = "black-forest-labs/FLUX.1-dev"
nf4_id = "sayakpaul/flux.1-dev-nf4-with-bnb-integration"
model_nf4 = FluxTransformer2DModel.from_pretrained(nf4_id, torch_dtype=torch.bfloat16)
print(model_nf4.dtype)
print(model_nf4.config.quantization_config)
pipe = FluxPipeline.from_pretrained(model_id, transformer=model_nf4, torch_dtype=torch.bfloat16)
#pipe.enable_model_cpu_offload()
#pipe.enable_sequential_cpu_offload()
prompt = "A mystic cat with a sign that says hello world!"
image = pipe(prompt, guidance_scale=3.5, num_inference_steps=3, generator=torch.manual_seed(0)).images[0]
image.save("flux-nf4-dev-loaded.png")
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'diffusers.quantizers.quantization_config.BitsAndBytesConfig'>.
torch.uint8
BitsAndBytesConfig {
"_load_in_4bit": true,
"_load_in_8bit": false,
"bnb_4bit_compute_dtype": "bfloat16",
"bnb_4bit_quant_storage": "uint8",
"bnb_4bit_quant_type": "nf4",
"bnb_4bit_use_double_quant": false,
"llm_int8_enable_fp32_cpu_offload": false,
"llm_int8_has_fp16_weight": false,
"llm_int8_skip_modules": null,
"llm_int8_threshold": 6.0,
"load_in_4bit": true,
"load_in_8bit": false,
"quant_method": "bitsandbytes"
}
Loading pipeline components...: 100% 7/7 [00:02<00:00, 3.02it/s]
Loading checkpoint shards: 100% 2/2 [00:01<00:00, 1.69it/s]
You set add_prefix_space. The tokenizer needs t
Executing (22m 1s)
Still working
@werruww just create your token on Hugging Face.
just create your token on Hugging Face
How? Where do I find my Access Tokens?
https://huggingface.co/settings/tokens
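Once you have created a token there, one way to use it from a script or notebook is the login helper from huggingface_hub (a sketch; the token value below is a placeholder):
from huggingface_hub import login

# Paste the token created at https://huggingface.co/settings/tokens.
# In Colab you can instead store it as the HF_TOKEN secret.
login(token="hf_xxx")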
This solution does not fit in 24 GB VRAM in my case (the ControlNet version). What is your hardware for that?
@Oguzhanercan My hardware is an NVIDIA 3090 with 24 GB VRAM. When you use ControlNet, that model has to be quantized too, as described in the solution.
@VadimPoliakov I could not quantize the ControlNet for reasons I cannot remember right now, so I used sequential offload to reduce memory usage. Thanks for the reply.
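For reference, the only change relative to the upscaler example above would be the offload call (a sketch; everything else stays the same):
# After building the FluxControlNetPipeline as in the example above, trade
# more speed for lower peak VRAM by using sequential offload instead:
pipe.enable_sequential_cpu_offload()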