Sequential CPU offload and VAE tiling to lower the VRAM requirement

Open rockerBOO opened this issue 9 months ago • 0 comments

I haven't been able to test this code myself yet, but I'll try it at some point. It should make generation more efficient in terms of VRAM usage: sequential CPU offload takes 1024x1024 down to around 1 GB of VRAM, which should allow this to run on consumer GPUs, at the cost of longer generation times.

import torch
from diffusers import FlowMatchEulerDiscreteScheduler
from pipeline_flux import FluxPipeline
from transformer_flux import FluxTransformer2DModel

bfl_repo = "black-forest-labs/FLUX.1-dev"

scheduler_config = FlowMatchEulerDiscreteScheduler.load_config(bfl_repo, subfolder="scheduler")
# load_config returns a plain dict, so set the key rather than an attribute
scheduler_config["use_dynamic_shifting"] = False
scheduler = FlowMatchEulerDiscreteScheduler.from_config(scheduler_config)

transformer = FluxTransformer2DModel.from_pretrained(bfl_repo, subfolder="transformer", torch_dtype=torch.bfloat16)
# pass the scheduler built above; otherwise it is never used
pipe = FluxPipeline.from_pretrained(bfl_repo, transformer=transformer, scheduler=scheduler, torch_dtype=torch.bfloat16)
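# Sequential CPU offload moves each submodule to the GPU only while it runs: lowest VRAM use, but slower.
# It replaces enable_model_cpu_offload(); enable only one of the two. Remove it if you have enough VRAM.
pipe.enable_sequential_cpu_offload()
# Decode the latents in tiles so the VAE never has to hold the full image in VRAM at once
pipe.vae.enable_tiling()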

pipe.load_lora_weights("Huage001/URAE", weight_name="urae_2k_adapter.safetensors")

prompt = "An astronaut riding a green horse"
image = pipe(
    prompt,
    height=2048,
    width=2048,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-urae.png")
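
If it helps, here is a rough sketch of how the offload choice could be made switchable, so people with more VRAM can pick the faster mode. `apply_memory_savers` is a name I made up for illustration; it only assumes that `pipe` exposes `enable_sequential_cpu_offload()`, `enable_model_cpu_offload()` and a `vae` with `enable_tiling()`, as diffusers pipelines normally do. Untested against the real pipeline.

```python
def apply_memory_savers(pipe, sequential=True, tile_vae=True):
    """Enable exactly one CPU offload mode and, optionally, VAE tiling.

    Hypothetical helper: `pipe` is assumed to be a diffusers-style pipeline.
    """
    if sequential:
        # Lowest VRAM use, slowest: submodules are moved to the GPU one at a time.
        pipe.enable_sequential_cpu_offload()
    else:
        # Faster, but needs more VRAM: whole models are moved to the GPU in turn.
        pipe.enable_model_cpu_offload()
    if tile_vae:
        # Decode the image tile by tile instead of in a single pass.
        pipe.vae.enable_tiling()
    return pipe
```

Usage would be `apply_memory_savers(pipe)` right after `from_pretrained`, or `apply_memory_savers(pipe, sequential=False)` on a card with more memory.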

It might also be good to document the other options:

    proportional_attention=True,
    ntk_factor=10.0,
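
For the docs, something like the sketch below could show where these go. It assumes `proportional_attention` and `ntk_factor` are keyword arguments of the `pipe(...)` call in `pipeline_flux`, which I have not verified; `build_call_kwargs` is a made-up helper name, and the other defaults are just copied from the example above.

```python
def build_call_kwargs(height=2048, width=2048,
                      proportional_attention=None, ntk_factor=None):
    """Collect keyword arguments for pipe(prompt, **kwargs).

    Hypothetical helper: the two optional settings are only included
    when the caller sets them, so the pipeline defaults apply otherwise.
    """
    kwargs = {
        "height": height,
        "width": width,
        "guidance_scale": 3.5,
        "num_inference_steps": 50,
        "max_sequence_length": 512,
    }
    if proportional_attention is not None:
        # Assumed to be accepted by the custom FluxPipeline call
        kwargs["proportional_attention"] = proportional_attention
    if ntk_factor is not None:
        # Assumed to be accepted by the custom FluxPipeline call
        kwargs["ntk_factor"] = ntk_factor
    return kwargs
```

The call would then look like `pipe(prompt, **build_call_kwargs(proportional_attention=True, ntk_factor=10.0))`.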

Mar 24 '25 23:03 rockerBOO