
[Feature request] Option to use CPU for VAEEncode/Decode

Open no-connections opened this issue 2 years ago • 9 comments

With --novram, the VAE uses about double the VRAM of the actual SD calculations, which makes it the sole cause of the out-of-VRAM crashes. The experimental tiled VAE is a real game changer, since it lets very high resolutions be reached even on lower-VRAM cards, but tiling isn't without issues: the seams have to happen somewhere. Would it be possible to mix and match CPU/GPU nodes and run the VAE on the CPU while keeping the rest on the GPU?
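
For illustration, the idea boils down to something like the sketch below: move the autoencoder's weights and the latent to system RAM and run the decode there. This assumes a plain PyTorch VAE module with a decode() method (like the AutoencoderKL that ComfyUI wraps); it is not actual ComfyUI code.

import torch

@torch.no_grad()
def decode_on_cpu(vae: torch.nn.Module, latent: torch.Tensor) -> torch.Tensor:
    # Move the autoencoder weights off the GPU, then decode the latent in system RAM.
    vae_cpu = vae.to("cpu")
    return vae_cpu.decode(latent.to("cpu"))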

no-connections avatar Mar 20 '23 16:03 no-connections

I have the same problem: during the KSampler my GPU uses at most 1.2 GB, but during VAEDecode it jumps to 3.2 GB and crashes due to insufficient memory. I already tried --novram; my GPU has 4 GB.

I managed to force the VAE onto the CPU with a small change. What I did: in the sd.py file we have these lines:

class VAE:
    def __init__(self, ckpt_path=None, scale_factor=0.18215, device=None, config=None):
        if config is None:
            #default SD1.x/SD2.x VAE parameters
            ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
            self.first_stage_model = AutoencoderKL(ddconfig, {'target': 'torch.nn.Identity'}, 4, monitor="val/rec_loss", ckpt_path=ckpt_path)
        else:
            self.first_stage_model = AutoencoderKL(**(config['params']), ckpt_path=ckpt_path)
        self.first_stage_model = self.first_stage_model.eval()
        self.scale_factor = scale_factor
        if device is None:
            device = model_management.get_torch_device()
        self.device = device

Change it to:

class VAE:
    def __init__(self, ckpt_path=None, scale_factor=0.18215, device=None, config=None):
        if config is None:
            #default SD1.x/SD2.x VAE parameters
            ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
            self.first_stage_model = AutoencoderKL(ddconfig, {'target': 'torch.nn.Identity'}, 4, monitor="val/rec_loss", ckpt_path=ckpt_path)
        else:
            self.first_stage_model = AutoencoderKL(**(config['params']), ckpt_path=ckpt_path)
        self.first_stage_model = self.first_stage_model.eval()
        self.scale_factor = scale_factor
        self.device = 'cpu'

This alone will break Comfy because xformers is still enabled (I have no idea how to disable it only in the VAE). It may be overkill, but disabling xformers with the --disable-xformers --use-split-cross-attention arguments did the trick, and now I can render images at a higher resolution than usual without crashing. The downside is that with xformers disabled we lose a lot of speed and memory-usage optimizations, so processing is much slower than usual, but apparently still better than running everything on the CPU.

marcussacana avatar Mar 21 '23 19:03 marcussacana

After some tests I was able to keep xformers enabled and still run the VAE on the CPU. This is how: in comfy/ldm/modules/diffusionmodules/model.py, replace the make_attn function with this code:


def make_attn(in_channels, attn_type="vanilla", attn_kwargs=None):
    assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "vanilla-non-efficient", "linear", "none"], f'attn_type {attn_type} unknown'
    if model_management.xformers_enabled() and attn_type == "vanilla":
        attn_type = "vanilla-xformers"
    if model_management.pytorch_attention_enabled() and attn_type == "vanilla":
        attn_type = "vanilla-pytorch"
    print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
    if attn_type == "vanilla" or attn_type == "vanilla-non-efficient":
        assert attn_kwargs is None
        return AttnBlock(in_channels)
    elif attn_type == "vanilla-xformers":
        print(f"building MemoryEfficientAttnBlock with {in_channels} in_channels...")
        return MemoryEfficientAttnBlock(in_channels)
    elif attn_type == "vanilla-pytorch":
        return MemoryEfficientAttnBlockPytorch(in_channels)
    elif attn_type == "memory-efficient-cross-attn":
        attn_kwargs["query_dim"] = in_channels
        return MemoryEfficientCrossAttentionWrapper(**attn_kwargs)
    elif attn_type == "none":
        return nn.Identity(in_channels)
    else:
        raise NotImplementedError()

And in the comfy/sd.py file, find the VAE constructor and replace it with this:

class VAE:
    def __init__(self, ckpt_path=None, scale_factor=0.18215, device=None, config=None):
        if config is None:
            #default SD1.x/SD2.x VAE parameters
            ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'attn_type': 'vanilla-non-efficient'}
            self.first_stage_model = AutoencoderKL(ddconfig, {'target': 'torch.nn.Identity'}, 4, monitor="val/rec_loss", ckpt_path=ckpt_path)
        else:
            self.first_stage_model = AutoencoderKL(**(config['params']), ckpt_path=ckpt_path)
        self.first_stage_model = self.first_stage_model.eval()
        self.scale_factor = scale_factor
        self.device = 'cpu'

To be clearer, the changes are basically these:

in model.py:

I changed this line:

assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "linear", "none"], f'attn_type {attn_type} unknown'

to:

assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "vanilla-non-efficient", "linear", "none"], f'attn_type {attn_type} unknown'

and changed the vanilla condition from:

if attn_type == "vanilla":

to:

if attn_type == "vanilla" or attn_type == "vanilla-non-efficient":

in sd.py:

In the VAE constructor, this line:

ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}

became this:

ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'attn_type': 'vanilla-non-efficient'}

and here:

if device is None:
    device = model_management.get_torch_device()
self.device = device

became this:

self.device = 'cpu'

Performance

I did some tests; here are the results:

ComfyUI in lowvram mode (GTX 1650), 1088x1088 image + xformers
KSampler: 2:31 min, VAE: approx. 5 s

ComfyUI in lowvram mode (GTX 1650), 1152x1152 image + xformers
KSampler: 2:55 min, VAE: out-of-memory crash

ComfyUI in novram mode (GTX 1650), 1088x1088 image + xformers
KSampler: 2:40 min, VAE: 5 s

ComfyUI in novram mode (GTX 1650), 1152x1152 image + xformers
KSampler: 2:40 min, VAE: out-of-memory crash

ComfyUI in novram mode (GTX 1650), 1088x1088 image + xformers + CPU VAE
KSampler: 2:40 min, VAE: 1:02 min

ComfyUI in novram mode (GTX 1650), 1152x1152 image + xformers + CPU VAE
KSampler: 2:40 min, VAE: 1:10 min

ComfyUI in novram mode (GTX 1650), 1920x1920 image + xformers + CPU VAE
KSampler: 15:35 min, VAE: 3:53 min (up to 9 GB of RAM used)


So after these tests there's no doubt that running the VAE on the CPU slows down rendering, but it greatly increases ComfyUI's maximum rendering resolution. I didn't try anything bigger than 1920x1920, since that already takes 9 GB of RAM and much more would exhaust my machine's memory, so that is close to my maximum resolution here.

This makes me think the VAE needs new methods to optimize RAM/VRAM usage. My GPU has only 4 GB of VRAM and the KSampler can 'render' a 1920x1920 image without problems, but the VAE can't even decode a 1152x1152 image with 4 GB of VRAM, so it has more or less become a bottleneck in the project.

The new VAEDecodeTiled node that is under testing might solve this issue as well.
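
Roughly, the tiled-decode idea is to split the latent into fixed-size tiles, decode one tile at a time so only that tile's activations occupy VRAM, and stitch the results back together. Here is a sketch of that idea (not the actual VAEDecodeTiled code); it assumes an autoencoder whose decode() takes (B, C, h, w) latents and returns (B, C, 8h, 8w) images, like the AutoencoderKL above, and it uses plain non-overlapping tiles, which is exactly what produces the seams mentioned earlier (real implementations overlap and blend the tiles).

import torch

@torch.no_grad()
def decode_tiled(vae, latent: torch.Tensor, tile: int = 64) -> torch.Tensor:
    # latent is (B, 4, H/8, W/8); decode it tile by tile and stitch the pieces.
    _, _, h, w = latent.shape
    rows = []
    for y in range(0, h, tile):
        cols = []
        for x in range(0, w, tile):
            chunk = latent[:, :, y:y + tile, x:x + tile]
            cols.append(vae.decode(chunk))      # each decoded tile is 8x larger
        rows.append(torch.cat(cols, dim=-1))    # stitch along width
    return torch.cat(rows, dim=-2)              # then along height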

marcussacana avatar Mar 22 '23 08:03 marcussacana

I made your changes and watched my VRAM usage with a workflow I was experimenting with, without changing any other VRAM settings, and it used around 1.3 GB less VRAM in the final steps with 2 ControlNets enabled. I didn't time it, but for me the speed decrease seems negligible.

Very nice!

fambaa avatar Mar 22 '23 18:03 fambaa

I vote for a command-line option to enable this CPU decoding; it would really make 4K possible on a 3060 with 12 GB. I crash at the decoding stage :)

FraYoshi avatar Mar 23 '23 00:03 FraYoshi

I vote for a command-line option to enable this CPU decoding; it would really make 4K possible on a 3060 with 12 GB. I crash at the decoding stage :)

I would prefer a node that switches to the CPU when needed... But VAEDecodeTiled can solve the memory problem, so I think this will not be implemented.
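
For what it's worth, a minimal sketch of what such a node could look like, assuming the VAE wrapper keeps following the device attribute shown in the comfy/sd.py snippets above (the class below is only an illustration, not an existing ComfyUI node):

class VAEDecodeCPU:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"samples": ("LATENT",), "vae": ("VAE",)}}

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "decode"
    CATEGORY = "latent"

    def decode(self, vae, samples):
        # The hard-coded fix above suggests decode() follows vae.device,
        # so temporarily point it at the CPU and restore it afterwards.
        previous_device = vae.device
        vae.device = "cpu"
        try:
            return (vae.decode(samples["samples"]),)
        finally:
            vae.device = previous_device

NODE_CLASS_MAPPINGS = {"VAEDecodeCPU": VAEDecodeCPU}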

marcussacana avatar Mar 23 '23 01:03 marcussacana

As of today, VAEDecode tries to decode normally and, if it fails because of OOM, retries with the tiled decoding, which I also improved. It's not perfect yet, but it should be seamless.
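
Conceptually, the fallback is just a try/except around the full decode, something like this sketch (not the actual node code); tiled_decode stands in for a tiled helper such as the one sketched earlier:

import torch

def decode_with_fallback(vae, latent, tiled_decode):
    try:
        return vae.decode(latent)            # try the normal full-image decode first
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()             # release the failed allocation
        return tiled_decode(vae, latent)     # then retry with the tiled path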

comfyanonymous avatar Mar 23 '23 01:03 comfyanonymous

@comfyanonymous It seems to be working, but would it be possible to make a tiled version of the VAEEncoderForInpaint too?

marcussacana avatar Mar 25 '23 18:03 marcussacana

Any chance of a CPU VAE decode node? While the tiled VAE decode works and allows higher resolutions, using the CPU is simply faster for this VAE decode... and it is quite bothersome to save latents and then reopen them in CPU mode to decode them.

The modifications made by marcussacana above don't work anymore; they give a "cuda expected and got cpu" error.

MaddyAurora avatar Jul 10 '23 00:07 MaddyAurora

@comfyanonymous

I have never had this failover work. It always OOMs and requires a restart of ComfyUI. If there were a toggle to just replace every VAE with the tiled version it would be less painful, as the regular VAE pretty much always crashes; something like --VAETiled as an argument, or --VAECPU to force the behavior globally regardless of what workflow is loaded. I run into this problem whenever I load someone's workflow, forget to replace every VAE with the tiled one, and end up crashing more often than not.

no-connections avatar Jul 12 '23 18:07 no-connections

The tiled VAE does not always give the same result as the non-tiled VAE. In my case, I am using a node that creates tileable textures by adding padding_mode = "circular" to the Conv2d layers, and the tiled VAE method breaks the seamless effect of padding_mode = "circular". I suggest a command-line argument like --vae-oom-retry [tiled,cpu], with tiled as the default, or something like that. Right now I have hard-coded it to retry on the CPU instead of tiled, but it would be good to have an argument for that.
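
For reference, the circular-padding trick is just flipping every Conv2d in the autoencoder to padding_mode = "circular" so its output wraps around and tiles seamlessly; a small sketch (the helper name is mine, not part of ComfyUI):

import torch.nn as nn

def make_seamless(model: nn.Module) -> None:
    # Switch every Conv2d to circular padding so the decoded texture wraps around.
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            module.padding_mode = "circular"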

jn-jairo avatar Sep 15 '23 02:09 jn-jairo

As mentioned in #2409, the requested feature was added with the argument --cpu-vae.

ky-tt avatar Jul 12 '24 22:07 ky-tt