ComfyUI
                        [Feature request] Option to use CPU for VAEEncode/Decode
In --novram mode the VAE uses roughly double the VRAM of the actual SD sampling, which makes it the main cause of out-of-VRAM crashes. The experimental tiled VAE is a real game changer, since it lets very high resolutions be reached even on low-VRAM cards, but tiling isn't without issues: seams have to appear somewhere. Would it be possible to mix and match CPU/GPU nodes and run the VAE on the CPU while keeping everything else on the GPU?
I have the same problem: during KSampler my GPU uses at most 1.2 GB, but in VAEDecode usage jumps to 3.2 GB and it crashes with an out-of-memory error. I already tried --novram; my GPU has 4 GB.
I managed to force the VAE onto the CPU with a small change. Here is what I did:
In the sd.py file we have these lines:
class VAE:
    def __init__(self, ckpt_path=None, scale_factor=0.18215, device=None, config=None):
        if config is None:
            #default SD1.x/SD2.x VAE parameters
            ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
            self.first_stage_model = AutoencoderKL(ddconfig, {'target': 'torch.nn.Identity'}, 4, monitor="val/rec_loss", ckpt_path=ckpt_path)
        else:
            self.first_stage_model = AutoencoderKL(**(config['params']), ckpt_path=ckpt_path)
        self.first_stage_model = self.first_stage_model.eval()
        self.scale_factor = scale_factor
        if device is None:
            device = model_management.get_torch_device()
        self.device = device
change to
class VAE:
    def __init__(self, ckpt_path=None, scale_factor=0.18215, device=None, config=None):
        if config is None:
            #default SD1.x/SD2.x VAE parameters
            ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
            self.first_stage_model = AutoencoderKL(ddconfig, {'target': 'torch.nn.Identity'}, 4, monitor="val/rec_loss", ckpt_path=ckpt_path)
        else:
            self.first_stage_model = AutoencoderKL(**(config['params']), ckpt_path=ckpt_path)
        self.first_stage_model = self.first_stage_model.eval()
        self.scale_factor = scale_factor
        self.device = 'cpu'
This alone breaks Comfy because xformers is still enabled (I have no idea how to disable it only for the VAE).
It may be overkill, but disabling xformers with the --disable-xformers --use-split-cross-attention arguments did the trick, and now I can render images at higher resolutions than usual without crashing.
The problem with this approach is that, with xformers disabled, we lose a lot of speed and memory optimizations, so processing is much slower than usual, but apparently still better than running everything on the CPU.
After some tests I was able to keep xformers enabled and still run the VAE on the CPU. Here is how:
In comfy/ldm/modules/diffusionmodules/model.py, replace the make_attn function with this code:
def make_attn(in_channels, attn_type="vanilla", attn_kwargs=None):
    assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "vanilla-non-efficient", "linear", "none"], f'attn_type {attn_type} unknown'
    if model_management.xformers_enabled() and attn_type == "vanilla":
        attn_type = "vanilla-xformers"
    if model_management.pytorch_attention_enabled() and attn_type == "vanilla":
        attn_type = "vanilla-pytorch"
    print(f"making attention of type '{attn_type}' with {in_channels} in_channels")
    if attn_type == "vanilla" or attn_type == "vanilla-non-efficient":
        assert attn_kwargs is None
        return AttnBlock(in_channels)
    elif attn_type == "vanilla-xformers":
        print(f"building MemoryEfficientAttnBlock with {in_channels} in_channels...")
        return MemoryEfficientAttnBlock(in_channels)
    elif attn_type == "vanilla-pytorch":
        return MemoryEfficientAttnBlockPytorch(in_channels)
    elif attn_type == "memory-efficient-cross-attn":
        attn_kwargs["query_dim"] = in_channels
        return MemoryEfficientCrossAttentionWrapper(**attn_kwargs)
    elif attn_type == "none":
        return nn.Identity(in_channels)
    else:
        raise NotImplementedError()
And in the comfy/sd.py file, look for the VAE constructor and replace it with this:
class VAE:
    def __init__(self, ckpt_path=None, scale_factor=0.18215, device=None, config=None):
        if config is None:
            #default SD1.x/SD2.x VAE parameters
            ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'attn_type': 'vanilla-non-efficient'}
            self.first_stage_model = AutoencoderKL(ddconfig, {'target': 'torch.nn.Identity'}, 4, monitor="val/rec_loss", ckpt_path=ckpt_path)
        else:
            self.first_stage_model = AutoencoderKL(**(config['params']), ckpt_path=ckpt_path)
        self.first_stage_model = self.first_stage_model.eval()
        self.scale_factor = scale_factor
        self.device = 'cpu'
To be clearer, the changes are basically these (the new 'vanilla-non-efficient' attention type keeps the VAE on the plain attention implementation, which is what lets it run on the CPU while the rest of the model keeps using xformers)...
in model.py:
changed this line from
assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "linear", "none"], f'attn_type {attn_type} unknown'
to
assert attn_type in ["vanilla", "vanilla-xformers", "memory-efficient-cross-attn", "vanilla-non-efficient", "linear", "none"], f'attn_type {attn_type} unknown'
and in the vanilla branch:
if attn_type == "vanilla":
became:
if attn_type == "vanilla" or attn_type == "vanilla-non-efficient":
in sd.py:
this line (in the VAE constructor) changed from:
ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0}
became this:
ddconfig = {'double_z': True, 'z_channels': 4, 'resolution': 256, 'in_channels': 3, 'out_ch': 3, 'ch': 128, 'ch_mult': [1, 2, 4, 4], 'num_res_blocks': 2, 'attn_resolutions': [], 'dropout': 0.0, 'attn_type': 'vanilla-non-efficient'}
and here:
if device is None:
    device = model_management.get_torch_device()
self.device = device
became this:
self.device = 'cpu'
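For illustration only, the same change could be made toggleable instead of hard-coded. This is a minimal sketch, not ComfyUI code; the COMFY_CPU_VAE environment variable and the pick_vae_device helper are made up for the example:
import os
import torch

def pick_vae_device(requested=None):
    # Hypothetical helper: force the VAE onto the CPU only when the
    # (made-up) COMFY_CPU_VAE=1 environment variable is set, otherwise
    # keep the usual behavior of using the requested/default device.
    if os.environ.get("COMFY_CPU_VAE") == "1":
        return torch.device("cpu")
    if requested is not None:
        return requested
    return torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
Inside the constructor, self.device = pick_vae_device(device) would then replace the hard-coded 'cpu' assignment.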
Performance
I did some tests; here are the results:
ComfyUI in lowvram mode (GTX 1650) with a 1088x1088 image + xformers
KSampler: 2:31m, VAE: approx. 5s
ComfyUI in lowvram mode (GTX 1650) with a 1152x1152 image + xformers
KSampler: 2:55m, VAE: out-of-memory crash
ComfyUI in novram mode (GTX 1650) with a 1088x1088 image + xformers
KSampler: 2:40m, VAE: 5s
ComfyUI in novram mode (GTX 1650) with a 1152x1152 image + xformers
KSampler: 2:40m, VAE: out-of-memory crash
ComfyUI in novram mode (GTX 1650) with a 1088x1088 image + xformers + CPU VAE
KSampler: 2:40m, VAE: 1:02m
ComfyUI in novram mode (GTX 1650) with a 1152x1152 image + xformers + CPU VAE
KSampler: 2:40m, VAE: 1:10m
ComfyUI in novram mode (GTX 1650) with a 1920x1920 image + xformers + CPU VAE
KSampler: 15:35m, VAE: 3:53m (up to 9 GB of RAM used)
So, after these tests, there is no doubt that running the VAE on the CPU slows down rendering but greatly increases ComfyUI's maximum rendering resolution. I didn't try a resolution bigger than 1920x1920 since that already takes 9 GB of RAM, and with just a few more MB my computer's RAM wouldn't be enough, so that is near my maximum resolution here.
These tests make me think the VAE needs new methods to optimize RAM/VRAM usage: my GPU has only 4 GB of VRAM and KSampler can 'render' a 1920x1920 image without problems, but the VAE can't even handle a 1152x1152 image with 4 GB of VRAM, so it has more or less become a bottleneck in the project.
The new VAEDecodeTiled node that is under test might solve this issue as well.
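For readers unfamiliar with the idea, this is roughly what tiled decoding does: the latent is split into overlapping tiles, each tile is decoded on its own, and the overlaps are blended so the GPU only ever holds one tile at a time. A simplified sketch, not the actual VAEDecodeTiled code (decode_fn stands in for the real VAE decode):
import torch

def tiled_decode(decode_fn, latent, tile=64, overlap=8, scale=8):
    # decode_fn maps a latent tile (B, C, h, w) to pixels (B, 3, h*scale, w*scale).
    # Tiles are decoded independently and averaged where they overlap,
    # which is how the seams between tiles are hidden.
    b, c, h, w = latent.shape
    out = torch.zeros(b, 3, h * scale, w * scale)
    weight = torch.zeros_like(out)
    step = max(tile - overlap, 1)
    for y in range(0, h, step):
        for x in range(0, w, step):
            y0 = min(y, max(h - tile, 0))
            x0 = min(x, max(w - tile, 0))
            piece = decode_fn(latent[:, :, y0:y0 + tile, x0:x0 + tile])
            oy, ox = y0 * scale, x0 * scale
            out[:, :, oy:oy + piece.shape[2], ox:ox + piece.shape[3]] += piece
            weight[:, :, oy:oy + piece.shape[2], ox:ox + piece.shape[3]] += 1.0
    return out / weight.clamp(min=1.0)
The real node blends the overlap more carefully than this flat average, which is how it tries to keep the seams the first post worries about from showing.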
I made your changes and watched my VRAM usage with a workflow I was experimenting with, without changing any other VRAM settings, and it outright used around 1.3 GB less VRAM in the final steps with 2 ControlNets enabled. I didn't time it, but for me the speed decrease seems negligible.
Very nice!
I vote for a command-line option to enable this CPU decoding; it would really make 4K possible on a 3060 with 12 GB. I crash at the decoding stage :)
I would prefer a node to switch to the CPU when needed... but VAEDecodeTiled can solve the memory problem, so I think this won't be implemented.
As of today, VAEDecode tries to decode normally and, if it fails because of OOM, retries with the tiled decoding, which I also improved. It's not perfect yet, but it should be seamless.
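The fallback described above is essentially a try/except around the normal decode. A rough sketch of the pattern, assuming a tiled decoder like the one sketched earlier (this is not the exact ComfyUI code, and older PyTorch versions raise a plain RuntimeError instead of torch.cuda.OutOfMemoryError):
import torch

def decode_with_fallback(decode_fn, tiled_decode_fn, latent):
    # Try the normal full-image decode first; only fall back to the
    # slower tiled path when the GPU actually runs out of memory.
    try:
        return decode_fn(latent)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()  # release what the failed attempt allocated
        return tiled_decode_fn(latent)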
@comfyanonymous It seems to work, but is it possible to make a tiled version of VAEEncodeForInpaint too?
Any chance of a CPU VAE decode node? While tiled VAE decoding works and allows higher resolutions, using the CPU is simply faster for this VAE decode... and it is quite bothersome to save latents and then reopen them in CPU mode to decode them.
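For what it's worth, a custom node along these lines is possible. This is only a sketch under the assumption that the VAE object still exposes first_stage_model, device, and decode() the way the sd.py snippets above do (that interface can and does change between ComfyUI versions), and the VAEDecodeCPU name is made up:
import torch

class VAEDecodeCPU:
    @classmethod
    def INPUT_TYPES(cls):
        return {"required": {"samples": ("LATENT",), "vae": ("VAE",)}}

    RETURN_TYPES = ("IMAGE",)
    FUNCTION = "decode"
    CATEGORY = "latent"

    def decode(self, vae, samples):
        # Temporarily move the VAE to the CPU, decode, then restore it,
        # so the rest of the workflow keeps running on the GPU.
        prev_device = vae.device
        vae.first_stage_model.to("cpu")
        vae.device = torch.device("cpu")
        try:
            image = vae.decode(samples["samples"])
        finally:
            vae.first_stage_model.to(prev_device)
            vae.device = prev_device
        return (image,)

NODE_CLASS_MAPPINGS = {"VAEDecodeCPU": VAEDecodeCPU}
Dropped into the custom_nodes/ folder it should show up and be wired exactly like the regular VAEDecode node.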
The modifications made by marcussacana above don't work anymore; they give a "cuda expected and got cpu" error.
@comfyanonymous
I have never had this failover work. It always OOMs and requires a restart of ComfyUI. If there were a toggle to just replace every VAE with the tiled version it would be less painful, since the regular VAE pretty much always crashes. Something like --VAETiled as an argument, or --VAECPU to force the behavior globally regardless of what workflow is loaded. I run into this problem whenever I load someone's workflow and forget to replace every VAE node with the tiled one, and I end up crashing more often than not.
The tiled VAE doesn't always give the same result as the non-tiled VAE. In my case I am using a node that creates tileable textures by adding padding_mode = "circular" to the Conv2d layers, and the tiled VAE breaks the seamlessness that padding_mode = "circular" provides. I suggest a command-line argument like --vae-oom-retry [tiled,cpu], with tiled as the default, or something like that. Right now I have hard-coded it to retry on the CPU instead of tiled, but it would be good to have an argument for that.
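As a side note on the seamless-texture point: the padding_mode trick mentioned above is usually applied by patching every Conv2d after the model is loaded, roughly like this (a generic sketch, not the specific node being referred to):
import torch.nn as nn

def make_convs_circular(model: nn.Module) -> nn.Module:
    # Switch every Conv2d to circular padding so the decoded image wraps
    # around at its borders and tiles seamlessly.
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            module.padding_mode = "circular"
    return model
With a tiled decode each tile's convolutions only wrap around that tile rather than the full image, which is presumably why the seamless effect gets lost.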
As mentioned in  #2409, the requested feature was added with the argument --cpu-vae.
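For anyone landing here later: assuming main.py is still the entry point, launching ComfyUI with python main.py --cpu-vae should run only the VAE on the CPU while leaving the rest of the pipeline on the GPU, which is the behavior requested at the top of this thread.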