feat: support node smart offload, reduce peak VRAM/RAM usage
For the sake of decoupling, and to avoid reloading a model every time a node is executed, node developers tend to separate model loading into its own node so that execution speed can benefit from the node cache. Although ComfyUI implements the internal model management method `model_management.load_models_gpu`, it is unrealistic to expect all custom nodes to adopt this approach given the variety of model architectures and developers.
In the current implementation, the outputs of all nodes remain referenced for the entire workflow execution. This prevents larger models or tensors from being garbage-collected, which can result in CUDA out-of-memory errors.
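A minimal, stand-alone sketch of the problem in plain PyTorch (not ComfyUI code): as long as an output cache holds a reference to a node's result, the CUDA allocator cannot reclaim that memory, regardless of whether any remaining node will ever read it.

```python
import torch

# Stand-in for the executor's per-node output cache.
outputs = {}

def run_loader_node():
    # Pretend this is a model-loader node returning a large CUDA tensor
    # (~1 GiB of float32).
    return torch.randn(256 * 1024 * 1024, device="cuda")

outputs["loader"] = run_loader_node()
print(torch.cuda.memory_allocated() // 2**20, "MiB held while cached")

# Only after the cache entry is dropped can the allocator reuse this memory
# for later nodes (e.g. a KSampler); releasing such entries automatically is
# what this PR is about.
del outputs["loader"]
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated() // 2**20, "MiB after releasing the entry")
```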
Let's take the following workflow as an example:
Before the execution of node 24, the models loaded by 🔎Yoloworld Model Loader and 🔎ESAM Model Loader should have been released, and this portion of GPU memory could have been returned to the subsequent memory-intensive KSampler, instead of causing a CUDA out-of-memory error in later steps.
In the example above, AllocateVRAM is used to simulate GPU memory pressure; it is a simple custom node:
```python
import torch


class AnyType(str):
    # Wildcard type that never compares unequal to another type string,
    # so the "anything" input/output accepts arbitrary connections.
    def __ne__(self, __value: object) -> bool:
        return False


any_type = AnyType("*")


class AllocateVRAM:
    @classmethod
    def INPUT_TYPES(s):
        return {
            "required": {
                "anything": (any_type,),
                "size": ("FLOAT", {"min": 0, "max": 1024, "step": 0.01, "default": 1}),
            }
        }

    RETURN_TYPES = (any_type, "TENSOR")
    RETURN_NAMES = ("anything", "tensor")
    FUNCTION = "main"
    CATEGORY = "util"

    def main(self, anything, size):
        # Allocates `size` GiB of float32 on the GPU (4 bytes per element).
        num_elements = 1_073_741_824 // 4
        tensor = torch.randn(int(float(num_elements) * size), device='cuda')
        return (anything, tensor)
```
Its allocation stands in for the memory demand of subsequent, more complex parts of a workflow.
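AllocateVRAM is not shipped with ComfyUI; to try it, one would typically save the class above (plus `import torch`) as a file under `custom_nodes/` and expose it through ComfyUI's standard registration mappings, roughly like this (file name and display name are arbitrary):

```python
# Assumed to live in e.g. custom_nodes/allocate_vram.py together with the
# AllocateVRAM class above; ComfyUI scans custom_nodes/ for these mappings.
NODE_CLASS_MAPPINGS = {
    "AllocateVRAM": AllocateVRAM,
}
NODE_DISPLAY_NAME_MAPPINGS = {
    "AllocateVRAM": "Allocate VRAM (test)",
}
```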
This PR aims to automatically release unreferenced node outputs, reducing peak VRAM/RAM usage during execution and mitigating out-of-memory issues. The following modes are supported (a rough sketch of the release logic follows the list):
- 0: disable this feature;
- 1: release only outputs that are never referenced by any node in the workflow;
- 2: release all currently unreferenced outputs.
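To make the two release modes concrete, here is a rough sketch of the decision logic (not the PR's actual code; `consumers`, `executed`, and `output_cache` are illustrative names):

```python
# Rough sketch: `consumers` maps node_id -> ids of nodes that read its outputs,
# `executed` is the set of nodes that have already run, and `output_cache`
# holds the outputs produced so far.
def release_unreferenced(output_cache, consumers, executed, level):
    if level == 0:  # level 0: feature disabled
        return
    for node_id in list(output_cache):
        remaining = consumers.get(node_id, set())
        if level == 1 and not remaining:
            # Level 1: drop outputs that no node in the workflow ever reads.
            del output_cache[node_id]
        elif level == 2 and remaining <= executed:
            # Level 2: drop outputs whose every consumer has already executed,
            # so nothing still pending can read them.
            del output_cache[node_id]
```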
With `--node-smart-offload-level` set to `2` in the launch arguments, the example workflow above runs well on an A10 GPU with 22 GB of memory.
This implementation has good compatibility, and I believe many workflows would benefit from it.
Do we need that AllocateVRAM node? If so, where can we get it?
Tried this with a workflow (always the same seeds and configs) that generates, hires-fixes, face restores, and hand restores, and in the end it didn't make much of a noticeable difference. It normally takes over 400 sec, so even a 20-30 sec difference would be good to get, but it was only like 5 secs at most, and that could be within normal variation.
AllocateVRAM is a test node, and there is no need to merge it into the default node list, which is why I have provided its implementation code in the PR description. This PR aims to reduce peak VRAM/RAM usage, thereby alleviating OOM issues. It is not intended to accelerate workflow execution, so the lack of significant changes in execution time that you observed is expected. @patientx
I do want this as a node. Is it in the node library?
@guill is there a valuable idea here?
I definitely think there's value in being able to dump cached outputs in the middle of graph execution (for users on low-spec machines). That behavior would have to be re-implemented under the forward execution changes though (PR #2666).
Because the forward execution PR allows the dynamic creation of graph edges during execution (which is what allows loops to work), it's currently impossible to know with certainty when the output of a node is "done" being used so that we can dump it. There are two ways we could add this behavior post execution-inversion:
- In this edge case (where we have already dumped a node's output and then a new edge is added from that node's output), re-execute the node (a rough sketch follows this list). This would increase execution time, but keep RAM/VRAM usage low.
- Restrict what edges a node expansion is allowed to add to the graph. Specifically, we could say that new output connections are only allowed to be added to nodes that already connect to the node performing the expansion. This would work fine for existing use-cases (loops, components, and lazy evaluation), but might restrict things in the future.
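For the first option, the behavior would amount to a re-execute-on-miss cache, roughly like this (hypothetical names, not the actual executor code):

```python
# Hypothetical sketch of option 1: if a dynamically added edge requests an
# output that was already dumped, re-run the producing node instead of failing.
def get_output(node_id, output_cache, execute_node):
    if node_id not in output_cache:
        # The output was released earlier to keep RAM/VRAM low; trade extra
        # execution time for memory by recomputing it on demand.
        output_cache[node_id] = execute_node(node_id)
    return output_cache[node_id]
```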
Can we just disable output dumping if there is any loop behavior? I.e., enable eager unloading as long as the graph is a truly static DAG?
Currently, there's no way to know ahead of time which nodes may expand and which won't -- and I'm not totally sure that we would want to disable this feature based on that anyway. If people are enabling this feature, it's likely because they simply don't have the RAM/VRAM to handle keeping the models in memory. If the alternative is "execution fails with an out of memory error", loading the model from disk multiple times may be preferable anyway.
seems to be a good addition

