MultiGPU Work Units For Accelerated Sampling
Overview
This PR adds support for MultiGPU acceleration via 'work unit' splitting - by default, conditionings are treated as work units. Any model that uses more than a single conditioning can be sped up via MultiGPU Work Units - positive+negative, multiple positive/masked conditionings, etc. The code is extensible so that extensions can implement their own work units; as a proof of concept, I have implemented AnimateDiff-Evolved contexts to behave as work units.
As long as there is a heavy bottleneck on the GPU, there will be a noticeable performance improvement. If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead to split and combine work units will result in performance loss compared to using just one GPU.
The MultiGPU Work Units node can be placed in (almost) any existing workflow. When only one device is found, the node does effectively nothing, so workflows making use of the node will stay compatible between single and multi-GPU setups:
The feature works best when work splitting is symmetrical (GPUs are the same/have roughly the same performance), with the slowest GPU acting as the limiter. For asymmetrical setups, the MultiGPU Options node can be used to inform load balancing code about the relative performance of the MultiGPU setup:
- Nvidia (CUDA): Tested, works ✅
- AMD (ROCm): Untested, will validate soon
- AMD (DirectML): Untested
- Intel (Arc XPU): Tested, does not work; pytorch would need to properly support multiple XPU devices ❌
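To illustrate the load-balancing idea behind the MultiGPU Options node, here is a rough sketch (hypothetical names, not the PR's actual code) of handing out work units in proportion to relative device performance:

```python
# Hypothetical sketch: distribute work units across devices in proportion to their
# relative performance, as the MultiGPU Options node is meant to inform.
# `split_work_units` and `relative_speeds` are illustrative names only.
def split_work_units(work_units, relative_speeds):
    # relative_speeds: one positive float per device, e.g. [1.0, 0.6] for a fast + slow GPU
    assignments = [[] for _ in relative_speeds]
    load = [0.0 for _ in relative_speeds]
    for unit in work_units:
        # Greedy choice: give the next unit to the device with the lowest
        # normalized load, so the slowest GPU does not become the limiter.
        idx = min(range(len(relative_speeds)), key=lambda i: load[i] / relative_speeds[i])
        assignments[idx].append(unit)
        load[idx] += 1.0
    return assignments

# Example: 4 conds on two equally fast GPUs -> 2 work units each.
print(split_work_units(["pos_a", "pos_b", "neg_a", "neg_b"], [1.0, 1.0]))
```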
Implementation Details
Based on max_gpus and the available number of devices, the main ModelPatcher is cloned and relevant properties (like model) are deepcloned after the values are unloaded. MultiGPU clones are stored on the ModelPatcher's additional_models under the key multigpu. During sampling, the deepcloned ModelPatchers are re-cloned with the values from the main ModelPatcher, with any additional_models kept consistent. To avoid unnecessarily deepcloning models, currently_loaded_models from comfy.model_management is checked for a matching deepcloned model, in which case it is (soft) cloned and made to match the main ModelPatcher.
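A condensed sketch of that clone setup (hypothetical helper/attribute usage based on the description above, not the PR's literal code):

```python
import copy
from comfy import model_management

# Hypothetical sketch of the per-device clone setup described above. The exact
# signature of unload_model_and_clones and the additional_models layout are
# assumptions based on this PR's description.
def create_multigpu_clones(main_patcher, extra_devices):
    # Deepcloning requires a "clean" base model, so unload it (and its clones) first.
    model_management.unload_model_and_clones(main_patcher)

    clones = []
    for device in extra_devices:  # one clone per additional GPU; device 0 keeps the original
        clone = main_patcher.clone()                     # soft clone: shares patches/options
        clone.model = copy.deepcopy(main_patcher.model)  # independent weight copy for this GPU
        clone.load_device = device                       # target device for this clone (attribute name assumed)
        # In the PR, a matching deepcloned model already present in
        # currently_loaded_models would be soft-cloned here instead of deepcopied.
        clones.append(clone)

    # Track the clones under the main patcher's additional_models["multigpu"] key so
    # they are re-cloned/synced with the main patcher at sampling time.
    main_patcher.additional_models.setdefault("multigpu", []).extend(clones)
    return clones
```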
When native conds are used as the work units, _calc_cond_batch calls and returns _calc_cond_batch_multigpu to avoid potential performance regressions that could come from refactoring the single-GPU code. In the future, this can be revisited to reuse the same code while carefully comparing performance for various models. No processes are created, only Python threads; while the GIL does limit CPU performance, the GPU being the bottleneck makes diffusion I/O-bound rather than CPU-bound. This vastly improves compatibility with existing code.
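The threading pattern can be sketched roughly like this (run_units_on_device is a hypothetical stand-in for the real per-device cond-batch code, not _calc_cond_batch_multigpu itself):

```python
from concurrent.futures import ThreadPoolExecutor

import torch

# Rough sketch of the thread-per-device pattern described above. Each Python thread
# drives one GPU with its chunk of the work units; the GIL is not a practical limit
# because the threads mostly wait on GPU work.
def run_work_units_multigpu(unit_chunks, model_clones, devices, run_units_on_device):
    results = [None] * len(devices)

    def worker(i):
        with torch.cuda.device(devices[i]):
            results[i] = run_units_on_device(model_clones[i], unit_chunks[i])

    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        list(pool.map(worker, range(len(devices))))  # list() forces execution and surfaces errors

    # Gather the per-device outputs back on the primary device before they are
    # combined into the final cond/uncond predictions.
    return [r.to(devices[0]) if torch.is_tensor(r) else r for r in results]
```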
Since deepcloning requires that the base model is 'clean', comfy.model_management has received an unload_model_and_clones function to unload only specific models and their clones.
The --cuda-device startup argument has been refactored to accept a string rather than an int, allowing multiple ids to be provided without breaking any existing usage:
This can be used not only to limit ComfyUI's visibility to a subset of devices per instance, but also to control their order (the first id is treated as device:0, the second as device:1, etc.).
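For reference, the argument change amounts to roughly this (a sketch, not the exact diff); the ordering works because CUDA enumerates devices in the order they appear in CUDA_VISIBLE_DEVICES:

```python
import argparse
import os

# Sketch of the refactored startup argument: the value is a string now, so both the
# old single-id usage ("--cuda-device 1") and a comma-separated list keep working.
parser = argparse.ArgumentParser()
parser.add_argument("--cuda-device", type=str, default=None,
                    help="Single id ('1') or comma-separated list ('2,3,0,1') of CUDA device ids.")
args = parser.parse_args(["--cuda-device", "2,3,0,1"])

if args.cuda_device is not None:
    # The first id listed becomes cuda:0, the second cuda:1, and so on.
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device
```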
Performance (will add more examples soon)
Wan 1.3B t2v: 1.85x uplift for 2 RTX 4090s vs 1 RTX 4090.
Wan 14B t2v: 1.89x uplift for 2 RTX 4090s vs 1 RTX 4090.
Brother, first of all, I have to thank you—this is a great PR! I have four T4 GPUs, but each T4 only has 16GB of VRAM. When I run my workflow, I often encounter out-of-memory (OOM) issues on a single GPU, while the other three GPUs remain unused because ComfyUI does not support distributed processing. Can your MultiGPU feature allow me to utilize the other three GPUs as well? I don’t mind sacrificing some performance as long as I can run my workflow without hitting OOM errors.
@hhaishen The Work Units feature accelerates sampling, meaning each device gets its own copy of the model weights to run - it does not add any functionality to distribute portions of weights to run on other devices.
There will be a PR in the future that addresses that use case.
Hello, can this branch achieve the goal of running a task in parallel using multiple GPUs?
I have implemented dual A100 80G inference on WAN2.1, but the other six A100s only provide VRAM and do not participate in the inference computation.
Oh, that's amazing! How is it done?
I switched to his branch to access the MultiGPU nodes, but I have tried multiple times and found that it can only run inference on up to 2 GPUs simultaneously; additional GPUs do not participate in inference acceleration.
Will it work with HunyuanVideo? It has two multi-GPU implementations as well, xDiT and SkyReels torch distributed, with the SkyReels implementation providing a near 2x speedup. Thank you for the great work.
Thanks for the big thing!
I have tested SDXL (Euler a, 28 steps) at 832x1216 and 1216x1856, and WAN2.1 I2V 480P (from the ComfyUI Examples, but fp8).
| | 1x4090 | 1x3090 | 2x3090 |
|---|---|---|---|
| SDXL 832x1216 | 7.61it/s, 3.90 sec | 3.77it/s, 7.76 sec | 6.78it/s, 4.46 sec |
| SDXL 1216x1563 | 3.22it/s, 9.20 sec | 1.50it/s, 19.45 sec | 2.79it/s, 10.80 sec |
| WAN I2V 480P fp8 | 5.70s/it, 116.18 sec | 10.80s/it, 219.19 sec | 5.63s/it, 115.70 sec |
- 4090: power limit 300W / 3090: power limit 270W, PCIe 4.0x8
In the case of 4x3090, the model is loaded on all 4 GPUs but only 2 GPUs are used.
Have you passed --cuda-device 0,1,2,3? If so, maybe the GPUs are not loaded 100%: "If the GPU is only lightly loaded (e.g. an RTX 4090 sampling a single 512x512 SD1.5 image), the overhead to split and combine work units will result in performance loss compared to using just one GPU." If that's the case, try a more intensive task.
I have passed it in both ways: 'CUDA_VISIBLE_DEVICES=2,3,0,1 python main.py' and 'python main.py --cuda-device 2,3,0,1'.
Here's a screenshot of nvtop.
I will keep trying and at least figure out what my problem is.
I used --cuda-device 2,3,0,1 to allocate GPUs and found that it is not related to the device allocation values in the nodes; it still only recognizes the --cuda-device allocation.
Hello, can you share a workflow for using multiple GPUs?
Okay, I read the code and the author's article. Only 2 GPUs are working because the work units are conditioning-based; diffusion models usually have 2 conditionings (positive and negative), so the work is divided by 2.
If I combine conditionings (using the ConditioningCombine node), I could use more than 2 GPUs.
(However, it will decrease speed by as much as the conditionings I added, so it is meaningless.)
@wang153723482, you just need to put MultiGPU Work Units between Load Checkpoint and KSampler.
I have implemented AnimateDiff-Evolved contexts to behave as work units.
You mean this is an example of a working work unit? https://github.com/Kosinkadink/ComfyUI-AnimateDiff-Evolved/blob/main/animatediff/context.py
Also, could you provide a hint on how it might work with this sampler logic? https://github.com/kijai/ComfyUI-WanVideoWrapper/blob/main/nodes.py
Okay, thank you very much. I will continue to pay attention and hope to achieve more GPU parallel processing for a task.
Great work! Would this allow for the duplication of a single GPU?
Say that I'm running SD15 inference on a 4090, can I share the remaining VRAM among multiple instances on the same GPU?
@comfyanonymous Hello, can we expect this to be merged at some point? If so, would there be a rough ETA? Thank you for your time!
Being able to tell it which GPUs to use would be nice for systems with different GPU models. I've tried to use the MultiGPU Options node without any success.
On my system, due to the physical and logical layout, the two 4090s are CUDA IDs 2 and 4 while the three P40s are 0, 1, and 3 (I was hoping to use the P40s for the other models, t5, vae, etc.), so annoyingly cuda:0 and cuda:1 end up being extremely slow GPUs for anything modern. ComfyUI appears to force CUDA_DEVICE_ORDER=PCI_BUS_ID, as CUDA_DEVICE_ORDER=FASTEST_FIRST would possibly resolve some of the issues.
CUDA 12.4, torch 2.6, Python 3.11, SageAttention 2.
A simple Wan workflow with 2 GPUs runs into this when using SageAttention; 1 GPU is fine, and 2 GPUs with default attention is also fine:
File "/vidgen/ComfyUI/comfy/ldm/wan/model.py", line 73, in forward
x = optimized_attention(
^^^^^^^^^^^^^^^^^^^^
File "/vidgen/ComfyUI/comfy/ldm/modules/attention.py", line 492, in attention_sage
out = sageattn(q, k, v, attn_mask=mask, is_causal=False, tensor_layout=tensor_layout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sageattention/core.py", line 105, in sageattn
q_int8, q_scale, k_int8, k_scale = per_block_int8(q, k, sm_scale=sm_scale, tensor_layout=tensor_layout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/sageattention/quant_per_block.py", line 63, in per_block_int8
quant_per_block_int8_kernel[grid](
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 330, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 653, in run
kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 444, in __call__
self.launch(*args, **kwargs)
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
edit1: maybe need to do torch.cuda.set_device(device) in the multi-GPU code?
@without-ordinary You can set the order (and/or) subset of GPUs via --cuda-device startup argument. So you could do --cuda-device 2,4,0,1,3 to make sure your 4090s will appear first. On startup, the device names and their assignments will be in the logs as well so you can be sure of their order without needing to run anything to check.
Wait, the order passed to CUDA_VISIBLE_DEVICES should reorder the devices? Or is that something you've added within Comfy? AFAIK, the CUDA_VISIBLE_DEVICES order does not matter or affect anything, only which devices show.
I had actually used --cuda-device 2,4,0,1,3 for all but the last few tests and was still seeing the first PCI_BUS_ID card getting used first.
The order shown at startup has never been affected by anything I've tried, either yesterday on this fork or in the past (i.e. CUDA_DEVICE_ORDER=FASTEST_FIRST). I had assumed Comfy forced CUDA_DEVICE_ORDER=PCI_BUS_ID somehow and I could not track down how or why.
Device: cuda:0 Tesla P40 : cudaMallocAsync
Device: cuda:1 Tesla P40 : cudaMallocAsync
Device: cuda:2 NVIDIA GeForce RTX 4090 : cudaMallocAsync
Device: cuda:3 Tesla P40 : cudaMallocAsync
Device: cuda:4 NVIDIA GeForce RTX 4090 : cudaMallocAsync
@without-ordinary Yes, the order should work. Not sure if the cuda driver works differently on linux as I have only Windows installed on my multigpu rig currently, but it works as intended there.
By default, the driver picks up my GPUs like this:
If I add --cuda-device 2,3,0,1, I get this:
And by running a workflow with max_gpus=2, I can confirm it is using the 3090s instead of the 4090s based on the sampling speed, GPU utilization, and VRAM usage.
@kunibald413 Nah, that call is not needed (it's deprecated); from your error message it looks like triton may be unhappy. Could you link me your workflow so I can investigate?
Add the GPU IDs to the end of the startup command to select hardware: python main.py --cuda-device 0,1
@Kosinkadink workflow
2 x 4090 torch==2.6.0+cu124
SageAttention 2 compiled from source as they say in the readme: https://github.com/thu-ml/SageAttention
started comfy with --cuda-device=0,1 --use-sage-attention
i2v_wan_multigpu_unit.json (is fine on single gpu or using 2 gpu with default attention)
thanks for checking
edit: it was an install issue on my end, I'm sorry for wasting your time. Thanks for checking.
Confirmed working: basic SDXL T2I on 2x Arc B580, using torch 2.6.1 + Intel Extension 2.6.10.
| | it/s | time taken (sec) |
|---|---|---|
| 1xB580 | 3.68 | 8.92 |
| 2xB580 | 6.58 | 5.64 |
That is about a 58% boost.
I also ran WAN2.1 I2V 480P from the ComfyUI Examples, but changed the dtype to fp8e5m2.
I had to use the --disable-ipex-optimize option and the environment variables below.
(I don't know what's going on...)
# Configure oneAPI environment variables.
source /opt/intel/oneapi/setvars.sh
# Recommended Environment Variables for optimal performance
export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
# [optional] under most circumstances, the following environment variable may improve performance, but sometimes this may also cause performance degradation
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
| | s/it | time taken (sec) |
|---|---|---|
| 1xB580 | 15.44 | 338.39 |
| 2xB580 | 8.39 | 178.14 |
That's about a 90% boost.
Just discovered this pull request, it seems amazing! If it's not much trouble, have you tested Wan with a 4090 + 3090? Thanks!
This was mentioned to me when I was talking about this in a Discord server - could we get a mode that simply splits batches across GPUs instead of pos/neg? It's awful getting multi-GPU cobbled together with custom nodes and multiple Comfy instances for lighter models like SDXL. To just have a drop-in node to split a batch across multiple GPUs with no overhead would be FANTASTIC.
So for a batch of 6, 3 on one GPU and 3 on another. Or whatever it divides out to be based on relative speed. Perfect scaling, no overhead on lighter models.
@TheUnamusedFox I made contexts work as work units in AnimateDiff-Evolved. You can test it out by simply pulling the latest ADE and adding these nodes:
The context_length is the maximum number of latents to use in a context, and the code will then split the work into an even number of contexts across max_gpus. Currently, the logic in that code will select contexts as the work units only if the 'scaling' is more even than when using conds as work units, but I will add a toggle somewhere in ADE to force context work units all the time.
Contexts are something I will eventually try to bring into core ComfyUI; there are a few more things I want to test out in ADE before I consider working on a PR for core ComfyUI, like selecting the dim of the tensors to subdivide (ADE currently uses the first dimension).
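For anyone curious, the context-splitting idea can be sketched like this (illustrative only; ADE's actual context scheduling handles overlaps and more):

```python
# Illustrative sketch of context work units (not ADE's actual code): split the latent
# indices (first dimension of the latent tensor) into windows of at most
# context_length, then give each GPU an even share of those windows.
def make_context_work_units(num_latents, context_length, max_gpus):
    contexts = [list(range(start, min(start + context_length, num_latents)))
                for start in range(0, num_latents, context_length)]
    # Round-robin the contexts across however many GPUs can actually get work.
    used_gpus = min(max_gpus, len(contexts))
    return [contexts[i::used_gpus] for i in range(used_gpus)]

# Example: 32 latents with context_length=16 on 2 GPUs -> one context per GPU.
print(make_context_work_units(32, 16, 2))
```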
Thank you very much, I will check that out right now!
@kunibald413 How did you resolve that issue? I'm running into the same with a fresh Comfy, using SageAttention 1 or 2. I'm on WSL 2 Ubuntu.
