ComfyUI
"Pinned_memory" feature is amazing!!!
Hi ComfyUI team! This is not an issue but positive feedback about recent changes in the ComfyUI code, in particular about --fast pinned_memory, but also about --async-offload.
My system is:
Windows 11
CPU: AMD Ryzen 7 5700X, 8 cores
RAM: 64 GB
GPU: RTX 5060 Ti 16 GB on PCIe Gen3 @ x8 (Gen3 is the maximum on my AM4 motherboard; x8 is the maximum on the RTX 5060 series)
Resizable BAR: enabled in UEFI and in NVIDIA Profile Inspector
Let's look at this Task Manager picture:
In the past, when GPU memory climbed even a tiny amount above my card's actual VRAM (15.x-16 GB), I would not get an OOM right away (thanks to the default NVIDIA driver behavior on Windows of falling back to system RAM), but any processing from that point onward would slow down to a frozen-jelly speed, taking forever to finish a VAE decode, for example. And that despite Task Manager showing 100% GPU utilization. You could tell that wasn't really the case by looking at the temperature: it would sit in the 44-47 C range, meaning the GPU was running at idle temperatures. Now look at the screenshot again and observe the temperature: 70 C. Amazing, I don't know what kind of sorcery this is, but it is very welcome! I also get the subjective feeling that my GPU is used more effectively even at points in the workflow where the actual hardware VRAM is not exceeded.
I am planning to propose a similar enhancement to the Llama.cpp project, if that is possible, since Llama.cpp has exactly the same behavior ComfyUI used to have: a loaded LLM model with 16.1 GB of VRAM usage would take forever to process any prompt, forcing me to use only less demanding models that occupy 8-12 GB of VRAM.
I would also like to say thanks for the changes to --async-offload. The performance and behavior on my system is finally better with it enabled. In the past I had issues with slow speed and/or noisy output in various workflows, so I had more or less permanently excluded --async-offload from my options. But now the situation seems much better. I haven't made any measurements, but with my usual workflows it at least doesn't seem to cause any downside. Since I enabled both options at the same time it's hard to tell what impact each one has separately, but I can say that the first run of a Wan 2.2 I2V workflow (using unquantized fp16 safetensors), which before the latest ComfyUI updates took about 19 minutes, now takes a little over 16 minutes. The difference is probably faster loading from disk, better RAM and VRAM utilization, and better GPU time utilization. At least, lacking objective benchmarks, that's what it feels like just watching Task Manager.
Again, many thanks to ComfyUI team!
pinned_memory accidentally fixes the slowdown that occurs when the memory estimation is a bit off on Windows, but we will work out that issue soon as well, since it can still happen with pinned_memory (just with less of a slowdown)! The current workaround when you experience that is to add --reserve-vram 1 or higher just to prevent the spillover into shared; it's the amount of VRAM the memory management system will try to leave free in addition to what it thinks the models need.
We will also see if we can speed up --async-offload even more. Thank you for taking the time to write this!
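For reference, a minimal launch line combining the options mentioned in this thread might look like this (illustrative only; the reserve amount is just an example, adjust it to your card):
python main.py --fast pinned_memory --async-offload --reserve-vram 1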
When I add --fast pinned_memory as a launch argument, Comfy refuses to load the .bat, claiming that pinned_memory is an unrecognised argument. Is this something that was added today and requires an update?
Yes, it got merged yesterday; you need to be on the latest master.
Ah, I was on the nightly build. I've just switched to stable (showing as 0.3.67), but the .bat still won't load with pinned_memory as an argument.
The current workaround when you experience that is to add --reserve-vram 1 or higher just to prevent the spillover into shared; it's the amount of VRAM the memory management system will try to leave free in addition to what it thinks the models need.
Yes, I played with --reserve-vram in the past, trying 1 and 2. It definitely improved things, but not with every workflow. In desperation I resorted to disabling GPU acceleration in Chrome and, in some cases, closing the frontend window altogether after starting the processing, keeping only the server console as feedback on how the prompt was going and reopening it after execution.
Because of you I tried the new parameters thinking I might get some boost; however, I got many exceptions and OOMs. I had to revert to a version from a month ago.
Going back in time might not be the best solution. Maybe it's worth mentioning that I generally try to use the latest iteration of every dependency in the ecosystem: latest NVIDIA driver, latest Windows 11 update, but also the latest nightly torch 2.10.0 + CUDA 13. Also, wherever I can, I build my own wheels from the latest git, so I run my own builds of xformers, Flash Attention 2.8.3 and SageAttention 2.2.0, all built with the latest CUDA Toolkit 13.0.2. Edit: I also wiped Microsoft Defender Antivirus from existence a long time ago using https://github.com/ionuttbara/windows-defender-remover.git . It was very satisfying 😄
Maybe you are right, I will try the newest packaged version.
I was getting cuda OOMs at first when trying this out, though it seemed to only happen when using GGUF models. Base models seemed to work alright.
I think it is PyTorch's problem: 2.8 uses more VRAM. I updated before and got OOMs, so I had to revert to 2.6.
The most efficient offload approach I have seen is in Nunchaku: when their models run, VRAM is full and a lot of weights are offloaded to shared VRAM.
The speed is not slowed down by shared VRAM, while most ComfyUI workflows become extremely slow when models spill into shared VRAM.
Sorry if this is a dumb question, but is --async-offload a replacement for --cuda-malloc? I noticed that when I use both --cuda-malloc and --async-offload at the same time, the Comfy logs only show CudaMallocAsync
pytorch version: 2.9.0+cu128
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 3080 : cudaMallocAsync
working around nvidia conv3d memory bug.
Using sage attention
I'm not sure; the only thing that comes to my mind related to what you are describing is this commit: 5b80addafd24bda5b2f9f7a35e32dbd40823c3fd. Meaning that if you use --fast without specifying any option, it enables all of its options, autotune among them, which turns off cuda malloc according to that commit. I know you are talking about the other way around, cuda malloc disabling --async-offload (which has nothing to do with --fast), but that's all I could think of.
Edit:
I use --fast fp16_accumulation fp8_matrix_mult pinned_memory --async-offload, I don't specify --cuda-malloc but it still shows it's using cudaMallocAsync.
xformers version: 0.0.33+e98c69b.d20251102
Enabled fp16 accumulation.
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 5060 Ti : cudaMallocAsync
Using async weight offloading with 2 streams
working around nvidia conv3d memory bug.
Using sage attention
Python version: 3.13.9 (main, Oct 31 2025, 22:58:20) [MSC v.1944 64 bit (AMD64)]
ComfyUI version: 0.3.67
Initializing frontend: Comfy-Org/ComfyUI_frontend@latest, requesting version details from GitHub...
[Prompt Server] web root: C:\ComfyUI\web_custom_versions\Comfy-Org_ComfyUI_frontend\1.32.1
'resemble-perth' not found. Watermarking will be unavailable.
ColorMod: Ignoring node 'CV2TonemapDurand' due to cv2 edition/version
Using sage attention
Pinned memory is now enabled by default? I just found this in the console: Enabled pinned memory 58940.0
Total VRAM 24564 MB, total RAM 130980 MB
pytorch version: 2.9.0+cu130
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+cu130)
Python 3.9.13 (you have 3.12.7)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
Set vram state to: NORMAL_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 : cudaMallocAsync
Enabled pinned memory 58940.0
working around nvidia conv3d memory bug.
Using pytorch attention
Python version: 3.12.7 (tags/v3.12.7:0b05ead, Oct 1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
ComfyUI version: 0.3.68
ComfyUI frontend version: 1.28.8
Just updated from 2.7.0+cu126 and I think I instantly got a 30% increase in generation speed.
So, there is no need to use pinned_memory as a launch argument? How about --async-offload?
Thank you!
Yes, pinned memory is now enabled by default; the logic is now reversed, so you have the option to disable it in case you have problems with it. --async-offload still needs to be specified as an argument. (Edit: it is needed only if you want async offload enabled; it is not needed for, or related to, pinned memory.)
Seeing that your xformers installation is non-functional (incompatible with your current torch), I would suggest compiling it from source; it's not very difficult.
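Edit: if you're curious what pinned (page-locked) memory actually means at the PyTorch level, here is a quick generic check you can run from a console with your ComfyUI Python environment active (plain PyTorch, nothing ComfyUI-specific; it should print True on a working CUDA install):
python -c "import torch; t = torch.empty(8, pin_memory=True); print(t.is_pinned())"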
@jovan2009 Thank you very much for the answer. I already solved xformers. I had stopped compiling because I didn't know why pinned_memory was enabled by default and didn't know what to do with the --async-offload argument. Now I understand. Thanks again for the clarifications.
@adydeejay Since you updated your torch to a CUDA 13 version, I would also recommend trying the 2.10.0 nightly; it works well on my system and probably has some issues resolved compared to 2.9.0.
@jovan2009 Interesting. I also have xformers uninstalled because I was getting constant Comfy crashes with a hopper error and even disabling it via launch arguments didn't seem to help. Has this issue now been resolved so that manually compiling xformers is viable?
I'm also running PyTorch 2.9+cu130. I've tried nightly builds in the past and faced a lot of compatibility issues, though currently one of the main issues I'm facing even with stable 2.9 is that nothing I do seems to get Nunchaku working.
So you would suggest PyTorch 2.10 nightly +cu130 might be better to try here? Also, just to clarify, is it advisable to run the async offload launch argument? I'm not 100% sure what exactly it is and what benefits or disadvantages it offers. From what has been said above, pinned memory is now loaded as standard without the need for launch arguments. I'm still trying to work out which launch arguments are genuinely recommended for my setup.
I am running the latest ComfyUI Portable on Windows 11, with an RTX 5070 (12 GB VRAM) and 64 GB of DDR4 RAM.
@PhoenixKnight87
For me using this command from the xformers readme works:
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers
provided I have a suitable Cuda toolkit installed + MSVC and I disable the compiling of Flash Attention using
SET XFORMERS_DISABLE_FLASH_ATTN=1
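Put together, the whole build session is just these two lines in the same console (assuming the suitable CUDA toolkit and MSVC mentioned above are already installed):
SET XFORMERS_DISABLE_FLASH_ATTN=1
pip install -v --no-build-isolation -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers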
It compiles xformers on the fly and installs it over whatever xformers installation I had before. I have compiled Flash Attention 2.8.3 separately from its latest git, and apparently xformers acknowledges it and (hopefully) uses it for its operations:
C:\Users\Sidef\Desktop>python -m xformers.info
C:\Python313\install\Lib\site-packages\torch\library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at C:\Python313\install\Lib\site-packages\torch\_library\custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at C:\Python313\install\Lib\site-packages\torch\_library\custom_ops.py:922 (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\pytorch\aten\src\ATen\core\dispatch\OperatorEntry.cpp:215.)
self.m.impl(
xFormers 0.0.33+e98c69b.d20251108
memory_efficient_attention.ckF: unavailable
memory_efficient_attention.ckB: unavailable
memory_efficient_attention.ck_splitKF: unavailable
memory_efficient_attention.cutlassF-pt: available
memory_efficient_attention.cutlassB-pt: available
memory_efficient_attention.cutlassF-blackwell: unavailable
memory_efficient_attention.cutlassB-blackwell: unavailable
[email protected]: available
[email protected]: available
[email protected]: unavailable
[email protected]: unavailable
[email protected]: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
sp24.sparse24_sparsify_both_ways: available
sp24.sparse24_apply: available
sp24.sparse24_apply_dense_output: available
sp24._sparse24_gemm: available
[email protected]: available
[email protected]: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.10.0.dev20251109+cu130
pytorch.cuda: available
gpu.compute_capability: 12.0
gpu.name: NVIDIA GeForce RTX 5060 Ti
dcgm_profiler: unavailable
build.info: available
build.cuda_version: 1300
build.hip_version: None
build.python_version: 3.13.9
build.torch_version: 2.10.0.dev20251108+cu130
build.env.TORCH_CUDA_ARCH_LIST: 12.0
build.env.PYTORCH_ROCM_ARCH: None
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: -allow-unsupported-compiler -O3 --extra-device-vectorization --default-stream per-thread --use_fast_math --forward-unknown-opts --forward-slash-prefix-opts --ptxas-options=-O3,--allow-expensive-optimizations=true,--optimize-float-atomics,--register-usage-level=10
build.env.XFORMERS_PACKAGE_FROM: None
build.nvcc_version: 13.0.88
source.privacy: open source
Flash Attention 3 is still a bridge too far for me to compile successfully, and will probably continue to be for a long time, since it is designed to run specifically on Hopper (from what I can understand with my non-developer brain).
Edit: about async offload, you would have to test for yourself whether it is best on or off. In the past (weeks and months ago) it was giving me noisy output in various workflows and I "blacklisted" it from my command options for a long time. The current iteration doesn't give me such problems, so it's always enabled, but I didn't run a comparative benchmark to measure its positive or negative impact; all I know is that it doesn't break my workflows (I run mostly WAN 2.2 I2V unquantized fp16 + 4-step LoRAs).
Edit 2: Also, sometimes I find it useful to browse or search ComfyUI's source code for particular settings. Although my code-reading skills are just a little bit higher than my cat's, it sometimes helps me figure out what to expect from one setting or another.
Edit 3: Also python main.py --help is my friend.
Edit 4: regarding the use of the latest nightly torch, my philosophy is to use the latest available version compiled with the latest available CUDA unless something breaks catastrophically and in an obvious way. I had such situations especially when 2.9.0 was the latest nightly: not being able to start ComfyUI at all and filling my system disk with gigabytes of kernel error logs. But it's been a while since that happened. Looking at the amount of daily commits in the PyTorch repository, I had to decide between waiting months until a stable version is released or skipping that and trying everything as soon as possible, with the risk that someday I'll hit a failure, which is easily solvable by reverting to the previous day's nightly build.
Edit 5: this "forever beta testing" behavior allows me to quickly pinpoint regressions and maybe open issues at the right repository, when I've figured out the cause and/or the point in time when a problem started to happen.
Edit 6: you mentioned Nunchaku, but I am not a Nunchaku user. I haven't managed to find the time to study and understand how it works. At some point I installed multiple successive Python wheels and the custom node, but the node always complained that the wheel I had installed was not compatible and refused to load. I planned to compile Nunchaku myself to see whether that's possible for me, but I had more urgent things to do, so I never really did.
pinned_memory accidentally fixes the slowdown that occurs when the memory estimation is a bit off on Windows, but we will work out that issue soon as well, since it can still happen with pinned_memory (just with less of a slowdown)! The current workaround when you experience that is to add --reserve-vram 1 or higher just to prevent the spillover into shared; it's the amount of VRAM the memory management system will try to leave free in addition to what it thinks the models need. We will also see if we can speed up --async-offload even more. Thank you for taking the time to write this!
I achieved dynamic adjustment of reserved VRAM using a custom node; it can reserve automatically, XD. https://github.com/Windecay/ComfyUI-ReservedVRAM
@Windecay
It definitely sounds interesting and I intend to try it more when I have more time. I need to find a workflow that gives me that spillover effect 100% of the time, and lately, with the latest ComfyUI changes, it doesn't happen, at least not in an obvious way. It used to happen especially with WanVideoWrapper workflows, and for that reason I have limited myself in the last weeks/months to core ComfyUI nodes only, so I don't have at hand a recent WanVideoWrapper workflow that I know for sure gives me that problem. I need to find or make one with the safetensors models I currently have on disk and test-drive your custom node properly. For now all I can say is that I tried it with the workflow I'm currently working on, and I'm 150% sure I used your node the wrong way: I set reserved VRAM to 0 instead of the default 0.6 and I set clean_gpu_before to "disabled". My logic was to test its "automatic" behavior. I got a ComfyUI server crash and an NVIDIA driver crash-and-recover in the middle of prompt execution. I don't have time to test further right now, but I definitely will in the near future with probably saner settings, once I read the readme at your repository more carefully.
It simply dynamically modifies an environment variable (EXTRA_RESERVED_VRAM). I incorporated the characteristics of random seeds to enable it to refresh itself. I don't have AMD or Intel graphics cards to perfect it. I've applied it to almost all of my workflows and achieved good management. I may use Photoshop or DaVinci at work, but now I can keep them open while using ComfyUI.
It definitely sounds interesting. I too am currently refraining from opening or using anything else too much during prompt execution; I even disabled GPU acceleration in Chrome, fearing the moment the VRAM would spill over. I will try later with more reserved VRAM instead of 0, which was clearly wrong.
@jovan2009
Any tips or links to a step-by-step tutorial on how to compile all that stuff? I'm currently stuck on building the latest PyTorch. Everything is mostly fine, except for -- USE_FLASH_ATTENTION : OFF. I can't figure out how to compile with Flash Attention support.
@firsack You are trying to build PyTorch? Unless you meant something else and wrote PyTorch by mistake, good luck with that. I tried to build PyTorch once at some point (a few months ago). After gathering and installing everything required, it compiled "successfully" (meaning without errors), but I couldn't use it: it would not start, complaining that it couldn't find the dependencies for one of its DLLs. I thought: how is this possible? Since that DLL was generated during compilation, if its dependencies were missing, why didn't the compilation fail? Anyway, I came to the conclusion that compiling PyTorch is probably an art in itself and never tried again. I only use the PyTorch nightly builds from the repository itself.
About the rest (xformers and Flash Attention), I will write here what I do in more detail when I get to my computer, if you still need it.
Edit: I build xformers without Flash Attention support; building it with Flash Attention seems impossible for me, it always ends in an error. I simply type SET XFORMERS_DISABLE_FLASH_ATTN=1 on the command line and hit Enter before copy-pasting the compile command from the xformers readme I mentioned before, or I include it in a .bat file that I run before compilation. Flash Attention 2.8.3 I build separately from https://github.com/Dao-AILab/flash-attention.git
Edit 2: to be clearer, the only way I can build xformers successfully with most of its features enabled is by using that command from the readme, which does the compilation on the fly (meaning it clones the repository into the temp folder, builds it there, installs the result and then deletes all the temporary files it created). If I git clone the xformers repository myself into a folder, I manage to compile it, but the resulting wheel has most of the extensions disabled and is practically unusable. I don't know why there is such a difference; I planned to ask about it at the xformers repository but never have so far.
Edit 3: in general, when I try to build a project from GitHub and the results are not what I expect, I try to find its environment variables. If they are not obviously specified in setup.py, I search the whole repository for anything containing "env"; that way I can find whether any environment variables are used or checked anywhere in the code base. (That's where my non-developer, non-native-English code-reading skills have gotten me so far; maybe there are better ways, but they elude me.)
Edit 4: in case I wasn't clear enough: I build xformers WITHOUT Flash Attention support (meaning it doesn't compile its bundled Flash Attention). But because I also build Flash Attention from its git, xformers sees it and acknowledges it when I run python -m xformers.info. And hopefully Flash Attention is really used when I run ComfyUI; I have no idea how to check whether that's really the case.
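Edit 5: for completeness, the way I build Flash Attention from its repository is roughly this (an illustrative sketch, not exact steps; MAX_JOBS is an upstream build knob I set to keep RAM usage in check, tune or drop it for your machine):
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
SET MAX_JOBS=4
pip install . --no-build-isolation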
@jovan2009
Yes, I'm trying to build PyTorch. I've been trying to compile it since I saw your pytorch.version: 2.10.0.dev20251109+cu130. I thought there was no other way to install the latest nightly version besides compiling it yourself? I guess I was wrong and wasted 6 hours on a useless endeavor?
Is there an easy way to install the latest 2.10.0.dev20251109+cu130?
@firsack Yes, look here: https://pytorch.org/get-started/locally/
pip3 install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu130
I would also add torchaudio to that command; they all have to be "on the same page".
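So the full command would be:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130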
I feel your pain; when I compiled PyTorch myself I had to leave the computer running all night only to find out it was for nothing.
@firsack BTW, since you are already seasoned in compiling "stuff", I would also recommend compiling SageAttention and SpargeAttn from this repository: https://github.com/woct0rdho
It's pretty easy (meaning I have no trouble compiling by following the readme; SageAttention3 compiles too, but it always fails its own test). The compile time is minutes, not hours.
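For SageAttention the steps are roughly the following (a sketch from memory; the repository name and exact install command here may differ, so check the readme there for the authoritative steps):
git clone https://github.com/woct0rdho/SageAttention.git
cd SageAttention
pip install . --no-build-isolation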
From that repository I would also like to be able to compile Triton, but so far I haven't managed to follow the readme and build it successfully; it has a pretty complicated list of dependencies and the build fails very easily at any point.
The longest compile time among the projects I build myself is Flash Attention 2.8.3, about an hour, give or take.
(Also, llama.cpp takes a little bit longer, but it is not related to ComfyUI; I mention it just as a detail.)
Guys, thank you for bringing additions and value to this topic. For anyone interested, I'm posting my installation log for Windows 10 64-bit + NVIDIA GeForce RTX 4090 + ComfyUI v0.3.68. I hope it helps (at least some of you).
STEP 1 - CUDA + CUDNN
First you need to install CUDA 13.0 and cuDNN 9.15. Then you need to check all the PATHs in the Windows Environment Variables:
CUDA_HOME: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0
CUDA_PATH: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0
Path: D:\ComfyUI_windows_portable\python_embeded\Scripts;D:\ComfyUI_windows_portable\python_embeded;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0\bin\..\extras\CUPTI\lib64;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0\bin;C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.0\libnvvp;C:\Program Files\NVIDIA\CUDNN\v9.15\bin;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\Nsight Compute 2024.3.2;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;C:\WINDOWS\System32\WindowsPowerShell\v1.0;C:\WINDOWS\System32\OpenSSH;C:\Program Files (x86)\AOMEI\AOMEI Backupper\7.3.1;C:\Program Files\Git\cmd;C:\Program Files\dotnet;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler;%USERPROFILE%.dotnet\tools;
CUDNN_PATH C:\Program Files\NVIDIA\CUDNN\v9.15
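A quick optional check that the toolkit being picked up from PATH is the one you just installed:
nvcc --version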
STEP 2 - triton-windows
PS C:\WINDOWS\system32> d:
PS D:\> cd D:\ComfyUI_windows_portable\python_embeded
PS D:\ComfyUI_windows_portable\python_embeded> pip install triton-windows
Collecting triton-windows
Downloading triton_windows-3.5.0.post21-cp312-cp312-win_amd64.whl.metadata (1.8 kB)
Downloading triton_windows-3.5.0.post21-cp312-cp312-win_amd64.whl (47.3 MB)
---------------------------------------- 47.3/47.3 MB 54.7 MB/s 0:00:01
Installing collected packages: triton-windows
Successfully installed triton-windows-3.5.0.post21
STEP 3 - sageattention
This command is not working on my system:
PS D:\ComfyUI_windows_portable\python_embeded> pip install sageattention==2.2.0 --no-build-isolation
...... lots of errors ....
ERROR: Failed building wheel for sageattention
Failed to build sageattention
error: failed-wheel-build-for-install
so I downloaded: sageattention-2.2.0+cu130torch2.9.0.post3-cp39-abi3-win_amd64.whl from: https://huggingface.co/Wildminder/AI-windows-whl/tree/main
PS D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages> pip install sageattention-2.2.0+cu130torch2.9.0.post3-cp39-abi3-win_amd64.whl
Processing d:\comfyui_windows_portable\python_embeded\lib\site-packages\sageattention-2.2.0+cu130torch2.9.0.post3-cp39-abi3-win_amd64.whl
Installing collected packages: sageattention
Attempting uninstall: sageattention
Found existing installation: sageattention 1.0.6
Uninstalling sageattention-1.0.6:
Successfully uninstalled sageattention-1.0.6
Successfully installed sageattention-2.2.0+cu130torch2.9.0.post3
STEP 4 - xformers
Download: xformers-0.0.33+cu130torch2.9-cp39-abi3-win_amd64.whl
https://huggingface.co/Wildminder/AI-windows-whl/tree/main
PS D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages> pip install xformers-0.0.33+cu130torch2.9-cp39-abi3-win_amd64.whl
......
Successfully uninstalled xformers-0.0.32.post2
Successfully installed xformers-0.0.33+c2407a6.d20251023
Starting ComfyUI I got: Prompt executed in 95.49 seconds
Still have an error: Could not find the bitsandbytes CUDA binary at WindowsPath('D:/ComfyUI_windows_portable/python_embeded/Lib/site-packages/bitsandbytes/libbitsandbytes_cuda130.dll')
STEP 5 - bitsandbytes
https://github.com/bitsandbytes-foundation/bitsandbytes/releases/tag/continuous-release_main
PS D:\ComfyUI_windows_portable\python_embeded\Lib\site-packages> pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-win_amd64.whl
Collecting bitsandbytes==1.33.7rc0
...........
Successfully installed MarkupSafe-3.0.3 bitsandbytes-0.49.0.dev0 filelock-3.20.0 fsspec-2025.10.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-2.3.4 packaging-25.0 setuptools-80.9.0 sympy-1.14.0 torch-2.9.0 typing-extensions-4.15.0
Starting ComfyUI I got: AssertionError: Torch not compiled with CUDA enabled
STEP 6 - REINSTALL torch
PS D:\ComfyUI_windows_portable\python_embeded> pip uninstall torch
....
Successfully uninstalled torch-2.9.0
I don't recommend using:
PS D:\ComfyUI_windows_portable\python_embeded> pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu130
You'll get:
pytorch version: 2.10.0.dev20251111+cu130
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 2.9.0+cu130 with CUDA 1300 (you have 2.10.0.dev20251111+cu130)
+ RuntimeError: operator torchvision::nms does not exist
[W1112 11:25:21.000000000 AllocatorConfig.cpp:28] Warning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (function operator ())
Instead you can use:
PS D:\ComfyUI_windows_portable\python_embeded> pip install torch==2.9.0+cu130 torchvision==0.24.0+cu130 torchaudio==2.9.0+cu130 --extra-index-url https://download.pytorch.org/whl/cu130
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu130
Collecting torch==2.9.0+cu130
.......
Successfully installed torch-2.9.0+cu130 torchaudio-2.9.0+cu130 torchvision-0.24.0+cu130
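A quick sanity check that the CUDA build is the one actually active (generic PyTorch, not ComfyUI-specific; it should print the version and True):
PS D:\ComfyUI_windows_portable\python_embeded> python -c "import torch; print(torch.__version__, torch.cuda.is_available())"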
STEP 7 - START COMFYUI
ComfyUI-Manager: installing dependencies done.
[2025-11-12 11:41:22.749] ** ComfyUI startup time: 2025-11-12 11:41:22.749
[2025-11-12 11:41:22.749] ** Platform: Windows
[2025-11-12 11:41:22.749] ** Python version: 3.12.7 (tags/v3.12.7:0b05ead, Oct 1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
[2025-11-12 11:41:22.753] ** Python executable: D:\ComfyUI_windows_portable\python_embeded\python.exe
[2025-11-12 11:41:22.753] ** ComfyUI Path: D:\ComfyUI_windows_portable\ComfyUI
[2025-11-12 11:41:22.753] ** ComfyUI Base Folder Path: D:\ComfyUI_windows_portable\ComfyUI
[2025-11-12 11:41:22.753] ** User directory: D:\ComfyUI_windows_portable\ComfyUI\user
[2025-11-12 11:41:22.753] ** ComfyUI-Manager config path: D:\ComfyUI_windows_portable\ComfyUI\user\default\ComfyUI-Manager\config.ini
[2025-11-12 11:41:22.753] ** Log path: D:\ComfyUI_windows_portable\ComfyUI\user\comfyui.log
Prestartup times for custom nodes:
[2025-11-12 11:41:25.304] 0.0 seconds: D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\rgthree-comfy
[2025-11-12 11:41:25.304] 0.0 seconds: D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui-easy-use
[2025-11-12 11:41:25.304] 5.7 seconds: D:\ComfyUI_windows_portable\ComfyUI\custom_nodes\comfyui-manager
[2025-11-12 11:41:28.101] Checkpoint files will always be loaded safely.
[2025-11-12 11:41:28.259] Total VRAM 24564 MB, total RAM 130980 MB
[2025-11-12 11:41:28.259] pytorch version: 2.9.0+cu130
[2025-11-12 11:41:30.906] xformers version: 0.0.33+c2407a6.d20251023
[2025-11-12 11:41:30.906] Set vram state to: NORMAL_VRAM
[2025-11-12 11:41:30.906] Device: cuda:0 NVIDIA GeForce RTX 4090 : cudaMallocAsync
[2025-11-12 11:41:30.925] Enabled pinned memory 58940.0
[2025-11-12 11:41:30.957] working around nvidia conv3d memory bug.
[2025-11-12 11:41:31.325] Using xformers attention
[2025-11-12 11:41:34.296] Python version: 3.12.7 (tags/v3.12.7:0b05ead, Oct 1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]
[2025-11-12 11:41:34.296] ComfyUI version: 0.3.68
[2025-11-12 11:41:34.352] ComfyUI frontend version: 1.28.8
Starting ComfyUI I got (using Flux Dev + 3 loras + PulID + Reflux): Prompt executed in 91.32 seconds
STEP 8 - safety check
I recommend scanning packages after any node update or installation; many come with vulnerabilities. You can fix them by uninstalling and installing the minimum recommended version of each package, to avoid causing damage to ComfyUI, e.g.: pip install aiohttp==3.12.14
PS D:\ComfyUI_windows_portable\python_embeded> safety check
REPORT: Safety v3.6.2 is scanning for Vulnerabilities...
Scanning dependencies in your environment:
No known security vulnerabilities reported.
MORE INFO
https://github.com/woct0rdho/triton-windows/releases/v3.0.0-windows.post1/
https://www.reddit.com/r/comfyui/comments/1lg4wjp/can_anyone_help_explain_this_error_sageattention/
https://github.com/bitsandbytes-foundation/bitsandbytes/releases/tag/continuous-release_main
https://www.facebook.com/groups/comfyui/permalink/827995866639782/
https://www.reddit.com/r/comfyui/comments/1kzkh73/sageattention_upgrade_getting_a_not_a_supported/
@adydeejay Thanks for your info. Just some observations after a quick (admittedly very shallow) look at your comment:
- Cudnn version is outdated (current version is 9.15, not 9.6)
- Why build or use wheels from repositories other than the original projects? I build using the current xformers git, and for SageAttention I use woct0rdho's git (it is not the "original" repository, but it is the original fork that makes compiling on Windows possible).