Slowness with LoRA and ControlNet (any ControlNet model) for Flux
Expected Behavior
No high VRAM usage and no extreme slowness with ControlNet.
Actual Behavior
Technical details: latest version of ComfyUI, RTX 3090, PyTorch 2.1, CUDA 12.1, Windows.
I currently use ComfyUI in production and this is really blocking: stacking multiple LoRAs above rank 32 on top of Flux is extremely VRAM hungry, and using any ControlNet with ControlNetApplyAdvanced (or even the SD3/HunyuanDiT one) is extremely slow.
ComfyUI is currently not stable with my configuration (Windows is not a choice).
In case it matters, using GGUF doesn't help at all, since it is 1.8 times slower and ControlNet support is not working for all models.
Steps to Reproduce
Technical details: latest version of ComfyUI, RTX 3090, PyTorch 2.1, CUDA 12.1, Windows.
Just use any ControlNet, or stack high-rank LoRAs.
Debug Logs
None
Other
None
Update your PyTorch to at least 2.3 and your NVIDIA drivers to the latest version.
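For anyone unsure what their environment is actually running, one quick check (a sketch assuming the standalone Windows package; adjust the path for a manual install) is to ask the embedded Python which PyTorch and CUDA build it sees:

```
.\python_embeded\python.exe -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"
```

If that prints a version below 2.3, the update did not actually take effect in the environment ComfyUI is using.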
I updated to 2.3.1 and the latest driver, and the exact same issue occurs: very high VRAM usage (8 GB for a rank-64 LoRA) and extremely slow generation with ControlNet.
Are you sure? Try downloading the latest standalone package from the README.
Downloaded the standalone package, updated everything, and now it's 34.99 s/it... previously it was 1.24 it/s. My system has a 4070 Ti, 64 GB DDR5 RAM, and a Core i7-14700K. How do I revert to the previous version?
Also noticed that loading more than one LoRA file increases the generation time by 10 to 15 seconds per iteration.
2080 Ti 11 GB, Windows. Made sure PyTorch and the NVIDIA drivers were updated, reinstalled from the README, and got the same result as my updated install.
I've had this issue for a little while now. I wish I knew which update changed it, but I didn't keep track (the ComfyUI version number isn't in plain sight). If I had to guess, I'd say it was within the last three updates: it didn't come with this last one and didn't come with the one before it, so it was before that. Same story as the rest: I had no problems generating images with Flux and a LoRA, but now having one LoRA kills it. Roughly 14 minutes for a single image. It does work, just very, very slowly.
I looked at the issues and saw it was reported 4 or 5 days ago, so I've just been patient. As a dev myself I recognize this "Are you sure?", so I'm just letting you know that there are others who have followed your steps and this odd issue persists.
I'm experiencing the same issue. The official ControlNet workflow runs fine with some VRAM to spare. However, as soon as I add an 18M LoRA to the workflow, VRAM usage immediately explodes.
Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 22.47 GiB
Requested : 72.00 MiB
Device limit : 23.99 GiB
Free (according to CUDA) : 0 bytes
PyTorch limit (set by user-supplied memory fraction) : 17179869184.00 GiB
Can you check if things have improved on the latest commit?
I have the same problem: I can't use two LoRAs at the same time, and it slows down a lot on a 4070 Ti Super. With one LoRA it is also slower than normal. I'm using flux.dev16.
Can you check if things have improved on the latest commit?
It's the exact same issue. On the latest commit you did for fp8 LoRA, I used your 2.0 version, and it's the exact same issue, maybe even a little slower.
The issue you're experiencing is related to shared memory. The best solution is to configure your GPU driver so it does not spill into shared system memory when VRAM fills up. If this isn't possible, you should use the --disable-smart-memory option to minimize VRAM usage. The next option to consider is the --reserve-memory option.
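For reference, this is roughly what launching with those options looks like for the standalone Windows package (a sketch only: the path is an example, the 2 GB reserve value is arbitrary, and recent ComfyUI builds document the reservation flag as --reserve-vram, which may be what --reserve-memory refers to above):

```
.\python_embeded\python.exe -s ComfyUI\main.py --disable-smart-memory --reserve-vram 2
```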
I was having horrendous slowdown issues with the previous portable release, sometimes multiple minutes per iteration, which made batch running impossible. However, updating to the latest release v0.2.2, with its PyTorch cu124 update, has me back down to 2.6 s/it.
7950X, 64 GB DDR5, RTX 3080 10 GB.
Might fix others' issues too?
I ran into the same problem a few days ago. I wanted to test a self-trained LoRA by generating 50 different images. Here's what helped me increase the speed and make it more predictable. It still isn't perfect, but it works for a bunch of images until it gets stuck again.
- Deactivate cuda_malloc memory allocation, found under Settings > Server (see the launch-flag sketch below).
- Using the KSampler Advanced, set its end_at_step value to the default 1000, but keep the steps at the desired value (I usually use ~20).
- In case you are using the DWPose Estimator, it helps to add the Save Pose Keypoints node.
- Keep ControlNet images at reasonable sizes (max dimensions under 600 px work fine for me).
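As a command-line alternative to the Settings > Server toggle, cuda_malloc can also be turned off at launch (a sketch for the standalone Windows package; the path is an example):

```
.\python_embeded\python.exe -s ComfyUI\main.py --disable-cuda-malloc
```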
Altogether I could decrease rendering time to between 1.5 and 4.5 s/it. It still varies wildly, but at least it now works more or less stably.
Update: I achieved further improvements by using the ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF CLIP model and flan_t5_xxl_TE-only_Q8_0.gguf instead of the standard Flux dev versions. This improved speed, decreasing time down to 1.5 s/it, as well as the overall quality of fine details and prompt stability, especially limbs. It also saves a few MB of VRAM...
Update 2: Switching to the FLUX.1-dev-ControlNet-Union-Pro-2.0 ControlNet model and integrating the nunchaku-flux.1-dev diffusion model (https://huggingface.co/nunchaku-tech/nunchaku-flux.1-dev), which is basically an SVDQuant-quantized INT4 FLUX.1-dev model, solved the problem once and for all. I'm now at ~1 s/it, which is tremendously good for an RTX 3080, while batch generation is stable and reliable. It even saves roughly 2 GB of VRAM on my system.
RTX 3080 Laptop, 16 GB VRAM, ComfyUI v0.3.67, PyTorch 2.8.0+cu129, Flux1 dev fp8, Flux1 dev ControlNet Union
