
Slowness with lora and control net (any control net model) for Flux

Open axel578 opened this issue 1 year ago • 11 comments

Expected Behavior

No high VRAM usage, and no extreme slowness with controlnet

Actual Behavior

Technical details: latest version of ComfyUI, RTX 3090, PyTorch 2.1, CUDA 12.1, Windows.

I currently use ComfyUI in production and this is really blocking: stacking multiple LoRAs above rank 32 on top of Flux is extremely VRAM hungry, and using any ControlNet with ControlNetApplyAdvanced (or even the SD3/HunyuanDiT one) is extremely slow.

ComfyUI is currently not stable with my configuration (Windows is not a choice for me).

In case it comes up: using GGUF doesn't help at all, since it is 1.8 times slower and ControlNet support does not work for all models.
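As a rough back-of-the-envelope illustration of how LoRA rank translates into extra weight memory, here is a small sketch; the layer count and shapes are hypothetical placeholders, not Flux's real architecture:

```python
# Rough estimate of the extra parameters/VRAM a LoRA adds, assuming
# hypothetical layer shapes -- not Flux's actual layout.
def lora_bytes(layers, rank, bytes_per_param=2):  # 2 bytes ~ fp16/bf16
    total_params = 0
    for out_features, in_features in layers:
        # A LoRA patch is two matrices per layer: (out x rank) and (rank x in).
        total_params += rank * (out_features + in_features)
    return total_params * bytes_per_param

# Example: 300 hypothetical 3072x3072 linear layers.
layers = [(3072, 3072)] * 300
for r in (16, 32, 64):
    gib = lora_bytes(layers, r) / 2**30
    print(f"rank {r}: ~{gib:.2f} GiB just for the LoRA weights")
```

Even at rank 64 the LoRA tensors themselves only come to a few hundred MB in this toy estimate, so multi-GB spikes like the ones reported below likely come from how the base weights are copied or upcast while the LoRA is applied, rather than from the LoRA weights alone.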

Steps to Reproduce

Technical details: latest version of ComfyUI, RTX 3090, PyTorch 2.1, CUDA 12.1, Windows.

Just use any ControlNet, or stack high-rank LoRAs.

Debug Logs

None

Other

None

axel578 avatar Sep 02 '24 20:09 axel578

Update your pytorch to at least 2.3 and your nvidia drivers to the latest.

comfyanonymous avatar Sep 02 '24 20:09 comfyanonymous

Update your pytorch to at least 2.3 and your nvidia drivers to the latest.

I updated to 2.3.1 and the latest driver, and the exact same issue occurs: very high VRAM usage (8 GB for a rank-64 LoRA) and extremely slow generation with ControlNet.
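A quick way to confirm which PyTorch/CUDA build ComfyUI is actually running on (worth checking because the standalone package ships its own embedded Python, so upgrading a system install may not affect it):

```python
import torch

# Quick check that the PyTorch / CUDA upgrade actually took effect
# in the environment ComfyUI is launched from.
print("torch    :", torch.__version__)        # expect >= 2.3
print("cuda     :", torch.version.cuda)       # CUDA version PyTorch was built with
print("device   :", torch.cuda.get_device_name(0))
print("vram GiB :", torch.cuda.get_device_properties(0).total_memory / 2**30)
```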

axel578 avatar Sep 02 '24 22:09 axel578

Are you sure? Try downloading the latest standalone package from the readme.

comfyanonymous avatar Sep 02 '24 22:09 comfyanonymous

Downloaded the standalone package, updated everything, and now it's 34.99 s/it... previously it was 1.24 it/s. My system has a 4070 Ti, 64 GB DDR5 RAM, and a Core i7-14700K. How do I revert to the previous version?

Also noticed that loading more than one LoRA file increases the generation time by 10 to 15 seconds per iteration.

vivek-kumar-poddar avatar Sep 02 '24 23:09 vivek-kumar-poddar

2080 Ti 11 GB, Windows. Made sure PyTorch and the NVIDIA drivers were updated, reinstalled from the readme, and got the same result as my updated install.

I've had this issue for a little while now. I wish I knew which update changed it, but I didn't keep up with it (the Comfy version number isn't in plain sight). If I had to guess, I'd say it was within the last 3 updates. It didn't happen with this last one and didn't happen with the one before it; it was before that. Same story as the rest: I had no problems generating images with Flux and a LoRA, but now having one LoRA kills it. Roughly 14 minutes for a single image. It does work, just very, very slowly.

I looked at the issues and saw it was reported 4 or 5 days ago, so I've just been patient. As a dev myself I recognize that "Are you sure?", so just letting you know there are others who have followed your steps and this odd issue persists.

JunesiPhone avatar Sep 03 '24 02:09 JunesiPhone

I'm experiencing the same issue. The official ControlNet workflow runs fine with some VRAM to spare. However, as soon as I add an 18M LoRA to the workflow, VRAM usage immediately explodes.

Allocation on device 0 would exceed allowed memory. (out of memory)
Currently allocated : 22.47 GiB
Requested : 72.00 MiB
Device limit : 23.99 GiB
Free (according to CUDA) : 0 bytes
PyTorch limit (set by user-supplied memory fraction) : 17179869184.00 GiB
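For reference, the "PyTorch limit (set by user-supplied memory fraction)" line in that error corresponds to torch.cuda.set_per_process_memory_fraction. A minimal diagnostic sketch (not ComfyUI's own code) for printing the same numbers the error reports:

```python
import torch

# Minimal diagnostic sketch: inspect the same numbers the CUDA OOM error reports.
device = torch.device("cuda:0")
free, total = torch.cuda.mem_get_info(device)     # bytes free / total on the GPU
allocated = torch.cuda.memory_allocated(device)   # bytes held by live PyTorch tensors
reserved = torch.cuda.memory_reserved(device)     # bytes held by the caching allocator

print(f"device limit : {total / 2**30:.2f} GiB")
print(f"allocated    : {allocated / 2**30:.2f} GiB")
print(f"reserved     : {reserved / 2**30:.2f} GiB")
print(f"free (CUDA)  : {free / 2**30:.2f} GiB")

# Optionally cap PyTorch's share of VRAM; this is what produces the
# "set by user-supplied memory fraction" line in the OOM message.
# torch.cuda.set_per_process_memory_fraction(0.9, device)
```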

op7418 avatar Sep 03 '24 03:09 op7418

Can you check if things have improved on the latest commit?

comfyanonymous avatar Sep 03 '24 06:09 comfyanonymous

I have the same problem: I can't use two LoRAs at the same time, and it slows down a lot on a 4070 Ti Super. With one LoRA it is also slower than normal. I'm using flux.dev16.

sabutay avatar Sep 03 '24 06:09 sabutay

Can you check if things have improved on the latest commit?

[screenshot]

It's the exact same issue on the latest commit you did for the fp8 LoRA. I used your 2.0 version: the exact same issue, maybe even a little slower.

axel578 avatar Sep 03 '24 09:09 axel578

Can you check if things have improved on the latest commit?

[screenshot]

It's the exact same issue on the latest commit you did for the fp8 LoRA. I used your 2.0 version: the exact same issue, maybe even a little slower.

The issue you're experiencing is related to shared memory. The best solution is to configure the NVIDIA driver so the GPU does not spill VRAM into shared system memory (the "CUDA - Sysmem Fallback Policy" setting in the NVIDIA Control Panel, set to "Prefer No Sysmem Fallback"). If this isn't possible, you should use the --disable-smart-memory option to minimize VRAM usage. The next option to consider is the --reserve-vram option.
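For reference, a hedged sketch of how those options might be passed when launching ComfyUI from its repository directory (the paths and flag values here are assumptions; adjust them for your install):

```python
# Hedged sketch: launch ComfyUI with the memory-related flags mentioned above.
# The working directory, interpreter, and reserve value are assumptions.
import subprocess
import sys

subprocess.run([
    sys.executable, "main.py",
    "--disable-smart-memory",  # unload models aggressively instead of keeping them in VRAM
    "--reserve-vram", "1.0",   # keep ~1 GB of VRAM free for the OS / other applications
])
```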

ltdrdata avatar Sep 04 '24 04:09 ltdrdata

I was having horrendous slowdown issues with the previous portable release, sometimes with multiple minutes per iteration, which made batch running impossible. However, updating to the latest release v0.2.2, with its updated PyTorch (cu124), has me back down to 2.6 s/it.

7950x, 64GB DDR5, RTX 3080 10GB

Might fix others' issues too?

Stoobs avatar Sep 08 '24 09:09 Stoobs

I ran into the same problem a few days ago. I wanted to test a self-trained LoRA by generating 50 different images. Here's what helped me increase the speed and make it more predictable. It still isn't perfect, but it works for a bunch of images until it gets stuck again.

Deactivate the cuda_malloc memory allocation, found under Settings > Server. Use the KSampler Advanced node with its end_at_step value left at the default 1000, but keep steps at the desired value (I usually use ~20). If you are using the DWPose Estimator, it helps to add the Save Pose Keypoints node. Keep ControlNet images at reasonable sizes; max dimensions under 600 px work fine for me (see the resize sketch below).
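A minimal sketch of that last point, downscaling a ControlNet reference so its longest side stays under ~600 px (the threshold is just the value that worked for this commenter, and the file names are placeholders):

```python
from PIL import Image

def resize_for_controlnet(path, max_dim=600):
    """Downscale an image so its longest side is at most max_dim pixels."""
    img = Image.open(path)
    scale = max_dim / max(img.size)
    if scale < 1.0:  # only shrink, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    return img

# Example: prepare a pose/depth reference before running the workflow.
resize_for_controlnet("pose_reference.png").save("pose_reference_600.png")
```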

Altogether I could decrease rendering time to between 1.5 and 4.5 s/it. It still varies wildly, but at least it now works almost stably.

Update: I got further improvements by using the ViT-L-14-TEXT-detail-improved-hiT-GmP-TE-only-HF CLIP model and flan_t5_xxl_TE-only_Q8_0.gguf instead of the standard Flux dev text encoders. This brought the time down to 1.5 s/it and improved the overall quality of fine details and prompt stability, especially limbs. It also saves a few MB of VRAM...

Update 2: Switching to the FLUX.1-dev-ControlNet-Union-Pro-2.0 ControlNet model and integrating the nunchaku-flux.1-dev diffusion model (https://huggingface.co/nunchaku-tech/nunchaku-flux.1-dev), which is basically an SVDQuant-quantized INT4 FLUX.1-dev model, solved the problem once and for all. I'm now at ~1 s/it, which is tremendously good for an RTX 3080, while batch generation is stable and reliable. It even saves roughly 2 GB of VRAM on my system.

RTX 3080 Laptop, 16 GB VRAM, ComfyUI v0.3.67, PyTorch 2.8.0+cu129, Flux1 dev fp8, Flux1 dev ControlNet Union

eccentricworx avatar Nov 19 '25 14:11 eccentricworx