ComfyUI-ELLA

Extremely slow generation

Open farvend opened this issue 1 year ago • 4 comments

Hi, my GPU is a GTX 1660 (6 GB), and while using ELLA my speed drops from 1.5 it/s to 5 s/it. It seems like the CUDA cores are almost unused and my CPU does most of the calculations instead.

[screenshots: low CUDA usage, high CPU usage]

farvend avatar Jul 24 '24 12:07 farvend

Can I take a look at your workflow?

JettHu avatar Aug 16 '24 11:08 JettHu

[screenshots] I've got a similar problem too, but in my case it wasn't the KSampler that was taking too much time, it was the ELLA Text Encode. I'm using an entry-level gaming laptop with these specs:

  • Ryzen 5 3550H
  • GTX 1650 (4 GB)
  • 24 GB RAM

Same as OP, it uses the CPU instead when encoding. I don't know if that's how it's supposed to work, as I have a very limited programming background. Thanks.

jcatsuki avatar Sep 18 '24 01:09 jcatsuki

@jcatsuki I'm not sure if you are still interested in this, but ELLA does indeed use the CPU instead of the GPU when encoding text, unless one of these conditions applies:

  1. ComfyUI reports that you are in the NORMAL_VRAM or HIGH_VRAM state (which I will assume is the case, since ComfyUI will use shared memory), and you have a GPU that works with FP16 (according to ComfyUI's code, the 16xx series does not).
  2. You forcibly tell ComfyUI to use only the GPU via the --gpu-only flag, but that can slow down the diffusion process considerably if you don't have enough VRAM.
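The two conditions above can be summarized as a small decision function. This is a hypothetical, simplified sketch of the logic described, not ComfyUI's actual code (the real implementation lives in comfy/model_management.py and handles many more cases):

```python
from enum import Enum

class VRAMState(Enum):
    LOW_VRAM = 0
    NORMAL_VRAM = 1
    HIGH_VRAM = 2

def text_encoder_device(vram_state, gpu_supports_fp16, gpu_only=False):
    """Return "cuda" when text encoding may run on the GPU, else "cpu"."""
    if gpu_only:
        # --gpu-only forces everything onto the GPU, regardless of VRAM state
        return "cuda"
    if vram_state in (VRAMState.NORMAL_VRAM, VRAMState.HIGH_VRAM) and gpu_supports_fp16:
        # enough VRAM and an FP16-capable card
        return "cuda"
    # e.g. GTX 16xx series: FP16 is considered unreliable, so fall back to CPU
    return "cpu"

# A GTX 1660/1650 (16xx series, FP16 blacklisted) lands on the CPU:
print(text_encoder_device(VRAMState.NORMAL_VRAM, gpu_supports_fp16=False))  # cpu
```

This is why the 16xx-series cards reported in this thread end up encoding on the CPU even with enough free VRAM.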

An alternative that works for me, but requires a little bit of hacky code editing, is to edit model.py in the ComfyUI-ELLA directory like so:

At roughly line 118, change model_management.text_encoder_device() to model_management.get_torch_device(); that function exists in ComfyUI and will try to select any available acceleration device.

class T5TextEmbedder:
    def __init__(self, pretrained_path="google/flan-t5-xl", max_length=None, dtype=None, legacy=True):
-        self.load_device = model_management.text_encoder_device()
+        self.load_device = model_management.get_torch_device()

and at roughly line 312:

class ELLA:
    def __init__(self, path: str, **kwargs) -> None:
-        self.load_device = model_management.text_encoder_device()
+        self.load_device = model_management.get_torch_device()

This might not be the most elegant solution, but it works well for me, cutting the encoding time from about 6 minutes down to just a couple of seconds.

IMO, there should be an option on the ELLA node to choose the device: use the GPU when available (separate from ComfyUI's decision), or force GPU or CPU. I will make a pull request if I make the change.
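Such an option could be exposed as a combo input on the node. This is a hypothetical sketch following ComfyUI's custom-node conventions; the class name, input names, and the omitted device-resolution helper are all illustrative, not part of ComfyUI-ELLA:

```python
class EllaTextEncodeWithDevice:
    """Illustrative node exposing a device override for text encoding."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "text": ("STRING", {"multiline": True}),
                # "auto" keeps ComfyUI's own decision;
                # "gpu"/"cpu" force the encoder onto that device.
                "device": (["auto", "gpu", "cpu"],),
            }
        }

    RETURN_TYPES = ("CONDITIONING",)
    FUNCTION = "encode"
    CATEGORY = "ella"

    def encode(self, text, device):
        # A real implementation would map "auto" to ComfyUI's choice
        # (model_management.text_encoder_device()) and "gpu"/"cpu" to
        # torch devices, then run the T5 encoder there; omitted here.
        ...
```

Defaulting to "auto" would keep current behavior for everyone, while letting 16xx-series users opt in to GPU encoding explicitly.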

Chanakan5591 avatar Nov 23 '24 07:11 Chanakan5591

@Chanakan5591 Well done! You're my hero!

classronin avatar Jul 17 '25 13:07 classronin