T5 is not offloaded on MPS
Expected Behavior
While running SD3 or FLUX inference on MPS, I expected the T5 text encoder to be offloaded out of RAM after the prompt was encoded.
Actual Behavior
The T5 encoder stays in memory the whole time, both during and after inference.
Steps to Reproduce
Used the official FLUX workflow: https://comfyanonymous.github.io/ComfyUI_examples/flux/
Debug Logs
During inference of the fp16 model (24 GB) with fp16 T5, the --gpu-only flag, all CLIP devices set to mps, and PYTORCH_DEBUG_MPS_ALLOCATOR=1:
Attempting to release cached buffers (MPS allocated: 34.20 GB, other allocations: 1.60 GB)
Attempting to release cached buffers (MPS allocated: 33.36 GB, other allocations: 2.87 GB)
Attempting to release cached buffers (MPS allocated: 32.34 GB, other allocations: 3.53 GB)
If T5 is not used for SD3 inference, the allocated RAM is lower by exactly 10 GB.
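For what it's worth, you can confirm whether T5 stays resident without the allocator debug flag by reading PyTorch's MPS memory counters directly. A minimal sketch (the torch.mps functions are real PyTorch APIs; the surrounding script is just illustrative):

```python
import torch

def report_mps_memory(tag: str) -> None:
    # Bytes currently held by live tensors on the MPS device.
    allocated = torch.mps.current_allocated_memory() / 1024**3
    # Total bytes the Metal driver has reserved for the process.
    driver = torch.mps.driver_allocated_memory() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GB, driver: {driver:.2f} GB")

report_mps_memory("before prompt encode")
# ... run the FLUX/SD3 workflow here ...
report_mps_memory("after sampling")  # ~10 GB higher if T5 is still resident
```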
Other
I'm not good at programming, but it looks like the problem is connected to the fact that, when the --gpu-only flag is not used, all text encoders are offloaded to the CPU (which on MPS is the same memory as the GPU's) rather than unloaded fully.
MPS only has unified memory, unlike PCs with separate RAM and VRAM. If T5 is unloaded on MPS, it is flushed from memory entirely, and when you send a new prompt it will need to be reloaded from disk, which is slow. I think it is better to keep it in memory.
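For illustration, truly unloading on MPS (as opposed to offloading to "cpu", which lands in the same unified memory pool) would look roughly like this sketch; the Linear module stands in for the real T5 encoder, and this is not ComfyUI's actual unload path:

```python
import gc
import torch

# Stand-in for the loaded T5 encoder; the unload steps are the same.
text_encoder = torch.nn.Linear(4096, 4096).to("mps")

# Moving it with .to("cpu") would not free anything on Apple Silicon,
# because "cpu" and "mps" share one unified memory pool. The module
# has to be dropped entirely:
del text_encoder         # remove the last strong reference to the weights
gc.collect()             # let Python reclaim the underlying tensors
torch.mps.empty_cache()  # return the MPS allocator's cached blocks to the OS
```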
That's the point: right now I'm unable to run Flux at all because the T5 encoder is kept in memory. If there were a way to unload it, it would at least be possible to run Flux on MPS. Running slowly is better than not running at all.
@comfyanonymous is it possible to unload T5 from memory (not offload it to the CPU) on MPS when --lowvram is used?
Are you trying to perform text encoding only once and not change the prompt afterwards?
In that case, you can set up an independent workflow that caches the result of CLIPTextEncode through the Backend Cache nodes of the Inspire Pack and doesn't use the CLIP loader.
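If you'd rather script it than use the nodes, the same idea (encode once, persist the conditioning, skip loading T5 on later runs) can be sketched like this; encode_prompt is a placeholder for the real CLIPTextEncode step, not an actual ComfyUI or Inspire Pack API:

```python
import hashlib
import os
import torch

def encode_prompt(prompt: str) -> torch.Tensor:
    # Stand-in for the real CLIPTextEncode / T5 forward pass.
    return torch.randn(1, 256, 4096)

def get_conditioning(prompt: str, cache_dir: str = "cond_cache") -> torch.Tensor:
    # Key the cache on the prompt text so a changed prompt re-encodes.
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    path = os.path.join(cache_dir, f"{key}.pt")
    if os.path.exists(path):
        # Cache hit: T5 never needs to be loaded at all for this prompt.
        return torch.load(path)
    cond = encode_prompt(prompt)
    torch.save(cond, path)
    return cond
```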
Cool! Thanks, looks like that solves the problem!