T5 is not offloaded on MPS
Expected Behavior
While running SD3 or FLUX inference on MPS, I expected the T5 text encoder to be offloaded out of RAM after the prompt was encoded.
Actual Behavior
The T5 encoder stays in memory the whole time, both during and after inference.
Steps to Reproduce
Used the official FLUX workflow: https://comfyanonymous.github.io/ComfyUI_examples/flux/
Debug Logs
During inference of the fp16 model (24 GB) with fp16 T5, the --gpu-only flag, all CLIP devices set to mps, and PYTORCH_DEBUG_MPS_ALLOCATOR=1:
Attempting to release cached buffers (MPS allocated: 34.20 GB, other allocations: 1.60 GB)
Attempting to release cached buffers (MPS allocated: 33.36 GB, other allocations: 2.87 GB)
Attempting to release cached buffers (MPS allocated: 32.34 GB, other allocations: 3.53 GB)
If T5 is not used for SD3 inference, the allocated RAM is lower by exactly 10 GB.
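For what it's worth, you can confirm whether T5 stays resident without the allocator debug flag by reading PyTorch's MPS memory counters directly. A minimal sketch (the torch.mps functions are real PyTorch APIs; the surrounding script is just illustrative):

```python
import torch

def report_mps_memory(tag: str) -> None:
    # Bytes currently held by live tensors on the MPS device.
    allocated = torch.mps.current_allocated_memory() / 1024**3
    # Total bytes the Metal driver has reserved for the process.
    driver = torch.mps.driver_allocated_memory() / 1024**3
    print(f"[{tag}] allocated: {allocated:.2f} GB, driver: {driver:.2f} GB")

report_mps_memory("before prompt encode")
# ... run the FLUX/SD3 workflow here ...
report_mps_memory("after sampling")  # ~10 GB higher if T5 is still resident
```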
Other
I'm not good at programming, but it looks like the problem is connected to the fact that, when the --gpu-only flag is not used, all text encoders are offloaded to the CPU (which on MPS is the same memory as the GPU's) rather than unloaded fully.
MPS only has unified memory, unlike PCs with separate RAM and VRAM. If T5 is unloaded on MPS, it is flushed from memory entirely, and when you send a new prompt it will need to be reloaded from disk, which is slow. I think it is better to keep it in memory.
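For illustration, truly unloading on MPS (as opposed to offloading to "cpu", which lands in the same unified memory pool) would look roughly like this sketch; the Linear module stands in for the real T5 encoder, and this is not ComfyUI's actual unload path:

```python
import gc
import torch

# Stand-in for the loaded T5 encoder; the unload steps are the same.
text_encoder = torch.nn.Linear(4096, 4096).to("mps")

# Moving it with .to("cpu") would not free anything on Apple Silicon,
# because "cpu" and "mps" share one unified memory pool. The module
# has to be dropped entirely:
del text_encoder         # remove the last strong reference to the weights
gc.collect()             # let Python reclaim the underlying tensors
torch.mps.empty_cache()  # return the MPS allocator's cached blocks to the OS
```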
That's the point: right now I'm unable to run Flux at all because the T5 encoder is kept in memory. If there were a way to unload it, it would at least be possible to run Flux on MPS. Running slowly is better than not running at all.
@comfyanonymous is it possible to unload T5 from memory (not offload it to the CPU) on MPS when --lowvram is used?
Are you trying to perform text encoding only once and not change the prompt afterwards?
In that case, you can set up an independent workflow that caches the result of CLIPTextEncode through the Backend Cache nodes of the Inspire Pack and doesn't use the CLIP loader.
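If you'd rather script it than use the nodes, the same idea (encode once, persist the conditioning, skip loading T5 on later runs) can be sketched like this; encode_prompt is a placeholder for the real CLIPTextEncode step, not an actual ComfyUI or Inspire Pack API:

```python
import hashlib
import os
import torch

def encode_prompt(prompt: str) -> torch.Tensor:
    # Stand-in for the real CLIPTextEncode / T5 forward pass.
    return torch.randn(1, 256, 4096)

def get_conditioning(prompt: str, cache_dir: str = "cond_cache") -> torch.Tensor:
    # Key the cache on the prompt text so a changed prompt re-encodes.
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()[:16]
    path = os.path.join(cache_dir, f"{key}.pt")
    if os.path.exists(path):
        # Cache hit: T5 never needs to be loaded at all for this prompt.
        return torch.load(path)
    cond = encode_prompt(prompt)
    torch.save(cond, path)
    return cond
```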
Cool! Thanks, looks like that solves the problem!