Setting to prevent GPU overheating

Open wmsouza opened this issue 1 year ago • 7 comments

I added a --max-gpu-temperature setting to prevent GPU overheating.

I know it is not the most elegant place to put it, but I hooked it into the progress bar updater because any heavy task (like KSampler) uses it.

I have a custom node project that does something similar, but it fails to prevent overheating when the batch size is too big or KSampler has too many steps. This PR fixes that because it waits for the GPU to cool down between steps.
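
For readers who want to see the shape of the idea, here is a minimal sketch, not the code from this PR. It assumes a progress bar object whose update method is called once per step. The names `ProgressBar`, `wait_until_cool` and `parse_args` are hypothetical, the default of 0 (check disabled) is an assumption, and the temperature reader is passed in as a plain callable so the sketch does not depend on any particular GPU library.

```python
import argparse
import time
from typing import Callable


def wait_until_cool(read_temp: Callable[[], float], max_temp: float,
                    poll_seconds: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Block while read_temp() is above max_temp; return how many times we slept."""
    waits = 0
    # A max_temp of 0 (or less) disables the check entirely.
    while max_temp > 0 and read_temp() > max_temp:
        sleep(poll_seconds)
        waits += 1
    return waits


class ProgressBar:
    """Hypothetical stand-in for a progress bar that heavy nodes update every step."""

    def __init__(self, total: int, read_temp: Callable[[], float], max_temp: float,
                 sleep: Callable[[float], None] = time.sleep):
        self.total = total
        self.current = 0
        self.read_temp = read_temp
        self.max_temp = max_temp
        self.sleep = sleep

    def update(self, value: int = 1) -> None:
        self.current = min(self.current + value, self.total)
        # The cooldown check rides along with every progress update,
        # so it runs between sampler steps and not only between nodes.
        wait_until_cool(self.read_temp, self.max_temp, sleep=self.sleep)


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Default of 0 means "no limit"; this default is an assumption of the sketch.
    parser.add_argument("--max-gpu-temperature", type=float, default=0.0)
    return parser.parse_args(argv)
```

Because the check sits inside `update`, a long sampling run pauses as soon as the reading exceeds the limit, no matter how many steps or images remain.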

wmsouza avatar Nov 02 '23 14:11 wmsouza

I think this should be placed here: https://github.com/comfyanonymous/ComfyUI/blob/dd116abfc48e8023bb425c2dd5bd954ee99d7a9c/execution.py#L122

Furthermore, if https://github.com/comfyanonymous/ComfyUI/pull/931 gets merged in the future, I believe it should also be incorporated here.

ltdrdata avatar Nov 02 '23 14:11 ltdrdata

I think you should underclock your GPU or increase your fan speed if you have temperature issues instead of doing this.

comfyanonymous avatar Nov 02 '23 15:11 comfyanonymous

I think you should underclock your GPU or increase your fan speed if you have temperature issues instead of doing this.

I tried, but it doesn't work on a laptop. Even with the fan on turbo mode and a cooling base, it still reaches 95+ with too many steps/images, and then the whole machine shuts down. I've been using this code, and for the first time I've been able to run AnimateDiff with a batch size of 16+.

There are more people with the same issue, as there are at least 2 custom nodes that try to do the same:

  • https://github.com/meap158/ComfyUI-GPU-temperature-protection
  • https://github.com/wmsouza/comfyui-gpucooldown (mine)

But a node cannot help between steps, which is why I made this PR.

wmsouza avatar Nov 02 '23 15:11 wmsouza

My laptop also heats up a lot. I was using a custom node I made to sleep for a few seconds, but your PR made me realize I could just watch the temperature and wait for it to cool down. I just made a new node for myself using your idea, and it works better than waiting a fixed time. Thanks for the idea, hahaha. I use my fork of ComfyUI with some PRs (#1566, #1572) that are not merged into the main branch, so I can connect it anywhere in the workflow.

image

With #1566 I can have generic types. In this node I use one for the flow input, so I can connect any type and just output the same value, which ensures the order. I also use generic types for the dependency inputs so they can accept any type. The dependency inputs use #1572 to allow multiple inputs, so I can have any number of dependencies, and the node will run after all of them are complete.

I also use those PRs for other nodes like Math, Arrays, Conditions, etc. They are really helpful for the kind of nodes I make for myself.

jn-jairo avatar Nov 03 '23 03:11 jn-jairo

  • But a node can not help between steps

Can't you create a node that installs a model wrapper function (there's an interface for that) and just waits when it detects overheating before forwarding to the actual model?

EDIT: Yeah, you can. Do something like this in your node:

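def doit(self, model):
    # The wrapper is called in place of the model function,
    # so the wait happens before each forward pass.
    def wrapper(modelfn, *args, **kwargs):
        wait_gpu_cooldown()  # your own helper; blocks until the GPU has cooled
        return modelfn(*args, **kwargs)

    m = model.clone()  # patch a clone so the input model is left untouched
    m.set_model_unet_function_wrapper(wrapper)
    return (m,)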
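
The snippet above leaves `wait_gpu_cooldown()` undefined. One possible version is sketched below, assuming an NVIDIA GPU with `nvidia-smi` on the PATH. The 80/70 thresholds, the polling interval, and the helper names `parse_temperatures` and `read_gpu_temperature` are assumptions for illustration, not taken from this PR or from either custom node.

```python
import subprocess
import time


def parse_temperatures(output: str) -> list[int]:
    """Parse nvidia-smi output with one integer temperature per line (one line per GPU)."""
    return [int(line) for line in output.splitlines() if line.strip()]


def read_gpu_temperature() -> int:
    """Return the hottest GPU temperature reported by nvidia-smi (NVIDIA only)."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return max(parse_temperatures(output))


def wait_gpu_cooldown(limit=80, resume=70, poll_seconds=2.0,
                      read=read_gpu_temperature, sleep=time.sleep):
    """If the GPU is above `limit`, block until it drops to `resume` or below."""
    if read() <= limit:
        return  # cool enough, continue immediately
    # Wait for a lower "resume" threshold so the run does not
    # stop and start on every step while hovering around the limit.
    while read() > resume:
        sleep(poll_seconds)
```

Spawning `nvidia-smi` on every model call adds some overhead, so a real node might check only every few steps or use a library binding instead.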

asagi4 avatar Nov 05 '23 19:11 asagi4

I think you should underclock your GPU or increase your fan speed if you have temperature issues instead of doing this.

I'd strongly look into this if I were you, even if you've got it fixed in the meantime. With laptops it's even more likely to be a stuck fan (dustballs) or something. I haven't had a CPU or GPU shut down a machine via heat in over 15 years, and I still suspect that one was really a PSU trip caused by low voltage delivered over the wiring in an old apartment and/or the power supply itself being one of the out-of-spec models that show up from time to time. Yours is still doing the right thing by killing the system, but heat kills NVMe drives at insane rates when it isn't coming from the drives themselves, because they can't just throttle and dissipate it... the way laptops are designed, that's usually not a well-cooled component.

I don't know what kind of GPU you have, but both major brands benefit to different degrees from undervolting (AMD) or voltage curve adjustment with Afterburner (NVIDIA). You can keep the same speed while running cooler, or in the case of something like the 4090, drop about 100W of power draw and lose 3% performance... according to gamers anyway. They're rarely using full load on their cards, so it's hard to say what that translates to when running a bunch of tensor cores that games don't even access at full blast.

NeedsMoar avatar Dec 20 '23 21:12 NeedsMoar

In my experience, it is usually the fan that loses power over time on these gaming laptops. My GPU is a 2070 Super (mobile). The CPU runs fine, but when the GPU runs at 100%, its temperature keeps going up until it hits the mid-90s, and then the machine shuts down.

But with this PR it has been running fine and hasn't shut down once. I am even able to use AnimateDiff with XL models.

wmsouza avatar Dec 21 '23 01:12 wmsouza