GPU crashes regularly while using API (and other times)
There are times when I can run ComfyUI manually for hours, doing lots of different things, swapping out LoRAs, etc., but there are certain situations that consistently crash the GPU. And I don't mean the computer or the video device driver; I mean the video card stops responding on the bus. The computer no longer detects the video card at all. Attempting to install the device driver tells me there is no video card in the system. A reboot is required to get the video card to respond again.
The most consistent way to get it to happen is for me to run a fairly simple prompt over and over using the API (I'm changing the prompt with every run of four images). It crashes pretty consistently every 100 images generated. If I kill and restart the ComfyUI server every 90 images, then it crashes about every 200 images. I'm trying some different things, like pausing between restarts to see if it behaves the same, but it takes a while to run that way.
I don't know how to take ComfyUI out of the loop to see if it's really just Stable Diffusion that is causing the problem, so I figured I'd come here to ask if there's something I could be doing to isolate the issue (or to stop having the issue at all). Also, it seems the failure rate increased after I did the last pull of ComfyUI, but frankly I'm still learning the system, so I've been doing a lot of very different things. This is my first project where I'm generating several thousand images driven by a program.
Here is my configuration:
Processor: 12th Gen Intel(R) Core(TM) i9-12900K, 3187 MHz, 16 cores, 24 logical processors
Motherboard: Z690 Taichi
Installed physical memory (RAM): 32.0 GB DDR5
GPU: NVIDIA GeForce RTX 3090, 24 GB VRAM
Video driver: 31.0.15.4592 (2023-10-19)
DirectX 12
Here are the things that I've essentially ruled out:
- This is not a temperature issue. I'm fully water cooled and the video card hovers right around 62C (140F), about 70F above ambient; it never thermal throttles.
- It's not a video memory issue. It loads about 10.5GB while running the iterations and momentarily jumps to 17.3GB as it generates the final image files, still showing plenty of headroom (and this is extremely consistent, since the changes I'm making to the prompt are small).
- The GPU never gets above 10% utilization.
- System RAM never gets above 17GB.
- CPU usage hovers right around 30%.
- I'm not doing any overclocking of the CPU or the GPU during these runs.
- I've never had this problem while playing high-end games or running heavy GPU loads like transcoding.
Can you share the prompts that you run and how often they crash?
I shared how often they crash, so I'm not sure what other info you want, but I've attached the prompt JSON that I'm running through my API app currently. The app replaces the character string "{xxxx}" in the JSON with changing values. ApiWorkflow_v2.json
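In case it's useful context, the driver side of the app is doing essentially something like this. This is a rough sketch rather than my exact code; it assumes a default local ComfyUI instance on port 8188, the workflow saved in API format, and the literal "{xxxx}" placeholder described above:

```python
import json
import urllib.request

WORKFLOW_PATH = "ApiWorkflow_v2.json"        # workflow exported in API format
COMFY_URL = "http://127.0.0.1:8188/prompt"   # default ComfyUI API endpoint

def queue_prompt(value: str) -> None:
    """Substitute the placeholder and queue one run of the workflow."""
    with open(WORKFLOW_PATH, "r", encoding="utf-8") as f:
        workflow_text = f.read().replace("{xxxx}", value)
    payload = json.dumps({"prompt": json.loads(workflow_text)}).encode("utf-8")
    req = urllib.request.Request(
        COMFY_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)

# Queue a few variations; the generation itself runs asynchronously
# on the ComfyUI server.
for value in ["variation one", "variation two", "variation three"]:
    queue_prompt(value)
```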
Thanks for the suggestions. I did have HWInfo, but you're right, I was just being lazy and looking at Task Manager. I'll need to set up remote monitoring with HWInfo, since once the crash happens it's hard to get the data out (plus it takes long enough that it's hard to be sitting here when it happens). One thing I noticed already in HWInfo is that the GPU hot spot temperature is spiking to 104C while the average is still in the low 60C range. That by itself is not an enormous issue, as it's still within operating range, but it's closer to the limit than I would like. I can make some modifications to improve the airflow and see if I can bring that down.
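For the remote monitoring piece, something as simple as polling nvidia-smi and appending to a file should capture the last few samples before a crash. This is just a sketch: nvidia-smi ships with the NVIDIA driver and the query fields below are standard, but the file name and polling interval are arbitrary.

```python
import subprocess
import time

FIELDS = "timestamp,temperature.gpu,memory.used,utilization.gpu,power.draw"

# Line-buffered so the last samples are on disk even if the machine locks up.
with open("gpu_log.csv", "a", buffering=1) as log:
    while True:
        try:
            result = subprocess.run(
                ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=True,
            )
            log.write(result.stdout)
        except Exception as exc:
            # If the card has dropped off the bus, nvidia-smi itself fails;
            # logging that failure timestamps the moment of the crash.
            log.write(f"{time.ctime()}, nvidia-smi failed: {exc}\n")
        time.sleep(5)
```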
I also did some testing that indicated that pausing between large runs allowed it to run a lot longer. I set it up to pause for 2 minutes between 90-image runs and I got it to generate 600 images before it crashed. I haven't run the same test again to see if it stays consistent. Now, I've never done any GPU coding, but this makes me wonder if there isn't a garbage collection issue in the card. Otherwise, what could be happening in those 2 minutes except allowing the card to do some housekeeping that keeps it from reaching whatever limit is causing it to fail?
Another interesting thing is that GPU Core Load is 100% while a job is running. It really makes me wonder what Task Manager is monitoring when it shows "Utilization" at 10%. Without a job running, the system shows core load at 2%, which makes more sense.
I'm about two-thirds of the way through the project I'm working on now; after that I'll focus on repeatability, and maybe see whether a more complex workflow causes it to happen sooner. I'll also set up remote monitoring of the computer and see if that lets me catch it doing something unexpected. I'm also curious whether running the exact same workflow (without changing the prompts) causes the same problem.
but this makes me wonder if there isn't a garbage collection issue in the card?
Check the ram usage throughout the run
The "apparent" RAM usage on the GPU is still flat in HWInfo. I'm talking through my hat a bit here, basing my guesses on what I know about virtual machines, which may be completely invalid when talking about a GPU system, and graphics sub-systems. In a virtual machine you can show memory usage as being consistent while still having a tremendously fragmented free list. Garbage collection not only cleans up "active" memory, but might consolidate space to make it possible to load larger images into contiguous memory. My experience is the BITBLT works better (meaning faster) with contiguous segments. I have to assume that the image generation process is creating a lot of small blocks of memory for the images stored between the iterations and for generating the png files at the end (when I see the spike in GPU memory usage). Maybe this is getting cleaned up somehow during the hiatus? This is all conjecture, obviously.
If you're unable to prevent fragmentation, you might consider regularly restarting the ComfyUI server to clear out the memory and start fresh.
I guess you missed that my app restarts the ComfyUI server every time it takes a break in the processing. It restarts the server and then pauses for the wait period (without sending it any new workflows or calling any of the API endpoints). Then it runs the requisite number of workflows, kills the process, restarts it, waits, and so on, until the GPU stops responding.
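To make the cycle concrete, it's roughly this shape. This is a simplified sketch: the batch size, pause length, and launch command are placeholders, and run_batch stands in for the API driver posted earlier.

```python
import subprocess
import time

BATCH_SIZE = 90        # workflows to queue per cycle
PAUSE_SECONDS = 120    # idle time after each restart, before queueing again
SERVER_CMD = ["python", "main.py"]   # however ComfyUI is normally launched

def run_batch(n: int) -> None:
    """Placeholder for the API driver that queues n workflows and waits for them."""
    ...

while True:
    server = subprocess.Popen(SERVER_CMD)   # restart the ComfyUI server
    time.sleep(PAUSE_SECONDS)               # let it (and the GPU) sit idle
    run_batch(BATCH_SIZE)                   # run the requisite number of workflows
    server.terminate()                      # kill the process, then loop
    server.wait()
```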
@42degrees I don't think it's necessary to restart the ComfyUI server, as it primarily runs on your CPU and leverages your RAM.
The associated processes need to be cleaned up.
I'm not sure what you mean by "associated processes", but I should try just pausing without restarting ComfyUI (it seemed like a good idea to cycle the whole thing, since I have no idea where the problem is). I also don't get your assertion that it primarily runs on the CPU. It's driving SDXL, so it's definitely pushing work up to the GPU. I don't pretend to understand how the SD checkpoint is loaded and how the workflow proceeds, but surely ComfyUI is orchestrating all that, so it stands to reason it could be part of the problem.
Right now I'm having good luck with a 2-minute break for every 40 images generated. I've also been using the computer the whole time, browsing, watching Rumble, and I've now generated 960 images without a crash. I wonder if having the computer locked changes the relationship of SDXL to the GPU subsystem, or maybe the device driver simply doesn't do as well in a locked state (it's not likely they test games a lot with the console locked). But I have to go to bed soon, so we'll see if it keeps kicking butt once I've locked it again.
That sounds like a hardware or driver issue.
Since there are no other reports of the software making GPUs crash I don't think it's related to the software. Software can make a GPU crash but the driver should reset it and it should recover without having to reboot the computer.
GPUs have an MMU so memory fragmentation shouldn't be an issue at all. I know pytorch does some dumb things with memory but I don't think it's dumb enough for memory fragmentation to be an actual issue.
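If someone wants to rule that out directly, a quick check is to compare what the PyTorch caching allocator has reserved from the driver against what live tensors actually use. These are standard torch.cuda calls; running them from a debug script or a custom node inside the ComfyUI Python environment is left as an exercise.

```python
import torch

stats = torch.cuda.memory_stats()
reserved = stats["reserved_bytes.all.current"]    # held from the driver by the caching allocator
allocated = stats["allocated_bytes.all.current"]  # actually in use by live tensors
print(f"reserved:  {reserved / 2**20:.0f} MiB")
print(f"allocated: {allocated / 2**20:.0f} MiB")

# A large, steadily growing gap between the two would point at caching or
# fragmentation inside the allocator rather than at the hardware.
torch.cuda.empty_cache()   # returns cached, unused blocks to the driver
```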
I wonder if having the computer locked changes the relationship of SDXL to the GPU subsystem, or maybe the device driver simply doesn't do as well in a locked state (it's not likely they test games a lot with the console locked).
@42degrees it does.
In my experience, on Windows, sleep states and low-power mode have a profound impact on GPU performance.
Additionally, drivers have power-saving features that turn things off when in low-power mode. I noticed those features particularly with WLAN drivers.
My computer is set to High Performance in all states. I do a lot of work rendering, converting, etc. at night on this rig so I don't allow the computer to slow down. But I do lock the console, just as a matter of cleanliness and security. Many of those processes like video rendering use the GPU all night in a locked state. I have never had a problem with the video card crashing like this. I have had the device driver crash and restart, but remember this is the GPU hardware failing to communicate when this crashes. I can remote desktop into the computer after it has crashed and the computer acts as though I had pulled the GPU from the computer.
GPUs have an MMU so memory fragmentation shouldn't be an issue at all. I know pytorch does some dumb things with memory but I don't think it's dumb enough for memory fragmentation to be an actual issue.
I defer to your undoubted greater understanding of this system. I don't pretend to comprehend it as I'm still a newb in SD and ComfyUI. All I can report is my experience and what I have, and haven't, tried. I can only guess as to what might be happening under the hood.
As of now, I've been running my API process for several hours (9:48 PM to 2:00 AM), generating 1,348 images so far with no crash. The graphs in HWInfo look totally consistent from run to run (temp, CPU, GPU). I'm still running 40 images, a 2-minute break (restarting ComfyUI), and then looping. I'm about to go to bed and it will be interesting in the morning to see whether it crashed or continued all night (I hope so, I have several thousand more images to generate). Fingers crossed that this is the magic sauce to keep it running.
I figured I'd leave you with some eye candy, in case you see something I'm missing (assuming you might be in a whole different time zone):
Update: the process ran for another two and a half hours and then apparently triggered a system reboot. When I came in this morning it had rebooted; I'll have to look into what might have triggered it. In the end I generated 2,344 images yesterday, which is more than any previous batch.
Well, crap, it was Microsoft Installer that triggered the reboot. Apparently a service pack was installed and forced a system reboot. That sucks. So, I'll try again tonight after I finish my day job, maybe it will run all night.
Well, I have an update. Night before last the process crashed within a couple hours of me locking the terminal. Last night I left the computer unlocked and it didn't crash until 9 AM this morning. It generated 3,496 images last night, which is by far the most I've been able to do overnight since I started this project.
@42degrees will you be able to get the resource consumption (CPU, memory, GPU, VRAM) right before the crash? I'd also like to compare it with the initial resource consumption and with the consumption after 100 and 200 images have been generated.
Yes, that's still my plan, but I've been pretty busy this week, so setting up remote monitoring and storing of that data didn't happen. This weekend's going to be pretty busy with Thanksgiving coming up next week. Hopefully I'll have time to set that up.
To you too! Hope you're getting the whole family together!
So, an update. I've been able to keep the process running for over 21 hours. It generated 3,812 images between 2AM last night and 11PM tonight (when it did, unfortunately, crash). I'm only missing 1,766 images out of a bit more than 19,000 images.
Here's my new project. It's still a work in progress at about 90% (who knew that moving 17,000 images up to my website would be so painful).
https://sdxl.42degreesdesign.com/
You're welcome to share that if you like. Except for the "real" headshots it was all generated with ComfyUI and SDXL 1.0.
@42degrees checked it out, it's quite cool! I couldn't figure out what the secondary features are, though!
OK, well maybe I need to adjust the definition. My interpretation of how SDXL works is that it gives a lower priority to each successive term in the prompt as it processes it. So, for example, if you say "big nose" very early in the prompt and later in the prompt you say "small nose", it will prioritize the first term over the second. This is called weighting the tokens.
In this case I am giving it the names of the models in a specific order. If I say "Cindy" and then after that I say "Samantha", it will prioritize what it thinks of as Cindy more than what it thinks of as Samantha. Since these are all famous people it has seen photos of tagged with their names, it will also prioritize them based on the strength of its "knowledge" of each person. So if Samantha had 1,000 pictures used to train the model and Cindy had 200, then no matter what order you use in the prompt, it's likely to make the final image look like Samantha. If you prioritize Samantha first, then with the combined weight of being first and of 1,000 pictures seen, the image will probably look like Samantha and have almost no Cindy in it.
However, since I have no idea what pictures the model was actually trained with, I can only go by the order, so I call the first model the Primary and the second model the Secondary. While looking at all these pictures as I generate them, I've noticed a clear trend: for models that are "equally famous" (if that's even measurable), the base features of the resulting "baby" come from the Primary (skin color, hair color, eye placement, and probably 100 things that I'm not equipped to notice) and other things come from the Secondary (face shape seems like a common thing that can be traced to the secondary).
So, that's my long-winded description of what I mean by "secondary" features. I welcome any suggestions for a better description (since I can't put all the above on the main site). I did talk about it some in the "Why I did it" page linked at the bottom. Also, you can test how important the ordering is by clicking the "Make Primary" button on the secondary model. This swaps their positions, and their priorities, and you can see immediately what primary versus secondary does to the images.
"Why I did it" page is quite inspiring
Thank you, that's great to hear.
Well, I think I found the sweet spot. If I keep my monitors from going to sleep, have a video running in a web browser window, and have it pausing for 2 minutes every 80 images, it hasn't crashed in over 10 hours of generating images. I need to try removing some of the pause time and maybe increasing the number of images before restarting ComfyUI to see whether those are really contributing. During this time I generated 4,632 images.
Note that I have still had it crash while I was running ComfyUI manually, but it doesn't do it very often, and usually it's when I have been running a long queue (which is really annoying because I lose the queue).
Something I'm not sure if I mentioned: when the graphics card crashes, ComfyUI also exits. I don't know which one crashes first. I still have not gotten around to setting up a remote monitoring station. I just started a project that may require me to generate 520,000 images, but I'm going to try to pare that down quite a bit. I suspect a large number of the images will end up being generic, but it will still be a lot of images.
I have the same problem. Ubuntu 22.04, T4, Google Cloud. You don't even need to start a generation to reproduce it; just leave ComfyUI idle and wait 6-10 hours. During that time ComfyUI will stop, without any errors or any information in the log about the stop. I have tried different GPU drivers and nodes; the result is always the same.
So, you are only seeing ComfyUI crash, or are you seeing your video card disappear from the PCIe bus as well?
For my part, I'll try just leaving ComfyUI up for a long time and see if I see any issues. I have not done that, except when I was running a bunch of images through the API.
I launched the ComfyUI test process inside pm2 (for auto-restart) and I see this result:
===============================================================================
2024-01-03T13:19:03: PM2 log: --- New PM2 Daemon started ----------------------------------------------------
2024-01-03T13:19:03: PM2 log: Time                 : Wed Jan 03 2024 13:19:03 GMT+0000 (Coordinated Universal Time)
2024-01-03T13:19:03: PM2 log: PM2 version          : 5.3.0
2024-01-03T13:19:03: PM2 log: Node.js version      : 12.22.9
2024-01-03T13:19:03: PM2 log: Current arch         : x64
2024-01-03T13:19:03: PM2 log: PM2 home             : /home/g4you_app_dev/.pm2
2024-01-03T13:19:03: PM2 log: PM2 PID file         : /home/g4you_app_dev/.pm2/pm2.pid
2024-01-03T13:19:03: PM2 log: RPC socket file      : /home/g4you_app_dev/.pm2/rpc.sock
2024-01-03T13:19:03: PM2 log: BUS socket file      : /home/g4you_app_dev/.pm2/pub.sock
2024-01-03T13:19:03: PM2 log: Application log path : /home/g4you_app_dev/.pm2/logs
2024-01-03T13:19:03: PM2 log: Worker Interval      : 30000
2024-01-03T13:19:03: PM2 log: Process dump file    : /home/g4you_app_dev/.pm2/dump.pm2
2024-01-03T13:19:03: PM2 log: Concurrent actions   : 2
2024-01-03T13:19:03: PM2 log: SIGTERM timeout      : 1600
2024-01-03T13:19:03: PM2 log: ===============================================================================
2024-01-03T13:19:03: PM2 log: App [main:0] starting in -fork mode-
2024-01-03T13:19:03: PM2 log: App [main:0] online
2024-01-03T15:27:09: PM2 log: Stopping app:main id:0
2024-01-03T15:27:09: PM2 log: pid=847 msg=failed to kill - retrying in 100ms  (repeated ~15 times)
2024-01-03T15:27:10: PM2 log: Process with pid 847 still alive after 1600ms, sending it SIGKILL now...
2024-01-03T15:27:11: PM2 log: pid=847 msg=failed to kill - retrying in 100ms  (repeated ~13 times)
2024-01-03T15:27:12: PM2 log: App [main:0] exited with code [0] via signal [SIGKILL]
2024-01-03T15:27:12: PM2 log: pid=847 msg=process killed
2024-01-03T15:27:12: PM2 log: App [main:0] starting in -fork mode-
2024-01-03T15:27:12: PM2 log: App [main:0] online
2024-01-03T17:36:42: PM2 log: Stopping app:main id:0
2024-01-03T17:36:42: PM2 log: pid=2525 msg=failed to kill - retrying in 100ms  (repeated ~15 times)
2024-01-03T17:36:44: PM2 log: Process with pid 2525 still alive after 1600ms, sending it SIGKILL now...
2024-01-03T17:36:44: PM2 log: pid=2525 msg=failed to kill - retrying in 100ms  (repeated ~10 times)
2024-01-03T17:36:45: PM2 log: App [main:0] exited with code [0] via signal [SIGKILL]
2024-01-03T17:36:45: PM2 log: pid=2525 msg=process killed
2024-01-03T17:36:45: PM2 log: App [main:0] starting in -fork mode-
2024-01-03T17:36:45: PM2 log: App [main:0] online
The terminal session also crashes at this time. This happens as a result of long-term inactivity of ComfyUI (maybe I'm wrong). The system is crashing and I can't figure out what's causing it. Can someone tell me which log I should look at?
Additionally, I periodically have this error (1-2 times a day):
0|main | ERROR:aiohttp.server:Error handling request
0|main | Traceback (most recent call last):
0|main |   File "/home/g4you_app_dev/.local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 350, in data_received
0|main |     messages, upgraded, tail = self._request_parser.feed_data(data)
0|main |   File "aiohttp/_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
0|main | aiohttp.http_exceptions.BadStatusLine: 400, message:
0|main |   Invalid method encountered:
0|main |     b'\x16\x03\x01'
0|main |       ^
0|main | ERROR:aiohttp.server:Error handling request
0|main | Traceback (most recent call last):
0|main |   File "/home/g4you_app_dev/.local/lib/python3.10/site-packages/aiohttp/web_protocol.py", line 350, in data_received
0|main |     messages, upgraded, tail = self._request_parser.feed_data(data)
0|main |   File "aiohttp/_http_parser.pyx", line 557, in aiohttp._http_parser.HttpParser.feed_data
0|main | aiohttp.http_exceptions.BadHttpMessage: 400, message:
0|main |   Pause on PRI/Upgrade:
0|main |     b''
0|main |       ^
So, you are only seeing ComfyUI crash, or are you seeing your video card disappear from the PCIe bus as well?
For my part, I'll try just leaving ComfyUI up for a long time and see if I see any issues. I have not done that, except when I was running a bunch of images through the API.
In my opinion, this is a server crash or freeze; the terminal session is also interrupted at the same time. Unfortunately I cannot check whether the GPU is still available under these conditions.
So was there ever a "real" solution for this? I have run ComfyUI since it was released and generated thousands of images with no issues. Then two weeks ago I started running into a similar issue. I was running an XYPlot on a new series I want to work on and the display froze up completely. I know the rest of the computer runs fine, as I have background tasks running, and when I go back and look at the logs (and sometimes the results) they complete with no problems. But that's only after I reboot to recover the display. Since that first time, it happens all the time now. I have run my monitoring and my liquid-cooled RTX 4090 never gets above 69C (my shutoff is set to 80C), I never run out of VRAM or RAM (I have 256GB), and the CPU runs stable.
Obviously something is happening with the GPU, but I can't figure out what.
I updated Comfy and its Python, as well as my NVIDIA drivers (currently using the Game Ready drivers, but I'm going to switch back to the Studio driver shortly). I had a backup copy of my full ComfyUI from before this issue started, tried running it, and ran into the same issue. So I'm assuming it's not an update issue of any sort, as the updates happened after I made that backup.
I'm running Python 3.10.9 for Comfy. I run my GPU REALLY hard, as I work with high-performance scientific visualization, and I have absolutely no issues with any of my viz packages or any of my own code. I also work heavily with all the Adobe products, DaVinci Resolve, Omniverse, and several 3D and VFX packages, and have no problems. I do a lot of astrophotography with PixInsight (and there are many issues with its poor Windows optimization, especially with WBPP) and it runs flawlessly as well; if any application were going to cause me GPU issues, one would assume it would be PixInsight. So only Comfy is causing me issues. I have not gone back and tested SD with A1111, but I plan on reinstalling it and testing again. I'd prefer to keep using Comfy with my own custom nodes; I just simply can't, since it keeps forcing me to reboot every few images.
So was there ever a "real" solution for this? I have run ComfyUI since it was released and generated thousands of images with no issues. Then two weeks ago I started running into a similar issue. I was running an XYPlot on a new series I want to work on and the display froze up completely. I know the rest of the computer runs fine, as I have background tasks running, and when I go back and look at the logs (and sometimes the results) they complete with no problems. But that's only after I reboot to recover the display. Since that first time, it happens all the time now. I have run my monitoring and my liquid-cooled RTX 4090 never gets above 69C (my shutoff is set to 80C), I never run out of VRAM or RAM (I have 256GB), and the CPU runs stable.
It sounds very similar to what I'm seeing. I have not seen a solution. I pulled the latest ComfyUI from GitHub and it seemed to start happening more often. But frankly, I've switched projects for the time being, so I haven't done anything with SD since the holiday break. I expect to get back to it in a week or two, but I don't have a lot of faith that things will be any better. I started conversations in several different places and collected a scant few other people who might be seeing the same thing. Unfortunately, no solid patterns have been identified.
Since you mentioned it, I'm running the Game Ready NVIDIA drivers. I do mostly video processing, After Effects, transcoding, and gaming on this rig. Frankly, gaming tends to put more strain on the GPU than any application except Stable Diffusion (the others are time-consuming but don't raise the GPU temperature to the same extent).
If you come up with ideas, I'd love to hear them and would be willing to try them.
Recently I have had a very similar problem to this. Using cu124-megapak, on Linux 6.10.10-arch1-1 with Docker.
Nvidia drivers: Driver Version: 560.35.03 CUDA Version: 12.6
I have two GPUs with two Comfy instances. The ComfyUI instance on the 2070 GPU is not updated, and it does not have this problem. It's able to generate fine.
The 4090 GPU is using the latest ComfyUI version with everything updated. It seems that after some time it will get stuck and refuse to generate. It is as if the socket connection drops completely and is unable to reconnect. I can still restart the ComfyUI instance from the interface as usual; POST requests appear to work fine. Then it will work for X hours or minutes before hitting the same issue.
My hunch is that it isn't related to drivers or hardware. Prior to the upgrade I had the exact same drivers and did not face this issue.
I am still trying to pinpoint exactly where the issue is. The best I have at the moment is that somehow the socket connection is broken and it cannot function as expected.
Edit: I forgot to mention that there is also nothing in the logs I've seen that indicates what the problem may be.