[Bug]: amdgpu freeze resulting in GPU reset on large workloads

Open infinity0 opened this issue 4 months ago • 7 comments

Checklist

  • [X] The issue has not been resolved by following the troubleshooting guide
  • [X] The issue exists on a clean installation of Fooocus
  • [X] The issue exists in the current version of Fooocus
  • [X] The issue has not been reported before recently
  • [ ] The issue has been reported before but has not been fixed yet

What happened?

I understand that amdgpu support is experimental, but I want to document this issue to guide others who run into it. My system specs:

  • GPU: Sapphire Nitro+ AMD Radeon RX 7800 XT
  • RAM: 64GB
  • Swap: 16GB
  • GLX version: Mesa 24.0.2-1

Steps to reproduce the problem

When I ask Fooocus to do "too much", my display freezes, including keyboard and mouse, and it appears that I have to reboot the system. I later found that this is not necessary: I can log in via SSH and restart the display server, e.g. systemctl restart lightdm. I observe this in dmesg:

[Mar27 23:05] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6436600, emitted seq=6436602
[  +0.000160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 321623 thread Xorg:cs0 pid 321627
[  +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[  +0.001546] amdgpu 0000:03:00.0: amdgpu: Guilty job already signaled, skipping HW reset
[  +0.000011] [drm] Skip scheduling IBs!
[  +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
[  +0.000005] [drm] Skip scheduling IBs!
[  +0.000004] [drm] Skip scheduling IBs!
[  +0.000002] [drm] Skip scheduling IBs!
[  +0.000003] [drm] Skip scheduling IBs!
[  +0.000002] [drm] Skip scheduling IBs!

Also, apparently there is something online called the "AMD GPU reset bug" - but my GPU does not seem to be affected by this in that I can trigger this bug many times, cause my screen to freeze, observe GPU reset(n) succeeded! via dmesg where n keeps going up by 2 each time, restart my display server via systemctl restart lightdm, and everything is fine afterwards, and I can start Fooocus again to do more stuff. In other words, this bug is not that bug.
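For anyone who wants to track the reset counter described above, here is a minimal sketch in Python. It assumes the success line always has the form shown in the dmesg excerpt ("amdgpu: GPU reset(n) succeeded!"); the function name `reset_counts` is made up for illustration and is not part of Fooocus or the kernel tooling.

```python
import re
from typing import List

# Matches kernel log lines such as:
#   [  +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!
# The exact wording is taken from the excerpt above and may differ between
# kernel versions (assumption).
RESET_OK = re.compile(r"amdgpu: GPU reset\((\d+)\) succeeded!")


def reset_counts(dmesg_text: str) -> List[int]:
    """Return the n of every 'GPU reset(n) succeeded!' line, in log order."""
    return [int(m.group(1)) for m in RESET_OK.finditer(dmesg_text)]


sample = (
    "[  +0.000127] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!\n"
    "[  +0.000001] amdgpu 0000:03:00.0: amdgpu: GPU reset(10) succeeded!\n"
    "[  +0.000005] [drm] Skip scheduling IBs!\n"
)
print(reset_counts(sample))
```

The text passed in would be the output of dmesg captured over SSH; reading the kernel log may require root, depending on system settings.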

What is "too much"? For me, with 64GB of RAM, it is normally something like running a Windows VM, watching an HD video, generating Upscale 2x with Performance = Quality in Fooocus, and running Upscayl, all at the same time. This is easy to avoid manually; I can just be careful when running Fooocus.

HOWEVER, you can also easily trigger it by giving Fooocus an input image that is quite big, even if the computer is doing nothing else. For example this one, 12 megapixels:

[Image: "Harvesting", oil painting by David Cox Jnr (causes GPU freeze)]

This is more annoying to avoid because sometimes you just want to drag and drop random shit from online into Fooocus and not have to worry about how big it is.

What should have happened?

Ideally, Fooocus should throw an exception in these cases, with something like "Out Of Memory" (or whatever the real reason is) rather than letting the GPU freeze up and reset. I'm not sure how feasible this is however.
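Until something like that exists, one possible workaround is to check the image size before handing it to Fooocus. The sketch below is not Fooocus code: `fit_within` and the `MAX_PIXELS` budget are hypothetical, and whether a smaller input actually avoids the freeze is an assumption, since the real cause has not been confirmed.

```python
import math
from typing import Tuple

# Hypothetical pixel budget; the right value depends on the GPU and on what
# else is running, and would need to be found by experiment.
MAX_PIXELS = 4_000_000


def fit_within(width: int, height: int, max_pixels: int = MAX_PIXELS) -> Tuple[int, int]:
    """Return (width, height) scaled down, keeping the aspect ratio, so that
    width * height stays within max_pixels. Smaller images are unchanged."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    pixels = width * height
    if pixels <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / pixels)
    # Round down so the result does not exceed the budget.
    return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))


# A 12-megapixel input, like the painting above, reduced to the budget.
print(fit_within(4000, 3000))
```

With Pillow, the dimensions could be read from `Image.open(path).size` and the result passed to `Image.resize` before dropping the file into Fooocus.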

What browsers do you use to access Fooocus?

Google Chrome

Where are you running Fooocus?

Locally

What operating system are you using?

Debian GNU/Linux

Console logs

dmesg logs are above. As for Fooocus logs, in fact Fooocus itself does not notice the problem, and there are no logs. The screen freezes, but you can run Fooocus inside a tmux session and attach to it by logging in via SSH, to confirm that there are in fact no logs and no errors. Nothing is output on the Fooocus tmux console, even though dmesg says that the GPU has already been reset. You can even tell Fooocus to quit with Ctrl-C after this, and it will tell you it's trying to exit, but this won't succeed and it just hangs there until you restart your display server.

Additional information

No response

infinity0 avatar Mar 27 '24 23:03 infinity0

> Also, apparently there is something online called the "AMD GPU reset bug" - but my GPU does not seem to be affected by this in that I can trigger this bug many times, cause my screen to freeze, observe GPU reset(n) succeeded! via dmesg where n keeps going up by 2 each time, restart my display server via systemctl restart lightdm, and everything is fine afterwards, and I can start Fooocus again to do more stuff. In other words, this bug is not that bug.

Well, occasionally the GPU fails to reset, and then I do have to reboot the machine. So perhaps I'm also affected by the reset bug. This is rare, however; most of the time I can simply restart the display server without rebooting.

[Mar28 15:09] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[  +0.000145] amdgpu: failed to remove hardware queue from MES, doorbell=0x1802
[  +0.000002] amdgpu: MES might be in unrecoverable state, issue a GPU reset
[  +0.000003] amdgpu: Failed to evict queue 1
[  +0.000001] amdgpu: Failed to evict process queues
[  +0.000002] amdgpu: Failed to evict queues of pasid 0x8009
[  +0.000019] amdgpu 0000:03:00.0: amdgpu: GPU reset begin!
[ .. hangs here, we don't get "GPU reset(n) succeeded!" as above ..]
[ .. attempts to restart the display server hang, instead of succeeding as above .. ]

Anyway, it's clear that these are two separate issues.
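To tell the two cases apart from an SSH session, a small Python sketch like the following could be used. It assumes the "GPU reset begin!" and "GPU reset(n) succeeded!" wording from the two dmesg excerpts in this thread; `last_reset_completed` is an illustrative name, not an existing tool.

```python
import re
from typing import Optional

# Wording taken from the dmesg excerpts in this thread; other kernel
# versions may phrase these messages differently (assumption).
RESET_BEGIN = re.compile(r"amdgpu: GPU reset begin!")
RESET_DONE = re.compile(r"amdgpu: GPU reset\(\d+\) succeeded!")


def last_reset_completed(dmesg_text: str) -> Optional[bool]:
    """Return None if no reset was logged, True if the most recent
    'GPU reset begin!' was followed by a 'succeeded' line, else False."""
    state: Optional[bool] = None
    for line in dmesg_text.splitlines():
        if RESET_BEGIN.search(line):
            state = False  # a reset started and has not finished yet
        elif state is False and RESET_DONE.search(line):
            state = True  # the pending reset finished
    return state
```

A result of True suggests that restarting the display server is likely enough; False matches the hung case above, where a reboot was needed.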

infinity0 avatar Mar 28 '24 15:03 infinity0

Have you searched for this issue in all of the other open discussions/issues for AMD? https://github.com/lllyasviel/Fooocus/issues?q=is%3Aissue+is%3Aopen+amd

Seems to be a duplicate of https://github.com/lllyasviel/Fooocus/issues/1690, please check this out.

mashb1t avatar Mar 29 '24 18:03 mashb1t

As I explained in great detail both in that ticket and this ticket, this ticket is not a duplicate of that ticket. Please re-open.

infinity0 avatar Mar 31 '24 00:03 infinity0

I'm sorry to say that I personally can't help you to debug and get to the bottom of the issue as I don't have access to an AMD GPU. Hopefully the community can support here.

mashb1t avatar Mar 31 '24 06:03 mashb1t

No problem, I am not expecting an easy fix soon - the ticket is more for documentation purposes and to help others, the important thing being you don't need to reboot if you can SSH in.

infinity0 avatar Apr 01 '24 11:04 infinity0