
GPU Reset Behavior on Out-of-Memory with Intel Arc and xe Driver

Open arguellocarlos opened this issue 5 months ago • 3 comments

Hi there,

I'm looking for some insight into how GPU reset is handled when running into out-of-memory (OOM) issues.

My system is running:

  • Kernel: 6.15.9.arch1-1
  • Intel OneAPI Base Toolkit: 2025.2
  • Intel Compute Runtime: 25.27.34303.5
  • Driver: xe

Hardware:

  • AMD Ryzen 9900X
  • Intel Arc B580
  • 48GB DDR5 RAM @ 6200MHz

When I run AI workloads like image generation, the GPU occasionally runs out of memory. When that happens, the entire desktop freezes and becomes completely unresponsive, requiring a hard reboot to recover. I'm particularly wary of hard resets since I have a couple of mechanical drives configured in a RAID array, and I'd really prefer to avoid any risk of data corruption or filesystem damage.

arguellocarlos avatar Aug 07 '25 01:08 arguellocarlos

I just realized I can connect my monitor to the Ryzen 9900X's integrated GPU and pass through the Intel Arc GPU exclusively to my AI containers. That should help isolate the desktop environment from any GPU-related crashes. That said, I'm still curious about how the GPU reset process works when it runs out of memory.
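To double-check which GPU is actually driving the monitor after such a change, something like the following might help. It is only a minimal sketch: it assumes the usual Linux sysfs layout under `/sys/class/drm` (card directories like `card0`, connector directories like `card0-DP-1` with a `status` file), and the function name `drm_card_summary` is made up for illustration.

```python
import os
from pathlib import Path


def drm_card_summary(root="/sys/class/drm"):
    """Map each DRM card to its kernel driver and its connected outputs.

    Assumes the standard sysfs layout; returns an empty dict if `root`
    does not exist (for example on a non-Linux system).
    """
    summary = {}
    root = Path(root)
    if not root.is_dir():
        return summary
    for entry in sorted(root.iterdir()):
        name = entry.name
        # Skip render nodes ("renderD128") and other non-card entries.
        if not name.startswith("card"):
            continue
        # "card0" is the card itself, "card0-DP-1" is one of its connectors.
        card, _, connector = name.partition("-")
        info = summary.setdefault(card, {"driver": None, "connected": []})
        if not connector:
            # The driver symlink points at the bound kernel driver (e.g. xe).
            driver_link = entry / "device" / "driver"
            if driver_link.exists():
                info["driver"] = os.path.basename(os.path.realpath(driver_link))
        else:
            status_file = entry / "status"
            if status_file.is_file() and status_file.read_text().strip() == "connected":
                info["connected"].append(connector)
    return summary


if __name__ == "__main__":
    for card, info in drm_card_summary().items():
        print(card, info["driver"], info["connected"])
```

If the card bound to `xe` shows no connected outputs while the integrated GPU's card does, the desktop should be running on the iGPU.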

arguellocarlos avatar Aug 07 '25 03:08 arguellocarlos

AFAIK GPU resets happen only if:

  • You have GPU hang detection enabled (it is often disabled for compute workloads, since those can run for a long time and trigger it even though the workload is not stuck), and
  • A batch takes longer than the configured hang timeout.

(Resets could escalate from compute engine -> whole GPU -> whole bus, if a more targeted reset doesn't resolve the problem.)
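As a rough illustration of the two conditions and the escalation order described above, here is a toy model. It is not driver code and does not reflect how the xe driver is actually implemented; every name in it (`ResetScope`, `should_reset`, `escalate`) is hypothetical.

```python
from enum import Enum


class ResetScope(Enum):
    # Ordered from most targeted to broadest, as described above.
    ENGINE = 1
    GPU = 2
    BUS = 3


def should_reset(hang_detection_enabled, batch_runtime_s, hang_timeout_s):
    """A reset is considered only if hang detection is on AND the batch
    has run longer than the configured timeout."""
    return hang_detection_enabled and batch_runtime_s > hang_timeout_s


def escalate(reset_fixes_problem):
    """Try each scope in order and stop at the first one that helps.

    `reset_fixes_problem` is a caller-supplied function taking a
    ResetScope and returning True if that reset resolved the hang.
    Returns the list of scopes that were attempted.
    """
    attempted = []
    for scope in ResetScope:  # Enum iterates in definition order
        attempted.append(scope)
        if reset_fixes_problem(scope):
            break
    return attempted


if __name__ == "__main__":
    if should_reset(True, batch_runtime_s=120.0, hang_timeout_s=10.0):
        # Pretend only a whole-GPU reset clears the hang.
        print(escalate(lambda scope: scope is ResetScope.GPU))
```

With hang detection disabled, `should_reset` never returns true in this model, which matches the point that a long-running compute batch would then not trigger a reset at all.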

I think what happens on VRAM OOM is that the kernel driver starts to page memory between VRAM and system memory, and such paging can slow things down considerably. I would expect larger VRAM allocations to fail rather than cause OOM, though (I'm not sure whether VRAM allocations even allow overcommit, like Linux does for normal system memory).

Disclaimer: I'm just another user, not a kernel/user-space driver developer.

eero-t avatar Aug 20 '25 17:08 eero-t

@arguellocarlos could it be the same problem? https://gist.github.com/savely-krasovsky/18865f4295b876b5d7ab7373e61111ee I believe it is usually followed by GPU resets.

savely-krasovsky avatar Nov 03 '25 19:11 savely-krasovsky