GPU Reset Behavior on Out-of-Memory with Intel Arc and xe Driver
Hi there,
I'm looking for some insight into how GPU reset is handled when running into out-of-memory (OOM) issues.
My system is running:
- Kernel: 6.15.9.arch1-1
- Intel OneAPI Base Toolkit: 2025.2
- Intel Compute Runtime: 25.27.34303.5
- Driver: xe
Hardware:
- AMD Ryzen 9 9900X
- Intel Arc B580
- 48GB DDR5 RAM @ 6200MHz
When I run AI workloads like image generation, the GPU occasionally runs out of memory. When that happens, the entire desktop freezes and becomes completely unresponsive, requiring a hard reboot to recover. I'm particularly wary of hard resets since I have a couple of mechanical drives configured in a RAID array, and I'd really prefer to avoid any risk of data corruption or filesystem damage.
I just realized I can connect my monitor to the Ryzen 9900X's integrated GPU and pass through the Intel Arc GPU exclusively to my AI containers. That should help isolate the desktop environment from any GPU-related crashes. That said, I'm still curious about how the GPU reset process works when it runs out of memory.
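For the iGPU/dGPU split described above, it helps to know which DRM card belongs to which driver before handing one to a container. The following is only a minimal sketch, assuming the standard Linux sysfs layout (`/sys/class/drm/cardN/device/vendor` and the `device/driver` symlink); the function name `list_drm_cards` is made up for illustration. PCI vendor ID 0x8086 is Intel and 0x1002 is AMD.

```python
from pathlib import Path
import re


def list_drm_cards(sysfs_root="/sys/class/drm"):
    """Return (card name, PCI vendor id, kernel driver) for each DRM card."""
    cards = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return cards
    for entry in sorted(root.iterdir()):
        # Keep only top-level nodes such as card0; skip connectors
        # (card0-DP-1) and render nodes (renderD128).
        if not re.fullmatch(r"card\d+", entry.name):
            continue
        device = entry / "device"
        vendor_file = device / "vendor"
        vendor = vendor_file.read_text().strip() if vendor_file.is_file() else "unknown"
        driver_link = device / "driver"
        # The driver entry is a symlink whose target name is the driver (e.g. xe).
        driver = driver_link.resolve().name if driver_link.exists() else "unknown"
        cards.append((entry.name, vendor, driver))
    return cards


if __name__ == "__main__":
    for name, vendor, driver in list_drm_cards():
        print(f"{name}: vendor={vendor} driver={driver}")
```

The card names it prints correspond to the nodes under `/dev/dri`, so the card bound to `xe` is the one to expose to the AI containers, while the desktop stays on the other one.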
AFAIK GPU resets happen only if:
- You have GPU hang detection enabled (which is often disabled for compute workloads, since those can run long enough to trigger it even though the workload is not stuck), and
- A batch takes longer than the configured hang timeout.
(Resets can escalate from compute engine -> whole GPU -> whole bus if a more targeted reset doesn't resolve the problem.)
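To make the two conditions and the escalation order above concrete, here is a toy model in Python. It is not driver code and does not reflect how xe implements any of this; the names `reset_triggered`, `escalate_reset`, and `RESET_LEVELS` are invented for illustration, and the logic only restates the description in this reply.

```python
# Escalation order as described in the reply above (an assumption, not driver source).
RESET_LEVELS = ("engine", "gpu", "bus")


def reset_triggered(hang_detection_enabled, batch_runtime_s, hang_timeout_s):
    """A reset is considered only if hang detection is on AND the batch overran."""
    return hang_detection_enabled and batch_runtime_s > hang_timeout_s


def escalate_reset(recovers_at):
    """Try each reset level in order; stop at the first one that recovers.

    `recovers_at` is the (hypothetical) level at which the problem goes away,
    or None if no level helps. Returns the list of levels attempted.
    """
    attempted = []
    for level in RESET_LEVELS:
        attempted.append(level)
        if level == recovers_at:
            break
    return attempted


if __name__ == "__main__":
    # A long compute batch with hang detection disabled never triggers a reset.
    print(reset_triggered(False, batch_runtime_s=600, hang_timeout_s=10))
    # With detection enabled, the same batch would.
    print(reset_triggered(True, batch_runtime_s=600, hang_timeout_s=10))
    print(escalate_reset("gpu"))
```

In this model, a long-running but healthy compute batch is indistinguishable from a hang once it exceeds the timeout, which is the reason given above for disabling hang detection on compute workloads.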
I think what happens on VRAM OOM is that the kernel driver starts to page memory between VRAM and system memory, and such paging can slow things down considerably. I would expect larger VRAM allocations to fail instead of causing OOM, though (I'm not sure whether VRAM allocations even allow overcommit, like Linux does for normal system memory).
Disclaimer: I'm just another user, not a kernel/user-space driver developer.
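For the system-memory side of that comparison, the kernel's overcommit policy can be read from `/proc/sys/vm/overcommit_memory` (0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting). The sketch below only reports that setting; it says nothing about how the xe driver treats VRAM allocations, and the helper names are made up for illustration.

```python
from pathlib import Path

# Meanings of vm.overcommit_memory as documented for the Linux kernel.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (default)",
    1: "always overcommit",
    2: "strict accounting, no overcommit beyond the commit limit",
}


def describe_overcommit(value):
    """Map a vm.overcommit_memory value to a human-readable description."""
    return OVERCOMMIT_MODES.get(value, f"unknown mode {value}")


def read_overcommit_mode(path="/proc/sys/vm/overcommit_memory"):
    """Read the current mode, or return None if the file is not available."""
    p = Path(path)
    if not p.is_file():
        return None
    return int(p.read_text().strip())


if __name__ == "__main__":
    mode = read_overcommit_mode()
    if mode is None:
        print("overcommit setting not available on this system")
    else:
        print(f"vm.overcommit_memory = {mode}: {describe_overcommit(mode)}")
```

With the default heuristic mode, large system-memory allocations can succeed up front and only cause trouble when the pages are actually touched, which is the behavior the question about VRAM overcommit is comparing against.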
@arguellocarlos could it be the same problem? https://gist.github.com/savely-krasovsky/18865f4295b876b5d7ab7373e61111ee It is usually followed by GPU resets, I believe.