irqbalance No space left on device


Open balrog-kun opened this issue 10 months ago • 7 comments

In activate_mapping(), -ENOSPC is treated as a transient error. However, we have recently been seeing tens of these on every loop (every 10s) until irqbalance is restarted. I imagine they disappear after a restart because some interrupts are only triggered during boot.
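For readers who want to reproduce the error outside irqbalance, here is a minimal sketch of how the failure surfaces from userspace. It assumes affinity is set by writing a CPU list to /proc/irq/<n>/smp_affinity_list (which requires root). The function name `try_set_affinity`, the `write` hook, and the return strings are hypothetical and are not irqbalance's own code.

```python
import errno

def _write_file(path, text):
    # Default writer: the kernel's error, if any, is raised on write/flush.
    with open(path, "w") as f:
        f.write(text)

def try_set_affinity(irq, cpu_list, write=_write_file):
    """Try to move `irq` to `cpu_list` (e.g. "3" or "0-3").

    Returns "ok" on success and "no_vectors" when the kernel answers
    ENOSPC ("No space left on device"). Any other error is re-raised.
    `write` is injectable so the logic can be exercised without root.
    """
    path = f"/proc/irq/{irq}/smp_affinity_list"
    try:
        write(path, cpu_list)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return "no_vectors"
        raise
    return "ok"
```

A caller could count how many consecutive loops return "no_vectors" for the same IRQ and target, which would help tell a genuinely transient failure from the persistent one described here.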

But this leads me to question whether there is a mechanism in irqbalance that would prevent too many IRQs from being piled onto one core. In theory, after enough IRQs are moved to a core, the load on it should go up and it should become less attractive to the logic in irqbalance. However, if these IRQs generate little load, is there anything that prevents irqbalance from getting stuck trying to move everything to a few cores, as I'm seeing?
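The concern can be illustrated with a toy model. This is not irqbalance's actual placement algorithm, only a sketch assuming a purely load-based choice with ties broken by lowest CPU index; `place_by_load` and its parameters are hypothetical. When every IRQ contributes zero load, the chosen core never becomes less attractive, so everything lands on it until its capacity runs out.

```python
def place_by_load(irq_loads, n_cpus, capacity=None):
    """Assign each IRQ to the currently least-loaded CPU.

    irq_loads: per-IRQ load estimates (arbitrary units).
    capacity:  optional per-CPU limit on the number of IRQs; a placement
               that would exceed it is rejected (stand-in for -ENOSPC).
    Returns (irqs_per_cpu, rejected_count).
    """
    load = [0.0] * n_cpus
    count = [0] * n_cpus
    rejected = 0
    for irq_load in irq_loads:
        # min() returns the first CPU among equally loaded candidates.
        target = min(range(n_cpus), key=lambda cpu: load[cpu])
        if capacity is not None and count[target] >= capacity:
            rejected += 1
            continue
        load[target] += irq_load
        count[target] += 1
    return count, rejected
```

With 300 zero-load IRQs, 4 CPUs and a capacity of 203, this model puts 203 IRQs on CPU 0 and rejects the remaining 97, while the same 300 IRQs with equal non-zero load spread evenly across the 4 CPUs.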

I checked 2 machines that showed this problem: one had 288 cores and 2400-2800 interrupts in /proc/irq/ depending on drivers, the other had 256 cores and ~1370 interrupts. I inspected some structures in the kernel (not my area) that enforce the per-core limits and found that arch/x86/kernel/apic/vector.c makes use of the structures and utilities in kernel/irq/matrix.c. Each core's cpumap->available was initially at 203, but when irq_set_affinity() was returning -ENOSPC, cpumap->available was at 0 for those cores.
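A quick back-of-the-envelope check of these numbers, as a sketch: it takes the 203 vectors per core from the observation above as given (the value may differ on other kernels or configurations), and the helper names are hypothetical. It also includes a rough tally of IRQs per CPU from CPU-list strings such as "0-3,8", which only approximates what the kernel's matrix allocator tracks.

```python
import math
from collections import Counter

def vector_budget(n_cpus, per_cpu_vectors, n_irqs):
    """Return (total vectors in the system, minimum CPUs needed for n_irqs)."""
    total = n_cpus * per_cpu_vectors
    min_cpus = math.ceil(n_irqs / per_cpu_vectors)
    return total, min_cpus

def parse_cpu_list(text):
    """Parse a kernel-style CPU list like "0-3,8" into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

def irqs_per_cpu(effective_lists):
    """Count IRQs per CPU from a mapping of IRQ name -> CPU-list string."""
    counts = Counter()
    for cpu_list in effective_lists.values():
        for cpu in parse_cpu_list(cpu_list):
            counts[cpu] += 1
    return counts
```

For the 288-core machine, `vector_budget(288, 203, 2800)` gives 58464 vectors in total and a minimum of 14 cores; for the 256-core machine, `vector_budget(256, 203, 1370)` gives 51968 and 7. Either system has far more vectors than interrupts overall, so the exhaustion has to be local to the cores that received too many IRQs. On kernels that expose /proc/irq/<n>/effective_affinity_list, the mapping passed to `irqs_per_cpu` could be filled from those files to see which cores are crowded.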

Mar 30 '24 01:03 balrog-kun
