irqbalance
No space left on device
In activate_mapping(), -ENOSPC is treated as a transient error. However, we have recently been seeing tens of these on every loop (every 10s) until irqbalance is restarted. I imagine they disappear after a restart because some interrupts are only triggered during boot.
But this leads me to question whether there is a mechanism in irqbalance that prevents too many IRQs from being piled onto one core. In theory, after enough IRQs are moved to a core, the load on it should go up and it should become less attractive to the logic in irqbalance. However, if these IRQs generate little load, is there anything that prevents irqbalance from getting stuck trying to move everything to a few cores, as I'm seeing?
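To make the concern concrete, here is a toy model. It is not irqbalance's actual placement code: the place() function, the load numbers and the greedy "least-loaded core wins" rule are all made up for illustration. The only figure taken from my machines is the 203 vectors per core.

```python
# Toy model only: this is NOT how irqbalance places IRQs.
VECTORS_PER_CORE = 203  # initial cpumap->available seen on my machines


def place(irq_loads, core_loads, cap=None):
    """Greedily put each IRQ on the currently least-loaded core.

    If cap is set, cores that already hold `cap` IRQs are skipped.
    Returns the number of IRQs assigned to each core.
    """
    loads = list(core_loads)
    counts = [0] * len(loads)
    for irq_load in irq_loads:
        candidates = [c for c in range(len(loads))
                      if cap is None or counts[c] < cap]
        target = min(candidates, key=lambda c: loads[c])
        loads[target] += irq_load
        counts[target] += 1
    return counts


if __name__ == "__main__":
    idle_irqs = [0.0] * 1000       # IRQs that add no measurable load
    cores = [0.1] + [0.5] * 7      # core 0 looks slightly less busy
    print("load only:    ", place(idle_irqs, cores))
    print("with IRQ cap: ", place(idle_irqs, cores, cap=VECTORS_PER_CORE))
```

With load as the only signal, every idle IRQ goes to the same core, far past the number of vectors it can hold. With a per-core count cap, the placement spills over to the next core once the cap is reached.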
I checked 2 machines that showed this problem: one had 288 cores and 2400-2800 interrupts in /proc/irq/ depending on drivers, the other had 256 cores and ~1370 interrupts. I inspected the kernel structures that enforce the per-core limits (not my area) and found that the relevant code is arch/x86/kernel/apic/vector.c, which makes use of the structures and utilities in kernel/irq/matrix.c. Each core's cpumap->available was initially at 203, but when irq_set_affinity() was returning -ENOSPC, cpumap->available was at 0 for those cores.
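For anyone who wants to check their own machine without poking at kernel structures, here is a rough userspace sketch that counts how many IRQs currently have each CPU in their effective affinity. It assumes the kernel exposes /proc/irq/<n>/effective_affinity_list (this depends on the kernel version and config), and the count is only a proxy for vector usage, since system and reserved vectors also come out of the same per-core pool. PER_CORE_LIMIT is just the 203 I observed, not a universal constant.

```python
import glob
import os
from collections import Counter

PER_CORE_LIMIT = 203  # initial cpumap->available observed above; may differ elsewhere


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list: IRQ has no effective CPU right now
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def irqs_per_cpu(proc_irq="/proc/irq"):
    """Count IRQs whose effective affinity includes each CPU."""
    counts = Counter()
    pattern = os.path.join(proc_irq, "[0-9]*", "effective_affinity_list")
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                for cpu in parse_cpu_list(f.read()):
                    counts[cpu] += 1
        except OSError:
            continue  # IRQ went away or the file is unreadable
    return counts


if __name__ == "__main__":
    # Show the ten most crowded CPUs.
    for cpu, n in irqs_per_cpu().most_common(10):
        print(f"cpu {cpu}: {n} IRQs (limit seen here: {PER_CORE_LIMIT})")
```

If a handful of CPUs show counts near the limit while most others are close to empty, that would match what I saw in cpumap->available.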