Improving IRQ Handling
Sharing some thoughts about what would be nice to see implemented in seL4 to make IRQ handling more efficient for virtualization use-cases specifically. @axel-h @kent-mcleod @lsf37
Kernel IRQ-level:
- Separation of Priority-drop from Deactivation (Feature supported by GICv2/3/4).
- Dynamically enable/disable Maintenance IRQ for the underflow condition based on the number of pending LRs.
Kernel vCPU-level:
- Adding a bitmap for each vCPU object with 2 bits per pending vIRQ, max 1024 vIRQs as the spec mentions.
- Add a pending vIRQ (on demand, via syscall), and helper functions to schedule pending vIRQs (on vCPU entry/kernel exit).
- Direct injection of the vtimer IRQ since when triggered it belongs to the current vCPU. We can add a new IRQ State - IRQVCPUstate.
- New vCPU syscall to bind an IRQHandler cap to a vCPU cap. (Instead of Notification Object, we can use a vCPU Object)
(i) I'm not aware of the seL4 verification story; I'd like to know if this brings any problems for it?
(ii) Also, I'd like to know your thoughts about improving the IRQ handling with some of the above ideas (if not all).
Wow, you must be psychic because I'd been thinking about making this issue earlier today also.
Separation of Priority-drop from Deactivation (Feature supported by GICv2/3/4).
Yes, I've been trying to find time to set up some benchmarks to measure and show the impact of the changes. (And then figure out all the edge cases to migrate the current design.)
Dynamically enable/disable Maintenance IRQ for the underflow condition based on the number of pending LRs.
Can you elaborate what you mean here?
Adding a bitmap for each vCPU object with 2 bits per pending vIRQ, max 1024 vIRQs as the spec mentions.
Is this for the state that is currently tracked at user level so that the kernel can directly refill LRs without involving userlevel?
New vCPU syscall to bind an IRQHandler cap to a vCPU cap. (Instead of Notification Object, we can use a vCPU Object)
I was thinking that this would kind of fit within the existing IRQHandler model. When an IRQ is delivered to a CPU running seL4, it looks up the slot in the IRQ array and if it finds a notification cap with send rights, the kernel knows to deliver a signal to that cap. (The seL4_IRQHandler_SetNotification invocation is what sets the notification cap in the IRQ handler slot.) So doing a similar thing with the VCPU would insert a VCPU cap into the IRQ array with a similar invocation. Then when the kernel received a physical IRQ, it could try to inject it into the VCPU if it found a cap. If there was no spare slot in the LRs it could fall back to a VCPU fault delivered to the fault handler. Additionally, this would give the kernel the context needed to be able to mark the virq with the HW flag required to auto-deactivate the interrupt at the end.
One reason this would be hard to implement though is if the VCPU wasn't active and associated with the currently running thread, it's not immediately obvious what the kernel should do, but it's likely worth figuring this out.
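A rough sketch of how that lookup might branch on the cap type in the IRQ slot, purely illustrative; storing a cap_vcpu_cap in the IRQ slot and vcpu_inject_or_fault() are assumptions, not existing seL4 code:

/* Hypothetical variant of the IRQ delivery path described above. */
static void handleIRQViaSlotCap(irq_t irq)
{
    cap_t cap = intStateIRQNode[IRQT_TO_IDX(irq)].cap;

    switch (cap_get_capType(cap)) {
    case cap_notification_cap:
        /* Existing behaviour: signal the bound notification object. */
        sendSignal(NTFN_PTR(cap_notification_cap_get_capNtfnPtr(cap)),
                   cap_notification_cap_get_capNtfnBadge(cap));
        break;
    case cap_vcpu_cap: /* assumed: a VCPU cap stored via a new invocation */
        /* Try to place the IRQ in a free LR with the HW bit set, so the
         * guest's EOI also deactivates the physical IRQ; if no LR is free
         * (or the VCPU is not current), fall back to a fault/overflow path. */
        vcpu_inject_or_fault(VCPU_PTR(cap_vcpu_cap_get_capVCPUPtr(cap)), irq);
        break;
    default:
        /* No cap bound: leave the IRQ masked, as today. */
        break;
    }
}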
Direct injection of the vtimer IRQ since when triggered it belongs to the current vCPU.
Once there's a way to auto-deactivate the vtimer IRQ then I think this would make sense.
(ii) Also, I'd like to know your thoughts about improving the IRQ handling with some of the above ideas (if not all).
If there's an obvious way to allow physical IRQs to be associated with a VCPU and allow all interrupt virtualization to be trivially done by the kernel without requiring user-level involvement, then I think there'd be a strong motivation. I'm not sure how user level should keep the emulated distributor and redistributor register maps synchronized with the kernel's view. Is this what the bitmap would be for? For tracking IRQ enabled/disabled state and pending/active? Currently there are no shared memory regions allowed between the kernel and user level, so this approach could be harder to verify, but querying/updating the bitmap values from user level via invocations would probably be acceptable as long as it wasn't needed for common operations...
Separation of Priority-drop from Deactivation (Feature supported by GICv2/3/4).
Yes, I've been trying to find time to set up some benchmarks to measure and show the impact of the changes. (And then figure out all the edge cases to migrate the current design.)
Is this what's implemented in #952? If so, I didn't measure any significant change in system performance when doing some simple IRQ heavy tests in a Linux VM (I think I did something like drop all caches and do time ls -R on /). That said, the change makes sense even if it doesn't make a big difference in performance.
Wow, you must be psychic because I'd been thinking about making this issue earlier today also.
Separation of Priority-drop from Deactivation (Feature supported by GICv2/3/4).
Yes, I've been trying to find time to set up some benchmarks to measure and show the impact of the changes. (And then figure out all the edge cases to migrate the current design.)
There is no negative impact in terms of performance; it might slightly increase it for the general case.
- This feature is crucial because it enables one main use-case (i) and solves one problem (ii):
(i) Virtualization: Trapping on EOI per HW IRQ is very inefficient, and we can and should avoid it. AFAIK, the EOI bit should be used for only one reason: emulating level-sensitive vIRQs. This is needed if our VMM needs to emulate an IRQ of some peripheral that has level-sensitive semantics.
- Req: Enabling the HW bit in the LRs requires separation of priority drop from deactivation.
(ii) Native:
IssueIRQHandlerCap - enables the IRQ
void setIRQState(irq_state_t irqState, irq_t irq)
{
    intStateIRQTable[IRQT_TO_IDX(irq)] = irqState;
#if defined ENABLE_SMP_SUPPORT && defined CONFIG_ARCH_ARM
    if (IRQ_IS_PPI(irq) && IRQT_TO_CORE(irq) != getCurrentCPUIndex()) {
        doRemoteMaskPrivateInterrupt(IRQT_TO_CORE(irq), irqState == IRQInactive, IRQT_TO_IDX(irq));
        return;
    }
#endif
    maskInterrupt(irqState == IRQInactive, irq);
}
There is a time window between issuing an IRQHandler cap and calling SetNotification during which, if the device fires the IRQ, it goes undelivered (a missed IRQ).
I know there is the assumption that the bootloader doesn't leave the devices in an undefined state, but this assumption aligns with an ideal world, not the real one. Bootloaders use devices, and there is no reason to have undelivered IRQs. From my point of view there is a clear advantage to having a completely separate syscall to just enable/disable or mask/unmask the IRQ on the GIC.
Separation of priority drop from deactivation will enable this last case, since mask/unmask is no longer used for IRQ handling.
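For reference, a minimal user-level sketch of the window described above, assuming the standard libsel4 invocations; the IRQ number, CNode, slot and caps are placeholders and error handling is omitted:

/* Hypothetical device IRQ setup sequence. */
void setup_device_irq(void)
{
    seL4_IRQControl_Get(seL4_CapIRQControl, DEVICE_IRQ, cnode, irq_slot, depth);
    /* setIRQState(IRQSignal, irq) has now unmasked the IRQ in the GIC (see the
     * code quoted above). If the device fires at this point, there is no
     * notification cap bound yet, so the interrupt goes undelivered. */
    seL4_IRQHandler_SetNotification(irq_handler, ntfn);
    seL4_IRQHandler_Ack(irq_handler);
}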
Also posted on ARM forum to see if there is any performance implication.
- https://community.arm.com/support-forums/f/architectures-and-processors-forum/54506/gicv2-vs-gicv3-differences
Dynamically enable/disable Maintenance IRQ for the underflow condition based on the number of pending LRs.
Can you elaborate what you mean here?
I think we should use this one to know when we can refill the LRs with more vIRQs. All hypervisors do the same, from Xen to KVM, Bao, xVisor, etc.; it is a common feature. The EOI bit is inefficient; we should avoid kernel traps. seL4 can also support this, and it is easy to implement.
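A minimal sketch of that idea for the GICv2 virtual interface, assuming the kernel tracks an overflow count per vCPU; the get/set_gic_vcpu_ctrl_hcr() helpers follow the style of seL4's existing GICH accessors but should be treated as assumptions:

#include <stdint.h>

#define GICH_HCR_UIE (1u << 1)  /* Underflow Interrupt Enable bit of GICH_HCR */

/* Assumed accessors for the virtual interface control register. */
uint32_t get_gic_vcpu_ctrl_hcr(void);
void set_gic_vcpu_ctrl_hcr(uint32_t hcr);

static void update_underflow_maintenance_irq(unsigned num_overflow_pending)
{
    uint32_t hcr = get_gic_vcpu_ctrl_hcr();
    if (num_overflow_pending > 0) {
        /* vIRQs are waiting outside the LRs: request a maintenance IRQ when
         * at most one LR remains valid, so the kernel can refill them. */
        hcr |= GICH_HCR_UIE;
    } else {
        /* Nothing left to refill: no need to trap on underflow. */
        hcr &= ~GICH_HCR_UIE;
    }
    set_gic_vcpu_ctrl_hcr(hcr);
}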
Adding a bitmap for each vCPU object with 2 bits per pending vIRQ, max 1024 vIRQs as the spec mentions.
Is this for the state that is currently tracked at user level so that the kernel can directly refill LRs without involving userlevel?
This is actually the overflow list of IRQs; we can implement it using a bitmap of vIRQs that can be indexed directly by vIRQ ID (with the requirement that vIRQ ID == pIRQ ID).
We need 2 bits because we can then encode up to 4 different types of injected vIRQ (see the sketch after this list):
- 00 - inactive
- 01 - pending sw vIRQ with trap on EOI
- 10 - pending sw vIRQ without trap on EOI
- 11 - pending hw vIRQ (HW mapped); the physical state gets automatically deactivated at the vIRQ EOI.
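A minimal sketch of such a 2-bit-per-vIRQ overflow map under those assumptions; the names, word size, and layout are illustrative, not existing seL4 code:

#include <stdint.h>

#define MAX_VIRQ      1024
#define VIRQ_BITS     2
#define VIRQ_PER_WORD (64 / VIRQ_BITS)

/* Encoding from the list above. */
typedef enum {
    VIRQ_INACTIVE       = 0,  /* 00 */
    VIRQ_SW_EOI_TRAP    = 1,  /* 01 */
    VIRQ_SW_NO_EOI_TRAP = 2,  /* 10 */
    VIRQ_HW_MAPPED      = 3,  /* 11 */
} virq_state_t;

typedef struct {
    uint64_t map[MAX_VIRQ / VIRQ_PER_WORD];  /* 1024 * 2 bits = 256 bytes */
} virq_overflow_map_t;

static inline void virq_map_set(virq_overflow_map_t *m, unsigned virq, virq_state_t s)
{
    unsigned word  = virq / VIRQ_PER_WORD;
    unsigned shift = (virq % VIRQ_PER_WORD) * VIRQ_BITS;
    m->map[word] = (m->map[word] & ~(3ull << shift)) | ((uint64_t)s << shift);
}

static inline virq_state_t virq_map_get(const virq_overflow_map_t *m, unsigned virq)
{
    unsigned word  = virq / VIRQ_PER_WORD;
    unsigned shift = (virq % VIRQ_PER_WORD) * VIRQ_BITS;
    return (virq_state_t)((m->map[word] >> shift) & 3);
}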
New vCPU syscall to bind an IRQHandler cap to a vCPU cap. (Instead of Notification Object, we can use a vCPU Object)
I was thinking that this would kind of fit within the existing IRQHandler model. When an IRQ is delivered to a CPU running seL4, it looks up the slot in the IRQ array and if it finds a notification cap with send rights, the kernel knows to deliver a signal to that cap. (The seL4_IRQHandler_SetNotification invocation is what sets the notification cap in the IRQ handler slot.) So doing a similar thing with the VCPU would insert a VCPU cap into the IRQ array with a similar invocation. Then when the kernel received a physical IRQ, it could try to inject it into the VCPU if it found a cap. If there was no spare slot in the LRs it could fall back to a VCPU fault delivered to the fault handler. Additionally, this would give the kernel the context needed to be able to mark the virq with the HW flag required to auto-deactivate the interrupt at the end.
Agree, but we can make it easier: if we have the overflow bitmap, we don't need to send the signal; we always inject into the LRs, and if that's not possible we set the corresponding bit in the bitmap of overflow vIRQs.
One reason this would be hard to implement though is if the VCPU wasn't active and associated with the currently running thread, it's not immediately obvious what the kernel should do, but it's likely worth figuring this out.
Direct injection of the vtimer IRQ since when triggered it belongs to the current vCPU.
I think if the vCPU is not active then we should mask the vtimer; when it fires it always belongs to the current vCPU, so if there is currently a (non-vCPU) thread running, the timer should stay masked until the vCPU gets scheduled again.
Using the IMASK bit of CNTV_CTL.
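A sketch of what masking via that bit could look like, assuming AArch64 and direct access to CNTV_CTL_EL0; this is a simplification, since in practice the vtimer control register would be saved/restored as part of the vCPU state:

#include <stdbool.h>
#include <stdint.h>

#define CNTV_CTL_IMASK (1ull << 1)  /* IMASK is bit 1 of CNTV_CTL_EL0 */

static inline void vtimer_set_imask(bool masked)
{
    uint64_t ctl;
    asm volatile("mrs %0, cntv_ctl_el0" : "=r"(ctl));
    if (masked) {
        ctl |= CNTV_CTL_IMASK;   /* timer condition no longer asserts the IRQ */
    } else {
        ctl &= ~CNTV_CTL_IMASK;  /* unmask */
    }
    asm volatile("msr cntv_ctl_el0, %0" :: "r"(ctl));
}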
Once there's a way to auto-deactivate the vtimer IRQ then I think this would make sense.
For this, based on my experience, I have done it in two different ways and I'm fine with both:
(i) Using the HW bit set to 1 to map the vIRQ to the pIRQ.
When the timer vIRQ gets deactivated, the pIRQ also gets deactivated. The problem here is that if there are multiple vCPUs running on the same pCPU, then at save/restore we need to set/clear the active state of the pIRQ, which is a bit weird but works fine.
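A sketch of what such an LR value could look like for GICv2; the field positions follow the GICH_LR layout, priority handling is omitted, and the helper name is an assumption:

#include <stdint.h>

#define LR_HW        (1u << 31)               /* HW bit: map vIRQ to pIRQ */
#define LR_STATE_PND (1u << 28)               /* state field = pending */
#define LR_PIRQ(x)   (((x) & 0x3ffu) << 10)   /* physical interrupt ID */
#define LR_VIRQ(x)   ((x) & 0x3ffu)           /* virtual interrupt ID */

static inline uint32_t make_hw_lr(uint32_t virq, uint32_t pirq)
{
    /* The vIRQ is delivered as pending; the guest's EOI deactivates pirq
     * directly, so no maintenance trap is needed for this interrupt. */
    return LR_HW | LR_STATE_PND | LR_PIRQ(pirq) | LR_VIRQ(virq);
}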
(ii) vIRQ completely separated from pIRQ.
We receive the vtimer pIRQ in the kernel, then inject a vIRQ into the vCPU without the EOI bit set to 1 (no problem with deactivation; it eventually gets deactivated and we don't care when), then we mask the timer using the IMASK bit of CNTV_CTL, and we deactivate the pIRQ in the kernel.
This requires the guest to unmask the timer again; Linux does it, and if the guest doesn't unmask it then this breaks, but today most OSes do it. Also, most hypervisors take this approach because it's simpler, but on the other hand it slightly breaks the generic timer spec (since the guest will notice the IMASK bit being automatically set).
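Putting approach (ii) together as a sketch: vgic_inject_virq(), deactivate_pirq() and VTIMER_VIRQ are placeholders rather than existing seL4 functions, and vtimer_set_imask() is the helper sketched earlier:

/* Hypothetical kernel-side handling of the physical vtimer IRQ. */
static void handle_vtimer_pirq(vcpu_t *vcpu, irq_t irq)
{
    /* Inject the timer vIRQ without the EOI-trap bit set; we don't care
     * when the guest eventually deactivates it. */
    vgic_inject_virq(vcpu, VTIMER_VIRQ, /* eoi_trap = */ false);

    /* Mask the vtimer (CNTV_CTL.IMASK) so the level-triggered condition
     * stops firing; the guest unmasks it again when it reprograms the timer. */
    vtimer_set_imask(true);

    /* The physical IRQ can be deactivated immediately in the kernel. */
    deactivate_pirq(irq);
}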
Separation of Priority-drop from Deactivation (Feature supported by GICv2/3/4).
Yes, I've been trying to find time to set up some benchmarks to measure and show the impact of the changes. (And then figure out all the edge cases to migrate the current design.)
Is this what's implemented in #952? If so, I didn't measure any significant change in system performance when doing some simple IRQ heavy tests in a Linux VM (I think I did something like drop all caches and do time ls -R on /). That said, the change makes sense even if it doesn't make a big difference in performance.
Agree, this will not be noticeable for this particular feature alone, but it will enable several other improvements on top of it. IRQ handling for guest OSes is currently slow; we need to change it in order to be competitive.
For tracking IRQ enabled/disabled state and pending/active?
We can discuss emulation as well; let's leave it for the end :) An overflow list per vCPU is the main requirement. A vCPU object is currently 2^12 bytes in size, so we have enough space to place our bitmap of overflow vIRQs.
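A quick sizing check for that claim (pure arithmetic, assuming the 2-bit-per-vIRQ layout sketched earlier):

/* 1024 vIRQs * 2 bits = 2048 bits = 256 bytes, which easily fits in a
 * 2^12-byte (4 KiB) vCPU object alongside its existing state. */
_Static_assert(1024 * 2 / 8 == 256, "overflow map is 256 bytes");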
Thanks for the comments ;)
It seems like this would also require the kernel to handle WFI/WFE trapping in the guest? Currently, if WFIs are trapped, the VCPU thread becomes blocked on a fault EP until its fault handler resumes it after a timeout or injection of an interrupt. If the kernel is injecting the IRQs automatically, the VCPU thread would need to have its fault replied to.
This sounds like a fairly big change, so it would need funding to verify.
In particular changes like a new object binding sound expensive. Notification binding was one of the examples that was very expensive to verify in comparison to the relatively small amount of code that was changed. Before we even discuss that, we should explore all other options.
The other parts don't sound so bad, but they accumulate, so my objective would be to minimise change while still supporting what you need. There is a list of proposed features, and people with a virtualisation background probably don't need more than that, but for me it's currently unclear what problems we're trying to solve, what the features improve specifically, and what other points there are in the design space to achieve that.
One thing we probably should also consider is how GICv4 with directly injected virtual LPIs will reduce the overhead.
There is a list of proposed features, and people with a virtualisation background probably don't need more than that, but for me it's currently unclear what problems we're trying to solve, what the features improve specifically, and what other points there are in the design space to achieve that.
I think we can do it step by step, we don't need to add all features at once. Even if not all features are accepted, at least a subset of them would be great already.
The first problem is related to hardware: seL4 today does not properly leverage the ARM virtualization extensions. There is some support, but it currently lacks the main features of GICv2/3. Also, the way seL4 uses them is not the way GICv2/3 was designed to be used. GICv2/3 is a perfect match for seL4 with the separation of priority drop from deactivation feature.
The second problem refers to:
- https://arxiv.org/abs/2303.11186 (Shedding Light on Static Partitioning Hypervisors for Arm-based Mixed-Criticality Systems)
- https://github.com/seL4/seL4/issues/663
- The above bullets show some indicators of slowdowns in seL4. The kernel has an influence on this, not only the VMM part or even CAmkES. Please check the interrupt part of both.
There are more problems (that can be solved), but let's start with these two.
If there were a very limiting reason not to solve these issues, I'd be okay with that. But today, other than verification, I don't see any blockers there. I'd like to help make seL4 (as a hypervisor) as great as Xen and KVM, performance-wise.
From the last seL4 summit, I got the impression we want "seL4 everywhere". I'd say let's make it happen!
Do you know what the state of verification of seL4 running as a hypervisor on AArch64 is?
One thing we probably should also consider is how GICv4 with directly injected virtual LPIs will reduce the overhead.
We don't see many platforms with GICv4 today. Most likely this reduces overhead, but ideally all software problems should be fixed on the software side; otherwise it would just be a workaround.
Also GICv4 is not supported by seL4 today.
There is no strong reason for not doing so. Simple features :)
It seems like this would also require the kernel to handle WFI/WFE trapping in the guest? Currently, if WFIs are trapped, the VCPU thread becomes blocked on a fault EP until its fault handler resumes it after a timeout or injection of an interrupt. If the kernel is injecting the IRQs automatically, the VCPU thread would need to have its fault replied to.
Right, something to think about ;) Is that currently implemented in the CAmkES VMM?
Right, something to think about ;) Is that currently implemented in the CAmkES VMM?
Not really; there are parts for handling WFI faults and resuming on IRQ injection, but it's missing emulation of the vtimer compare registers when the guest isn't running.