
Unbounded TLB WorkQueue

Open hunhoffe opened this issue 2 years ago • 0 comments

Describe the bug

The number of requests that can be pending on the TLB WorkQueue is currently not bounded, so the queue can fill up. It holds both shootdown requests (which are important to handle) and advance-replica work requests (which are less important to handle).

Currently, if enqueue is called while the queue is full, the error is ignored: https://github.com/vmware-labs/node-replicated-kernel/blob/fc25186d57ca400c8e4a7cb313deb8eabd21d971/kernel/src/arch/x86_64/tlb.rs#L112

If the error is checked instead of ignored, it becomes clear that requests can be dropped when the queue is full.
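To illustrate the failure mode, here is a minimal user-space sketch, not the kernel's code. `FixedQueue`, `WorkItem`, `enqueue_ignoring`, and `enqueue_checked` are hypothetical names invented for this example; only the general pattern (discarding the result of a fallible enqueue versus unwrapping it with `expect`) is taken from the description above.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the two kinds of requests named in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WorkItem {
    Shootdown(u64),        // important: should never be lost
    AdvanceReplica(usize), // less important
}

/// Hypothetical fixed-capacity queue (an assumption, not the kernel's queue type).
struct FixedQueue {
    items: VecDeque<WorkItem>,
    capacity: usize,
}

impl FixedQueue {
    fn new(capacity: usize) -> Self {
        FixedQueue { items: VecDeque::with_capacity(capacity), capacity }
    }

    /// Hands the item back to the caller when the queue is full.
    fn push(&mut self, item: WorkItem) -> Result<(), WorkItem> {
        if self.items.len() >= self.capacity {
            return Err(item);
        }
        self.items.push_back(item);
        Ok(())
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}

/// The result is discarded, so a full queue silently drops the request.
fn enqueue_ignoring(q: &mut FixedQueue, item: WorkItem) {
    let _ = q.push(item);
}

/// The result is unwrapped, so a full queue panics instead of losing work.
fn enqueue_checked(q: &mut FixedQueue, item: WorkItem) {
    q.push(item).expect("work queue is full; request would be lost");
}
```

With a capacity of 2, five calls to `enqueue_ignoring` leave only two items in the queue and report nothing, whereas the third call to `enqueue_checked` would panic.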

Reproduction steps

  1. Change the linked line to check for enqueue failure (use expect to unwrap the result)
  2. Run the fxmark benchmark with roughly 96 cores
  3. Most of the time, the run panics on a failed enqueue.

Expected behavior

The queue should have a theoretical bound on the number of pending requests, so that we can guarantee an enqueue always succeeds. This property matters because shootdowns must never be lost.
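One possible way to obtain such a bound is sketched below. It rests on an assumption that the issue does not state: that the number of shootdowns outstanding at any time is itself bounded (for example, by the core count). Under that assumption, reserving that many slots for shootdowns and capping advance-replica requests at the remaining capacity means a shootdown enqueue cannot fail. `ReservedQueue`, `Work`, and `EnqueueError` are hypothetical names, and this is a user-space illustration rather than a proposed patch.

```rust
use std::collections::VecDeque;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Work {
    Shootdown(u64),
    AdvanceReplica(usize),
}

#[derive(Debug, PartialEq, Eq)]
enum EnqueueError {
    /// Best-effort work was rejected to protect the shootdown reservation.
    AdvanceRejected,
    /// The assumed bound on outstanding shootdowns was violated.
    ShootdownOverflow,
}

/// Hypothetical queue with `reserved` slots that only shootdowns may use.
struct ReservedQueue {
    items: VecDeque<Work>,
    capacity: usize,
    reserved: usize,
    advances: usize, // advance-replica requests currently queued
}

impl ReservedQueue {
    fn new(capacity: usize, reserved: usize) -> Self {
        assert!(reserved <= capacity, "reservation exceeds capacity");
        ReservedQueue { items: VecDeque::with_capacity(capacity), capacity, reserved, advances: 0 }
    }

    fn enqueue(&mut self, work: Work) -> Result<(), EnqueueError> {
        let full = self.items.len() >= self.capacity;
        match work {
            // Shootdowns may use every slot; they fail only if the assumed
            // bound on outstanding shootdowns does not hold.
            Work::Shootdown(_) if full => return Err(EnqueueError::ShootdownOverflow),
            // Advance requests are capped at the unreserved part of the queue.
            Work::AdvanceReplica(_)
                if full || self.advances >= self.capacity - self.reserved =>
            {
                return Err(EnqueueError::AdvanceRejected)
            }
            Work::AdvanceReplica(_) => self.advances += 1,
            Work::Shootdown(_) => {}
        }
        self.items.push_back(work);
        Ok(())
    }

    fn dequeue(&mut self) -> Option<Work> {
        let work = self.items.pop_front();
        if let Some(Work::AdvanceReplica(_)) = work {
            self.advances -= 1;
        }
        work
    }
}
```

The design choice here is to let the less important advance-replica requests absorb all rejections. Whether dropping them is acceptable, and what the real bound on outstanding shootdowns is, would need to be confirmed against the kernel's protocol.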

Additional context

No response

hunhoffe, Apr 05 '23 03:04