
Can we push to make `IOFiber` fit in 64 bytes?

vasilmkd opened this issue 2 years ago · 2 comments

As of Cats Effect 3.5.0-RC5, this is the memory layout of IOFiber:

```
cats.effect.IOFiber object internals:
OFF  SZ                                TYPE DESCRIPTION               VALUE
  0   8                                     (object header: mark)     N/A
  8   4                                     (object header: class)    N/A
 12   4                                 int AtomicBoolean.value       N/A
 16   4                                 int IOFiber.masks             N/A
 20   1                                byte IOFiber.resumeTag         N/A
 21   1                             boolean IOFiber.canceled          N/A
 22   1                             boolean IOFiber.finalizing        N/A
 23   1                                     (alignment/padding gap)   
 24   4      scala.collection.immutable.Map IOFiber.localState        N/A
 28   4   scala.concurrent.ExecutionContext IOFiber.currentCtx        N/A
 32   4              cats.effect.ArrayStack IOFiber.objectState       N/A
 36   4              cats.effect.ArrayStack IOFiber.finalizers        N/A
 40   4           cats.effect.CallbackStack IOFiber.callbacks         N/A
 44   4                    java.lang.Object IOFiber.resumeIO          N/A
 48   4        cats.effect.unsafe.IORuntime IOFiber.runtime           N/A
 52   4      cats.effect.tracing.RingBuffer IOFiber.tracingEvents     N/A
 56   4                               int[] IOFiber.conts             N/A
 60   4          cats.effect.kernel.Outcome IOFiber.outcome           N/A
 64   4                      cats.effect.IO IOFiber._cancel           N/A
 68   4                      cats.effect.IO IOFiber._join             N/A
Instance size: 72 bytes
Space losses: 1 bytes internal + 0 bytes external = 1 bytes total
```
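
For reference, a dump like this can be produced with OpenJDK's JOL tool. A minimal sketch, assuming a dependency on `org.openjdk.jol:jol-core` (`Class.forName` is used because `IOFiber` is not a public class):

```scala
import org.openjdk.jol.info.ClassLayout

object FiberLayout extends App {
  // Prints field offsets, sizes, and padding, as in the dump above.
  // Compressed oops (the JVM default on heaps under 32 GB) account for
  // the 4-byte reference fields.
  println(ClassLayout.parseClass(Class.forName("cats.effect.IOFiber")).toPrintable())
}
```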

This is already amazing, considering that we were pushing well over 100 bytes around CE 3.3. Does it even make sense to try to get it any lower? My thinking is that x86 CPUs commonly have 64-byte cache lines, so a fiber that fits in a single line could be pulled into cache in one fetch. This is probably not true for ARM silicon.

Some ideas on the data structures: `localState` seems very rarely used; IMO, it doesn't need to be a direct field. Could it hold a permanent place as the 0th index of `objectState`? Could this trick also be used to store other data? On the surface, the only problem I can foresee is that it would force the backing array of `objectState` to be initialized earlier than it is now. Currently, it is initialized when the fiber is first run, not when it is allocated. That could be a problem.
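
A minimal sketch of the accessors this would imply, assuming hypothetical indexed `getAt`/`setAt` operations on `ArrayStack` (the real `cats.effect.ArrayStack` API may differ) and that slot 0 is reserved when the stack is initialized:

```scala
import cats.effect.IOLocal

// Sketch only: getAt/setAt are hypothetical indexed accessors on ArrayStack.
// Slot 0 of objectState would be permanently reserved for the IOLocal state
// map, so localState no longer needs its own 4-byte field on IOFiber.
private[this] def localState: Map[IOLocal[_], Any] =
  objectState.getAt(0).asInstanceOf[Map[IOLocal[_], Any]]

private[this] def localState_=(s: Map[IOLocal[_], Any]): Unit =
  objectState.setAt(0, s)
```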

vasilmkd · May 01 '23 14:05

Another idea I have is to try and merge `suspended` (`AtomicBoolean#value`) with `IOFiber#outcome`.

`AtomicBoolean` is a bit of a deceptive class: it doesn't actually offer any optimization over `AtomicReference`, only a nicer Java API.

Java Object Layout confirms this: the `value` field is a 4-byte `int`, the same size as a compressed reference. This is more or less mandated by the minimum width of atomic hardware operations, which typically don't operate on single bytes but require at least 4 bytes (there are also wider atomic instructions).

As to how the merge of `suspended` and `outcome` would go, I'm not exactly sure. `IOFiber` would extend `AtomicReference[AnyRef]`; we would add two special sentinels, `val Running = new Object()` and `val Suspended = new Object()`, and the final value of the atomic reference, once the fiber finishes, would be an instance of `OutcomeIO[A]`, the same one that currently goes in `outcome`.
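
A minimal sketch of that state machine, purely illustrative (class and method names are assumptions, not the actual `IOFiber` internals):

```scala
import java.util.concurrent.atomic.AtomicReference
import cats.effect.IO
import cats.effect.kernel.Outcome

object FiberState {
  val Running: AnyRef = new Object()   // fiber is executing or scheduled
  val Suspended: AnyRef = new Object() // fiber is parked, awaiting a callback
}

// One AtomicReference replaces both the `suspended` flag and the `outcome` field.
final class SketchFiber[A] extends AtomicReference[AnyRef](FiberState.Running) {

  // Resuming succeeds only if the fiber is actually suspended; this CAS
  // stands in for the old AtomicBoolean-based compareAndSet(true, false).
  def tryResume(): Boolean =
    compareAndSet(FiberState.Suspended, FiberState.Running)

  // Terminal transition: the same slot now holds the fiber's outcome.
  def complete(oc: Outcome[IO, Throwable, A]): Unit =
    set(oc)

  // The outcome, or null while the fiber is still running or suspended.
  def outcomeOrNull: Outcome[IO, Throwable, A] = get() match {
    case FiberState.Running | FiberState.Suspended => null
    case oc => oc.asInstanceOf[Outcome[IO, Throwable, A]]
  }
}
```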

vasilmkd · May 01 '23 15:05

This is a very interesting thought tbh, particularly tying it to the 64 byte cache line. I think it's absolutely worth trying.

I would start by running JFR quickly on a typical application to get an idea of how often we access each field. The general technique you're talking about here would be to take those fields and shove them into an object indirection. For example, something like:

```scala
// startIO and rt are assumed in scope from the enclosing fiber's constructor
private final class FiberIndirection {
  var resumeIO: IO[Any] = startIO
  val runtime: IORuntime = rt
}
```

Then in `IOFiber` you would have a `private[this] val indirection: FiberIndirection = new FiberIndirection`. You can even toss in an `import indirection._` to keep it all nice and tidy. This would collapse those 8 bytes of fields down to a single 4-byte reference, at the cost of a pointer chase every time we access those fields. We basically only need to vacuum up the three least-accessed fields into this type of indirection in order to get down to the 64-byte window (12 bytes of fields removed, 4 bytes added back for the indirection reference, a net saving of 8 bytes: 72 down to 64). A sketch follows.
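
A sketch of the wiring, with assumed names (`startIO` and `rt` stand in for the fiber's constructor parameters):

```scala
import cats.effect.IO
import cats.effect.unsafe.IORuntime

// Sketch: wiring the indirection into the fiber (names assumed).
final class IOFiber[A](startIO: IO[Any], rt: IORuntime) {
  // Nested here so startIO/rt are in scope; a real version would likely take
  // them as constructor parameters to avoid the inner class's outer pointer.
  private final class FiberIndirection {
    var resumeIO: IO[Any] = startIO   // least-accessed fields live here
    val runtime: IORuntime = rt
  }

  // One 4-byte (compressed) reference replaces two 4-byte fields.
  private[this] val indirection: FiberIndirection = new FiberIndirection
  import indirection._ // resumeIO and runtime resolve through one pointer chase

  // ...existing uses of resumeIO and runtime compile unchanged...
}
```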

So I think if we figure out what the least-accessed stuff looks like, shove it into the indirection, then benchmark the results, that should give us a pretty definitive answer as to whether this is worth it. If the cache line thing is kicking in, we should see a pretty noticeable jump in performance, since it would mean that any fiber action requires one cache line fill + whatever-else (usually a pointer dereference and another line fill), as opposed to one or two line fills + whatever-else.

Btw, we might also be able to achieve these same benefits just by reordering the field layout so that the least-accessed things dip below the 64-byte horizon.

djspiewak · May 03 '23 16:05