akka.net ActorGraphInterpreter Box Caching

Fixes #

Changes

Adds Box-Pools for various struct messages used by ActorGraphInterpreter

Checklist

For significant changes, please ensure that the following have been completed (delete if not relevant):

[ ] This change follows the Akka.NET API Compatibility Guidelines.
[ ] This change follows the Akka.NET Wire Compatibility Guidelines.
[ ] I have reviewed my own pull request.
[ ] Design discussion issue #
[ ] Changes in public API reviewed, if any.
[ ] I have added website documentation for this feature.

Latest `dev` Benchmarks

TBD after feedback

This PR's Benchmarks

TBD after feedback

Mar 11 '24 16:03 to11mtm

Waiting for feedback on approach before I provide benchmarks (Also have to make sure we have some good ones)

But local testing on ValueTask branch indicates a 5-15% reduction in memory allocation on even existing async calls, with equal or improved performance throughout.

Mar 11 '24 16:03 to11mtm

@to11mtm I'll get to this this week with an initial review, but looks promising

Mar 11 '24 18:03 Aaronontheweb

If you could help explain the approach for why we're doing it this way, I'd have an easier time understanding this code some. Do you mind?

Of course and I apologize for terseness in this POC, the lift vs benefit got me a bit excited. :)

Right now, ActorGraphInterpreter is using a bunch of structs in it's internal processing flow. This causes boxing on .Tell() and some vagaries of struct unboxing in large switch type checks (I'll try to find some data, it's in my notes somewhere)

What this PR does, is change some things up to minimize allocations.

ActorGraphInterpreter.Resume is made into a class, as when the instance is created, we are setting it as a readonly anyway, so this gets rid of some boxing where it is involved (i.e. any async ops, need to check if others)
ActorGraphInterpreter.AsyncInput, ActorGraphInterpreter.OnNext, and ActorGraphInterpreter.RequestMore get their own special boxes, and we pool them, via thin wrapper around ConcurrentQueue<T>

The main purpose of the structure of said boxes, was to minimize overall code changes; since IBoundaryEvent is required to get into that function, It seemed like a safe 'hack' that was overall cleaner.

Where the 'maybes' come in:

Is there a better pool to use than ConcurrentQueue<T> in this fashion.
1. Will note, the choice of that pool was with intent of avoiding any locks and minimizing interlocked operations, maybe there's a better way to do that with a little more allowance for waste?
Do we want to expose settings for pool sizes?
1. Probably.
Should I look into what perf hits do or do not exist if we genericize the boxing to minimize code bloat

Mar 12 '24 23:03 to11mtm

@to11mtm got it - benchmark it and let's see if it makes any difference. I'd probably prefer to just move everything to classes than having clever object pooling stuff that can bite us in the butt. Alternatively, maybe we can use generics or something else to stick with structs and eliminate boxing OR allocations OR pooling altogether.

Mar 14 '24 00:03 Aaronontheweb

@to11mtm got it - benchmark it and let's see if it makes any difference.

Generic Boxes looks... IDK They do seem to somehow add some extra jitter to benchmarks 🤷‍♂️ but the reduced code bloat is worth it, unless the benchmarks

Actual benchmark numbers are in next reply (trying to get better about bench spam), I had to re-run everything with a more 'fair' pool abstraction (original one was too ugly even for my taste, new one I'm re-benching before commit is fair, but still probably can be improved.)

That said in general stuff looks... curious. I have bad jitter luck on my box, which complicates comparing branches at times.

I think there's something small I'm missing, as I see 'equal or slightly better' perf, or 'maybe slightly worse or test jitter' perf... memory allocation is definitely improved though

I'd probably prefer to just move everything to classes than having clever object pooling stuff that can bite us in the butt.

Well, the pooling still helps; If we used classes but didn't pool, we may save on unbox vagaries but still will be allocating. I just am not good at writing a pool that is safe and fast enough for everyone to be happy with. 😅.

And, Ironically, I have to say that the 'box pool' made this easy to do safely. We pull the struct out of the 'box', pass it to the next thing, the box use is all 'internal' to ActorGraphInterpreter.cs, blast radius stays minimized. When we try to pool a class explicitly, yes that can be faster but it does also add a lot of complication to doing it and being able to sleep at night.

I should also add, after some looking, Box-pools -may- be a solution for some other challenges we have around GetAsyncCallback<T>, namely, avoiding box on the value type result case... Seeing a lot of potential there...

Mar 17 '24 21:03 to11mtm

After:

Method	Mean	Error	StdDev	Gen0	Allocated
RunSelectAsync	31.62 us	0.334 us	0.313 us	3.8452	17.83 KB
RunSelectAsyncSync	29.08 us	0.322 us	0.301 us	3.8757	17.83 KB

Method	Mean	Error	StdDev	Gen0	Allocated
UnfoldAsyncYieldInConsume	173.7 us	1.46 us	1.37 us	8.7891	40.49 KB
UnfoldAsyncYieldInPush	154.5 us	2.89 us	2.97 us	7.5684	34.25 KB

Method	Mean	Error	StdDev	Gen0	Allocated
UnfoldResourceAsyncNoYield	252.06 us	4.953 us	6.264 us	14.1602	67320 B
UnfoldResourceAsyncWithYield	223.95 us	3.978 us	3.721 us	12.9395	60929 B
StraightChannelReadNoYield	67.54 us	1.333 us	1.182 us	-	256 B
StraightChannelReadWithYield	80.44 us	0.727 us	0.645 us	-	269 B

before

Method	Mean	Error	StdDev	Gen0	Allocated
RunSelectAsync	31.83 us	0.374 us	0.350 us	3.9063	17.94 KB
RunSelectAsyncSync	29.25 us	0.360 us	0.337 us	3.9063	17.94 KB

Method	Mean	Error	StdDev	Gen0	Allocated
UnfoldAsyncYieldInConsume	169.8 us	1.64 us	1.37 us	9.7656	45.23 KB
UnfoldAsyncYieldInPush	154.1 us	1.17 us	1.04 us	7.8125	38.99 KB

Method	Mean	Error	StdDev	Gen0	Allocated
UnfoldResourceAsyncNoYield	242.33 us	2.044 us	1.812 us	16.3574	76713 B
UnfoldResourceAsyncWithYield	211.84 us	1.311 us	1.162 us	15.1367	70322 B
StraightChannelReadNoYield	65.31 us	0.293 us	0.274 us	-	256 B
StraightChannelReadWithYield	82.70 us	0.678 us	0.634 us	-	271 B

Mar 17 '24 21:03 to11mtm

akka.net akka.net copied to clipboard

ActorGraphInterpreter Box Caching

Changes

Checklist

Latest dev Benchmarks

This PR's Benchmarks

akka.net
akka.net copied to clipboard

Latest `dev` Benchmarks