akka.net
akka.net copied to clipboard
ActorGraphInterpreter Box Caching
Fixes #
Changes
Adds Box-Pools for various struct messages used by ActorGraphInterpreter
Checklist
For significant changes, please ensure that the following have been completed (delete if not relevant):
- [ ] This change follows the Akka.NET API Compatibility Guidelines.
- [ ] This change follows the Akka.NET Wire Compatibility Guidelines.
- [ ] I have reviewed my own pull request.
- [ ] Design discussion issue #
- [ ] Changes in public API reviewed, if any.
- [ ] I have added website documentation for this feature.
Latest dev Benchmarks
TBD after feedback
This PR's Benchmarks
TBD after feedback
Waiting for feedback on approach before I provide benchmarks (Also have to make sure we have some good ones)
But local testing on ValueTask branch indicates a 5-15% reduction in memory allocation on even existing async calls, with equal or improved performance throughout.
@to11mtm I'll get to this this week with an initial review, but looks promising
If you could help explain the approach for why we're doing it this way, I'd have an easier time understanding this code some. Do you mind?
Of course and I apologize for terseness in this POC, the lift vs benefit got me a bit excited. :)
Right now, ActorGraphInterpreter is using a bunch of structs in it's internal processing flow. This causes boxing on .Tell() and some vagaries of struct unboxing in large switch type checks (I'll try to find some data, it's in my notes somewhere)
What this PR does, is change some things up to minimize allocations.
ActorGraphInterpreter.Resumeis made into a class, as when the instance is created, we are setting it as a readonly anyway, so this gets rid of some boxing where it is involved (i.e. any async ops, need to check if others)ActorGraphInterpreter.AsyncInput,ActorGraphInterpreter.OnNext, andActorGraphInterpreter.RequestMoreget their own special boxes, and we pool them, via thin wrapper aroundConcurrentQueue<T>
The main purpose of the structure of said boxes, was to minimize overall code changes; since IBoundaryEvent is required to get into that function, It seemed like a safe 'hack' that was overall cleaner.
Where the 'maybes' come in:
- Is there a better pool to use than
ConcurrentQueue<T>in this fashion.- Will note, the choice of that pool was with intent of avoiding any locks and minimizing interlocked operations, maybe there's a better way to do that with a little more allowance for waste?
- Do we want to expose settings for pool sizes?
- Probably.
- Should I look into what perf hits do or do not exist if we genericize the boxing to minimize code bloat
@to11mtm got it - benchmark it and let's see if it makes any difference. I'd probably prefer to just move everything to classes than having clever object pooling stuff that can bite us in the butt. Alternatively, maybe we can use generics or something else to stick with structs and eliminate boxing OR allocations OR pooling altogether.
@to11mtm got it - benchmark it and let's see if it makes any difference.
Generic Boxes looks... IDK They do seem to somehow add some extra jitter to benchmarks 🤷♂️ but the reduced code bloat is worth it, unless the benchmarks
Actual benchmark numbers are in next reply (trying to get better about bench spam), I had to re-run everything with a more 'fair' pool abstraction (original one was too ugly even for my taste, new one I'm re-benching before commit is fair, but still probably can be improved.)
That said in general stuff looks... curious. I have bad jitter luck on my box, which complicates comparing branches at times.
I think there's something small I'm missing, as I see 'equal or slightly better' perf, or 'maybe slightly worse or test jitter' perf... memory allocation is definitely improved though
I'd probably prefer to just move everything to classes than having clever object pooling stuff that can bite us in the butt.
Well, the pooling still helps; If we used classes but didn't pool, we may save on unbox vagaries but still will be allocating. I just am not good at writing a pool that is safe and fast enough for everyone to be happy with. 😅.
And, Ironically, I have to say that the 'box pool' made this easy to do safely. We pull the struct out of the 'box', pass it to the next thing, the box use is all 'internal' to ActorGraphInterpreter.cs, blast radius stays minimized. When we try to pool a class explicitly, yes that can be faster but it does also add a lot of complication to doing it and being able to sleep at night.
I should also add, after some looking, Box-pools -may- be a solution for some other challenges we have around GetAsyncCallback<T>, namely, avoiding box on the value type result case... Seeing a lot of potential there...
After:
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| RunSelectAsync | 31.62 us | 0.334 us | 0.313 us | 3.8452 | 17.83 KB |
| RunSelectAsyncSync | 29.08 us | 0.322 us | 0.301 us | 3.8757 | 17.83 KB |
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| UnfoldAsyncYieldInConsume | 173.7 us | 1.46 us | 1.37 us | 8.7891 | 40.49 KB |
| UnfoldAsyncYieldInPush | 154.5 us | 2.89 us | 2.97 us | 7.5684 | 34.25 KB |
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| UnfoldResourceAsyncNoYield | 252.06 us | 4.953 us | 6.264 us | 14.1602 | 67320 B |
| UnfoldResourceAsyncWithYield | 223.95 us | 3.978 us | 3.721 us | 12.9395 | 60929 B |
| StraightChannelReadNoYield | 67.54 us | 1.333 us | 1.182 us | - | 256 B |
| StraightChannelReadWithYield | 80.44 us | 0.727 us | 0.645 us | - | 269 B |
before
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| RunSelectAsync | 31.83 us | 0.374 us | 0.350 us | 3.9063 | 17.94 KB |
| RunSelectAsyncSync | 29.25 us | 0.360 us | 0.337 us | 3.9063 | 17.94 KB |
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| UnfoldAsyncYieldInConsume | 169.8 us | 1.64 us | 1.37 us | 9.7656 | 45.23 KB |
| UnfoldAsyncYieldInPush | 154.1 us | 1.17 us | 1.04 us | 7.8125 | 38.99 KB |
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|---|---|---|---|---|---|
| UnfoldResourceAsyncNoYield | 242.33 us | 2.044 us | 1.812 us | 16.3574 | 76713 B |
| UnfoldResourceAsyncWithYield | 211.84 us | 1.311 us | 1.162 us | 15.1367 | 70322 B |
| StraightChannelReadNoYield | 65.31 us | 0.293 us | 0.274 us | - | 256 B |
| StraightChannelReadWithYield | 82.70 us | 0.678 us | 0.634 us | - | 271 B |