GC Pressure
Hi Florian, getting a playable build ready for my application to the GDC Experimental Gameplay Workshop. I've hit an optimization bottleneck with GC pressure from MAGES, and I was wondering if there's anywhere that we can trim it down.
Here's a screenshot from Unity's profiler, with the garbage allocated per frame on the far right and the number of calls next to that.
If I'm reading this right, it seems like the cumulative weight of the local variables that get declared on the heap for each anonymous function are really adding up. Is that correct?
I'm not sure what MAGES code you are running against, but from my perspective it seems alright. The only thing I am confused about is the compile; are you really compiling once per frame?! I would rather only compile on a change (and that may happen every 60 to 300 frames on average).
I'm running the MAGES dev branch, v1.3.0.
Just double-checked, I definitely only compile when the function source is changed. Actually it would be great if I had messed that up because that would be way simpler...
I think this is the issue:
If you use local variables in a lambda, they need to live on the heap. The lambda might be used after the function which created it exits. Normal local variables (living on the stack/registers) become invalid when the function exits, so they can't be used here.
So the C# compiler creates a class to hold captured local variables. That's the one you're seeing.
Note that C# captures the actual variable, not its current value. So conceptually it's captured by reference. The semantics of capturing mean that the compiler needs to create one container object per scope.
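To make the capture behavior concrete, here is a minimal standalone C# sketch (not MAGES code) contrasting a capturing lambda, which forces a heap-allocated closure object each time the enclosing method runs, with a non-capturing delegate that can be created once and reused:

```csharp
using System;

static class CaptureDemo
{
    // Captures the local 'offset', so the compiler generates a closure class
    // ("display class") and allocates one instance of it on the heap every
    // time this method is called.
    static Func<double, double> MakeCapturing(double offset)
    {
        return x => x + offset;
    }

    // Captures nothing, so a single delegate instance can be cached in a
    // static readonly field and reused without any per-call allocation.
    static readonly Func<double, double> Cached = x => x + 8.0;

    static void Main()
    {
        var f = MakeCapturing(-8.0);      // allocates closure + delegate
        Console.WriteLine(f(10.0));       // -> 2
        Console.WriteLine(Cached(10.0));  // reuses the cached delegate -> 18
    }
}
```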
The graph shown here in the profiler samples 82 points per frame (which is why those call counts are all multiples of 82). If each operation in MAGES is implemented with at least one lambda, and each of those puts even a few bytes on the heap, it seems plausible that 82 samples of even a simple function (in this case `((x + -8)/4)^2 + -16`) would end up tossing out almost 70KB of garbage per frame.
At 60fps that adds up quick, and thanks to the awful garbage collection in Mono 2.6 the result is big hangs in scenes that exceed a certain threshold of realtime graphing :(
EDIT: I said anonymous functions and I really meant lambdas.
All the internal functions are using the `Curry` helpers, which make heavy use of lambdas. So yes, all of them are also capturing local variables, i.e., they will create helper objects (from an anonymous class) to store these. I guess I can optimize this quite heavily!
Thanks for spotting this!
Is there any deadline for the v1.6.0 release, i.e., by when should the optimizations be available?
Great! This is huge for my current performance constraints. Replacing all the lambdas in the runtime sounds like a huge pain in the butt, but hey at least it's a fixable problem.
It would be nice to have it in early February if possible. GDC starts on Feb. 27th, so I'd be grateful any time you can give me before then to produce content without the GC constraint.
This is the current status:

- [x] Implemented fast add, sub, ... (skipping currying) for the operator usage
- [ ] Improve the `Curry.X` helpers (may not be possible by much, as they need capturing; but see above: this won't affect the bare op. perf. anymore)
- [x] Improve the `If.Is` calls (explicitly cache delegate instances)
- [x] Initialize the stack to 64 elements (should prevent most `Array.Resize` calls)
As far as the delegate instance caching is concerned, this is normally done by the .NET JIT, so either Unity works differently here or this optimization was turned off. The debug mode could have switched it off. Either way, with the explicit caching we don't have to rely on the JIT here.
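For illustration, a minimal standalone sketch of the explicit caching pattern (the helper names here are made up for the example and are not the actual MAGES internals):

```csharp
using System;

static class DelegateCachingSketch
{
    // Before: a non-capturing lambda passed inline. Depending on the compiler
    // and runtime settings this delegate may or may not be cached, so the safe
    // option is not to rely on it.
    static double SumBefore(double[] values)
    {
        return Aggregate(values, (a, b) => a + b);
    }

    // After: the delegate instance lives in a static readonly field, so it is
    // allocated exactly once no matter how often SumAfter runs.
    static readonly Func<double, double, double> Add = (a, b) => a + b;

    static double SumAfter(double[] values)
    {
        return Aggregate(values, Add);
    }

    static double Aggregate(double[] values, Func<double, double, double> op)
    {
        var result = values[0];
        for (var i = 1; i < values.Length; i++)
        {
            result = op(result, values[i]);
        }
        return result;
    }

    static void Main()
    {
        var data = new[] { 1.0, 2.0, 3.0 };
        Console.WriteLine(SumBefore(data)); // 6
        Console.WriteLine(SumAfter(data));  // 6
    }
}
```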
Wow, that was fast. I'll recompile MAGES and give it a shot!
The `GetcOperation` may also be improved by pre-caching the arguments array. This, however, was not done in order to prevent race conditions / bugs when used from multiple threads. In your case (single-threaded) it could be very beneficial though. Currently, I am thinking about how this can be improved without race conditions or locking (a lock would be problematic for the execution time performance).
Hmm. On the one hand, I'm not planning on implementing multithreading in the near future. On the other hand, I would like to at some point and data races are evil.
The changes haven't been committed yet, right? GitHub doesn't indicate changes to either branch.
Up now!
Yikes. Initializing the stack to 64 elements spiked garbage from the `ExecutionContext` constructor from 2.6kB to 44kB. Might be best for me to just eat the cost of resizing the stack...

Otherwise garbage has dropped from about 67kB to 57kB. Looks like most of that is coming from `GetcOperation`. Here's the profiler after reverting back to the default stack initializer:
I am not quite sure what the profiler is seeing / Unity is doing here. For instance, all the delegate instances are `static readonly` fields, i.e., there is no allocation per call going on. The `GetcOperation` memory is consumed by the argument arrays as explained above.
Potentially, a stack size of 8 (or 16) would already be sufficient. Try these and check the required resize operations.
Still, I am not sure what MAGES code you are testing against (apparently it consists of an add, a negative number, plus power and division operators). It could make sense to have an optimizer in the compilation step, which performs trivial operations up front (thus eliminating such operations from being performed each frame).
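As a rough illustration of that optimizer idea (these node types are invented for the example and are not MAGES's actual AST or compiler), constant folding over a tiny expression tree could look like this:

```csharp
using System;

// Toy expression tree with a constant-folding pass; trivial sub-expressions
// are evaluated once up front instead of on every frame.
abstract class Expr
{
    public abstract Expr Fold();
}

sealed class Const : Expr
{
    public readonly double Value;
    public Const(double value) { Value = value; }
    public override Expr Fold() => this;
}

sealed class Variable : Expr
{
    public override Expr Fold() => this;
}

sealed class Add : Expr
{
    public readonly Expr Left, Right;
    public Add(Expr left, Expr right) { Left = left; Right = right; }

    public override Expr Fold()
    {
        var l = Left.Fold();
        var r = Right.Fold();

        // Both operands are constants: evaluate at compile time.
        if (l is Const cl && r is Const cr)
        {
            return new Const(cl.Value + cr.Value);
        }

        return new Add(l, r);
    }
}

static class FoldingDemo
{
    static void Main()
    {
        // (2 + 3) + x folds to 5 + x, so the constant part is computed only once.
        var expr = new Add(new Add(new Const(2), new Const(3)), new Variable());
        var folded = (Add)expr.Fold();
        Console.WriteLine(((Const)folded.Left).Value); // 5
    }
}
```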
Sorry, should have specified. I've been testing on the same scene every time, which has a single graph sampling `((x + -8)/4)^2 + -16` at 82 points per frame.
I'll fiddle with different initial stack sizes and see what works best.
I'm more concerned about the overhead from `GetcOperation`, as that's where the majority of the GC pressure is coming from. Is pre-caching the arguments array simple enough for me to informally splice it in? Take your time figuring out if/how that can be addressed without breaking thread safety, but if a quick workaround is possible there it would help a lot, since I'm on a single thread anyway.
@SigmaEpsilonChi Did you make any additional progress on this? I'm currently evaluating expression parsers for Unity, and MAGES seems like it may be a good option for performance purposes. @FlorianRappl I would also be curious about some hints on implementing pre-caching for the arguments array?
Off topic, but @SigmaEpsilonChi did you ever successfully transpile your project that uses MAGES to other Unity targets (such as iOS or WebGL)?
The essential idea regarding the pre-caching for the arguments array was to have a (global, i.e., `static`) array of `Object[]`, where the position in the array is the length of the arguments, i.e.:
- 0 = new Object[0]
- 1 = new Object[1]
- 2 = new Object[2]
- ...
Up to a reasonable number (e.g., 16). Anything larger would either throw (ouch) or could be handled with the hit of at least one more check.

Off the top of my head I think it could work (in a single-threaded scenario; multi-threading is gone with this option!).
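A minimal single-threaded sketch of that layout (illustrative only, not the actual MAGES implementation):

```csharp
using System;

static class ArgumentPoolSketch
{
    const int MaxPooledLength = 16;

    // One reusable buffer per argument count, created once up front. Because
    // the same buffer is handed out on every call, this is only safe when
    // evaluation happens on a single thread.
    static readonly object[][] Pool = CreatePool();

    static object[][] CreatePool()
    {
        var pool = new object[MaxPooledLength + 1][];
        for (var i = 0; i <= MaxPooledLength; i++)
        {
            pool[i] = new object[i];
        }
        return pool;
    }

    static object[] Rent(int length)
    {
        // Larger argument lists fall back to a fresh allocation (the extra
        // check mentioned above) instead of throwing.
        return length <= MaxPooledLength ? Pool[length] : new object[length];
    }

    static void Main()
    {
        var args = Rent(2);
        args[0] = 3.0;
        args[1] = 4.0;
        Console.WriteLine((double)args[0] + (double)args[1]); // 7, with no per-call array allocation
    }
}
```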