
CLR/JIT should optimize "alloc temporary small object" to "alloc on stack" automatically

Open ygc369 opened this issue 8 years ago • 90 comments

If a small object's reference is never stored in heap or static variables (only in local variables), the object is temporary, so the CLR or JIT should allocate it on the stack and free it promptly. GC pressure would then be lower. This feature is much like the escape analysis technique in the JVM. I suggested this on the Roslyn forum (https://github.com/dotnet/roslyn/issues/2104), but they said I should suggest it here.
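A minimal sketch of the pattern being requested (the types and names here are illustrative, not from the issue):

class Pair { public int Min, Max; }

static class Demo
{
    static int Diff(int a, int b)
    {
        // 'p' is only ever reachable through a local variable, so under escape
        // analysis it could be allocated on the stack and freed when the method
        // returns, instead of adding GC pressure.
        var p = a > b ? new Pair { Min = b, Max = a } : new Pair { Min = a, Max = b };
        return p.Max - p.Min;
    }
}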

category:cq theme:stack-allocation skill-level:expert cost:large

ygc369 avatar Oct 16 '15 02:10 ygc369

Thanks, this is something the team is aware of and has discussed. A specific example in a real-world workload that you can share would help us prioritize this request; we already have synthetic examples.

cmckinsey avatar Oct 16 '15 03:10 cmckinsey

You may also be interested in the following proposal and prototype implementation project: Stackalloc for reference types with Roslyn and CoreCLR [1] [2]

It introduces a generalized stackalloc with a transient variable modifier to solve the problem.

The transient variable modifier
Let's start by defining the concept of a transient variable:
  A transient variable can only be declared for a local method variable or parameter, and cannot be used on a ref/out parameter.
  A transient variable can only be assigned to another transient variable.
  A transient variable can receive a non-transient variable as long as the types match.

[1] http://xoofx.com/blog/2015/10/08/stackalloc-for-class-with-roslyn-and-coreclr/ [2] https://github.com/xoofx/StackAllocForClass
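A pseudocode sketch of the proposed usage, pieced together from the rules above (transient is the proposal's keyword, not valid C#):

// Pseudocode only -- 'transient' is NOT valid C#. The proposal's compiler
// would reject any assignment that lets the reference outlive the method,
// which is what makes stack allocation safe.
//
//   transient var list = new List<int>(4);   // candidate for stack allocation
//   list.Add(1);
//   transient var alias = list;              // ok: transient to transient
//   someField = list;                        // error: reference would escape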

zpodlovics avatar Oct 16 '15 07:10 zpodlovics

@cmckinsey

A specific example in a real-world workload that you can share would help us prioritize this request.

This might be a chicken-and-egg situation. If the optimization doesn't exist, people will try to twist the code to avoid allocations (see for example params support for IEnumerable<T>), and as a result there will be fewer optimization opportunities. On the other hand, if the optimization exists, then perhaps people will concentrate more on actually writing code instead of on figuring out how to avoid allocations.

mikedn avatar Oct 16 '15 09:10 mikedn

It introduces a generalized stackalloc with a transient variable modifier to solve the problem.

You will need the same constraints that value types have, so you may as well use a struct instead of a class. For example, you cannot have class finalizers. The only difference I see is that the compiler calls a ctor for you instead of you having to call MyStruct.Initialize() yourself.

OtherCrashOverride avatar Oct 16 '15 18:10 OtherCrashOverride

@mikedn I don't disagree, but we need some way to help assess priority, as we have other opportunities where we can also improve code quality, and this would help us prioritize. We have prototyped this kind of transformation in other experimental systems. Because of the challenges in the compiler automatically finding legal opportunities, which is even more challenging in JIT environments, my experience has been that developers who really want to avoid heap allocations end up having to make manual source changes anyway, such as using value types.

@OtherCrashOverride yes, that is correct. And there are other correctness rules the compiler must observe, for example this transformation might lead to stack-overflow if done on a recursion path.

cmckinsey avatar Oct 17 '15 00:10 cmckinsey

Did the team look at the work the Hotspot JVM team did with Escape Analysis? They have production experience in a very similar setting and should be able to answer many open questions.

You should be able to run benchmarks on the JVM with Escape Analysis turned on and off. They have a command line switch. The performance differential should carry over to the CLR.

Prime use cases would be enumerators (LINQ queries) and string processing, I think. These can generate tons of garbage. Also things like new FileInfo(path).Exists, or updating immutable data structures multiple times in the same method.
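Two of those patterns in code form (a sketch; the methods are illustrative):

using System.IO;

static class Examples
{
    // The FileInfo instance is a pure temporary: its reference never escapes,
    // so it is garbage the moment Exists returns.
    static bool FileExists(string path) => new FileInfo(path).Exists;

    // Each intermediate string is allocated, read once, and dropped.
    static string Normalize(string s) => s.Trim().ToLowerInvariant();
}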

GSPP avatar Oct 25 '15 12:10 GSPP

You should be able to run benchmarks on the JVM with Escape Analysis turned on and off. They have a command line switch. The performance differential should carry over to the CLR.

The situation is a bit more complicated than that. Java doesn't have value types, and as such it needs allocation optimizations more than .NET does. Also, it's interesting that the "Escape Analysis for Java" paper shows that for some benchmarks the percentage of objects allocated on the stack can exceed 70%, yet the execution time improvement is no more than 23%. And as far as I can tell, that also includes improvements due to the elimination of locks for objects that don't escape the thread.

Prime use cases would be enumerators (LINQ queries) and string processing, I think.

Of all possible use cases, LINQ is the least likely to benefit from such an optimization. Its reliance on interfaces means that it is difficult to inline methods or even to perform interprocedural analysis. Without inlining, the created enumerators are returned, and it's difficult to stack allocate such objects: they cannot be allocated on the callee's frame, and in any case the callee has no way of knowing that the object doesn't escape the stack; only the caller knows that. I suppose one could imagine some sort of "return allocation context" that's passed from the caller via a hidden function parameter and allows the caller to customize the way returned objects are allocated, but that's a rather crazy thing to do.

Personally, I'd be happy enough if, for example, small arrays could be allocated on the stack, so that contortions like the one described in "params support for IEnumerable" wouldn't be needed. But unfortunately the cost to achieve this might be too great.
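For primitive element types C# already offers a manual escape hatch, which hints at what an automatic version could do for small non-escaping arrays (illustrative; requires compiling with /unsafe):

static unsafe int Sum(int a, int b, int c)
{
    // Manual stack allocation, available today for primitives only; the
    // buffer disappears with the stack frame and costs the GC nothing.
    int* values = stackalloc int[3];
    values[0] = a; values[1] = b; values[2] = c;
    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += values[i];
    return sum;
}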

mikedn avatar Oct 25 '15 17:10 mikedn

The Java remarks are entirely true.


Regarding LINQ, here's an idea: if the callee is known to allocate a small, bounded amount of memory (this is generally the case for enumerator-allocating methods), and if no such pointer escapes according to some simple data flow tracing (also the case here), then temporarily change the base pointer for heap allocations to the stack. Like this:

  1. Set allocation base to free spot on the stack (at the end of it).
  2. Call allocating callee. It will now allocate everything on the stack. The allocations can be dynamic (e.g. conditional). They do not need to be statically known.
  3. Callee returns pointer (to stack memory)
  4. Caller restores allocation base pointer

This rather simple scheme should match a lot of functions. No need to inline.

Call targets must be known but that also does not require inlining. It requires the JIT to determine the return type (which is pretty much always known in the case of iterator methods) and apply that knowledge to devirtualize all calls. I do not need to inline OrderBy in order to determine that it returns exactly an OrderedEnumerable. On the other hand this will not work for Select because it is too sophisticated.

No need to inline but it requires analyzing quite a few functions (all that touch the memory allocated).


A second simple scheme would be to allocate all ref types on the stack if their address is never passed to any other method, returned, or stored to non-locals, and then have a (reliable) way to blast all aggregates allocated on the stack into registers. This scheme would catch a lot of simple helper classes like vectors, points, some strings, Utf8Strings, ...

If we can get a basic implementation of both of these optimizations, that should catch a lot. And by the way, I think that a 23% execution time improvement is a fantastic number. It depends on the workload, obviously.

GSPP avatar Oct 25 '15 17:10 GSPP

Call allocating callee. It will now allocate everything on the stack. The allocations can be dynamic (e.g. conditional). They do not need to be statically known.

It can't allocate everything on the stack; some allocations may escape, and the callee won't always know which ones. You really need two distinct allocation contexts: a "GC allocation context" and a "return allocation context". The return context is normally the same as the GC context, but the caller may replace it when possible. The callee always uses the return context to allocate the returned object. The return context cannot be used to allocate any other objects, including objects that are referenced by the returned object, since the caller may escape those objects.

A second simple scheme would be to allocate all ref types on the stack if their address is never passed to any other method, returned or stored to non-locals.

That's the "normal" scheme, though it's a bit too restrictive: calling any method of an object implies passing the object's address to the method (unless the method is inlined), so you wouldn't be able to do much with such objects. The compiler should analyze the callees a bit for best results; a call depth of less than 5 might suffice in many cases. For example, it should be possible to stack allocate the Stack object used in the example below:

void foo(BasicBlock bb) {
    var stk = new Stack<BasicBlock>();
    stk.Push(bb);
    while (stk.Count > 0) {
        bb = stk.Pop();
        // other code that calls stk.Push
    }
}

mikedn avatar Oct 25 '15 18:10 mikedn

OK, all true. I really should not specify any particular (naive) escape analysis scheme but I think it's possible to devise some rather simple scheme that kills a lot of allocations and enables developers to be more productive because they can use more helper objects.

I think a key point is that inlining does not have to occur in order to infer information about the behavior of methods. Often, escape behavior can be summarized for a method in a meaningful way ("does not store to the heap/statics at all" or "does not create a new reference from the heap/statics for argument 3"). Then we don't have a code size problem and only need to deal with compile time.

GSPP avatar Oct 25 '15 21:10 GSPP

I think a key point is that inlining does not have to occur in order to infer information about the behavior of methods.

Yes, in general inlining doesn't need to occur. It's just the "return new object" case that benefits from inlining.

mikedn avatar Oct 26 '15 05:10 mikedn

@cmckinsey Here's a practical example where escape analysis could be really helpful.

Let's have a look at this snippet in F# code:

// ReferenceEscapeTest.fs
module EscapeTest

    // a:int -> b:int -> int
    let diff a b =
        let (min, max) = if a > b then (b, a) else (a, b)
        max - min

From a functional programming standpoint this is a very primitive construction, which utilizes a tuple for dealing with two local variables in one expression. For an F# developer those are just cheap local variables, presumably allocated on the stack.

fsharpc --target:library --optimize+ ReferenceEscapeTest.fs
F# Compiler for F# 3.1 (Open Source Edition)

But when we look at the compiled IL (decompiled back to C#):

using Microsoft.FSharp.Core;
using System;

[CompilationMapping(SourceConstructFlags.Module)]
public static class EscapeTest
{
  [CompilationArgumentCounts(new int[] {1, 1})]
  public static int diff(int a, int b)
  {
    Tuple<int, int> tuple = a <= b ? new Tuple<int, int>(a, b) : new Tuple<int, int>(b, a);
    return tuple.Item2 - tuple.Item1;
  }
}

We can see that, in fact, we're unintentionally paying for a heap allocation in a basic arithmetic operation.

P.S. This might (or might not) be addressed by the F# compiler in the first place; I'm just not aware of the design decisions that led to this code generation in this particular case. But my point is that this optimization on the JIT side could definitely make a difference.

kostrse avatar Nov 04 '15 16:11 kostrse

Example code that could benefit from this is riddled all over CoreFX. E.g.: HashSet's IntersectWithEnumerable and CheckUniqueAndUnfoundElements both allocate a temporary BitHelper object. List.InsertRange allocates a temporary array. ConcurrentBag news up a SpinWait in its CanSteal method.

Also, almost all code that has a "using" statement has an object that goes out of scope and can be cleaned up immediately. No need to wait for the garbage collector: just call Dispose and free it.

like... using (var sr = new StreamReader(fileName)) {...}

And... in a try/catch clause that eats the exception, the exception can be cleaned up; no need for GC.

And... I can't count how many times I've called string.Split() on a massive amount of data. It's a shame all those arrays allocated and garbage collected unnecessarily.

Finally... as Roslyn is written in C# and needs to parse thousands of text files, I bet it creates many millions of small objects on the heap that could be collected immediately after they are tokenized and go out of scope.

To name just a few.

SunnyWar avatar Apr 11 '16 03:04 SunnyWar

Also, almost all code that has a "using" statement has an object that goes out of scope and can be cleaned up immediately. No need to wait for the garbage collector: just call Dispose and free it.

using simply calls the IDisposable interface; it does not provide any guarantee about object lifetime. Inside the using block, the object could be assigned to objects/variables that would maintain its 'liveness', and it's perfectly valid to do so despite the object being 'disposed'.

This is the reason the GC has to run to finalize the object. It has to trace to determine if an object reference is still held anywhere. This is also the point where 'escape analysis' enters the discussion.

GC.SuppressFinalize() in Dispose() should make object cleanup more efficient by eliminating a call to the finalizer during GC.

OtherCrashOverride avatar Apr 11 '16 20:04 OtherCrashOverride

@OtherCrashOverride sure, it's possible that the object is assigned to something inside the using, and in this case I would expect the compiler to see this, be conservative, and not free it. However, I claim that in 80% of cases this does not happen. Now, perhaps you are talking about the case where there are nested using statements, which is pretty common. In that case, I hope the compiler would be smart enough to unwind the stack and see that after Dispose all the objects can be freed immediately.

SunnyWar avatar Apr 11 '16 20:04 SunnyWar

It seems to me that escape analysis and call depth can get pretty complicated. I recommend that this feature be added in a super-simple way that lays the groundwork for a more sophisticated approach later.

With that in mind, a very simple case study is String.Split. It returns an allocated array, it's used in a lot of code, and the array does not contain anything that requires depth analysis. It represents a very simple example of an object that is often not used outside the scope of the calling function.

If the array returned by Split can be freed early, the same logic will apply to a lot of other objects.
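For instance (an illustrative sketch):

static int CountFields(string line)
{
    // The array and its substrings are heap allocated, read once, and
    // unreachable as soon as the method returns.
    string[] parts = line.Split(',');
    return parts.Length;
}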

SunnyWar avatar Apr 11 '16 21:04 SunnyWar

@SunnyWar Might be similar analysis already done for Ref returns and Locals? https://github.com/dotnet/roslyn/pull/8030

benaadams avatar Apr 11 '16 21:04 benaadams

@SunnyWar How could an unknown-sized array returned from a method be stack allocated? Unless Split (and SplitInternal and say InternalSplitKeepEmptyEntries) is inlined, I don't see how could that be done.

Or are you talking about some "early free from heap" approach and not "stack allocate"?

svick avatar Apr 13 '16 16:04 svick

@svick I'm talking about "early free from heap", though if "stack allocate" accomplishes the same thing then I'm all for it! Every time I see "new" on something simple that has a scoped lifetime, I wonder to myself: why does this need to be garbage collected at all? Isn't it obvious to even a brain-dead compiler that it is scoped, has no references, and was passed to no functions (in other words: is 100% safe to free)? Why not clean it up immediately? I honestly don't get why this is so difficult and wasn't done from the beginning.

That's my motivation.

SunnyWar avatar Apr 13 '16 17:04 SunnyWar

Why not clean it up immediately?

And how do you clean it up immediately? The GC heap doesn't have any mechanism that allows you to free a single object. And adding such a mechanism isn't exactly trivial.

mikedn avatar Apr 13 '16 18:04 mikedn

@mikedn

The GC heap doesn't have any mechanism that allows you to free a single object.

If the GC has no mechanism to free up a single object, then how does the GC clean up a single object?

{ var foo = myString.Split(','); } ... GC will eventually clean up foo: a single object with no references, passed to no one, with no finalizer. So why not do it immediately, without the GC overhead? What am I missing here?

SunnyWar avatar Apr 13 '16 18:04 SunnyWar

@SunnyWar this capability could be added: if the object to be freed is the last one in the current segment, the allocation pointer can simply be decremented.

There have been dynamic escape tracking schemes implemented like this. Escape analysis and tracking can use a mix of static and dynamic analysis, and they also work with dynamically sized objects and variable numbers of objects.

I think a pragmatic way to go would be to stack allocate in obvious cases. Sometimes it really is trivial to prove that no pointer escapes and that the size of all objects affected is bounded.
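A toy model of that bump-pointer idea (illustrative only; the real CLR allocator is considerably more involved):

using System;

// Toy bump-pointer segment: an "early free" is possible only when the object
// being freed is the most recent allocation; anything older must wait for GC.
class Segment
{
    private readonly byte[] _memory = new byte[1 << 20];
    private int _allocPtr;

    public int Allocate(int size)
    {
        if (_allocPtr + size > _memory.Length)
            throw new OutOfMemoryException(); // a real GC would collect here
        int offset = _allocPtr;
        _allocPtr += size;
        return offset;
    }

    public bool TryFreeLast(int offset, int size)
    {
        if (offset + size != _allocPtr)
            return false;       // object is in the middle of the segment
        _allocPtr = offset;     // decrement the allocation pointer
        return true;
    }
}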

GSPP avatar Apr 13 '16 18:04 GSPP

GC will eventually clean up foo: a single object with no references, passed to no one, with no finalizer

The GC does not work like that; it doesn't free single objects. It basically moves around all the objects that are in use to eliminate the free space that may have accumulated between them. What you suggest requires maintaining free lists, and doing that in a GC allocator is a bit of a problem. Not because it is impossible or difficult, but because not having to maintain such lists is one of the few advantages a GC allocator has over a classic allocator (though the CLR GC does maintain free lists in certain cases).

mikedn avatar Apr 13 '16 18:04 mikedn

@mikedn Thanks for the explanation. It's enlightening. I'm less interested in how difficult it is than when it will be done. Though now I see why the discussion is about stack allocation.

SunnyWar avatar Apr 13 '16 19:04 SunnyWar

I'm happy that so many people are interested in my idea. Please have a look at another idea: https://github.com/dotnet/coreclr/issues/555. I think it's more interesting.

ygc369 avatar Apr 14 '16 01:04 ygc369

Can this improve the performance of the foreach statement? In some cases, a foreach statement causes a small object allocation:

using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        const int N = 10000;
        var list = new List<int> { 1, 2, 3, 4, 5 };

        Console.WriteLine("total memory:");

        Console.WriteLine(GC.GetTotalMemory(false));

        for (int i = 0; i < N; i++)
        {
            // no boxing. struct List<int>.Enumerator is used for iteration.
            var s2 = 0;
            foreach (var x in list)
                s2 += x;
        }

        Console.WriteLine(GC.GetTotalMemory(false));

        for (int i = 0; i < N; i++)
        {
            // list itself is not boxed (it's a class), but its struct enumerator is boxed via IEnumerable<int>
            var s3 = ViaInterface(list);
        }

        Console.WriteLine(GC.GetTotalMemory(false));

        for (int i = 0; i < N; i++)
        {
            // list is not boxed, but List<int>.Enumerator is boxed to IEnumerator<int>
            var s4 = ViaGenerics(list);
        }

        Console.WriteLine(GC.GetTotalMemory(false));
    }

    static int ViaInterface(IEnumerable<int> items)
    {
        var s = 0;
        foreach (var x in items)
            s += x;
        return s;
    }

    static int ViaGenerics<T>(T items)
        where T : IEnumerable<int>
    {
        var s = 0;
        foreach (var x in items)
            s += x;
        return s;
    }
}

ufcpp avatar Oct 27 '16 04:10 ufcpp

Can this improve the performance of the foreach statement?

Not really; this was mentioned in some posts above that discussed LINQ. Not only is it difficult to stack-allocate returned objects, but you won't even get that far, because you can't perform escape analysis on MoveNext() and Current: they're interface calls.

It may work if ViaInterface is inlined, the JIT discovers that the IEnumerable<T> is in fact a List<T>, and it then inlines GetEnumerator, but that's a lot of code to inline.

mikedn avatar Oct 27 '16 04:10 mikedn

To the question about use cases: there are certainly widespread scenarios where avoiding these small allocations can be very useful. Say I am doing game development in C# in Unity. In methods that run many times per second, it is very important to avoid allocations: small garbage allocations accumulate quickly and cause GC collection to run at some undetermined time during gameplay, producing perceptible hiccups.

This can sometimes be eased by using temporary variables that are re-assigned instead of re-allocated within the game loop. However, there are still issues: sometimes other methods called are not allocation-free, and sometimes allocations happen by mistake; e.g., up until recent versions every use of a foreach loop would generate garbage because Unity used an outdated version of the Mono C# compiler (link). Even with the new version there are cases of "unjust" allocations on the heap, and some features are discouraged (e.g. LINQ). The worst are strings: their immutability makes them generate garbage fairly frequently.

Value types are not a solution in most cases: they don't support inheritance, and this makes them quite prohibitive (personal opinion). In any case, to me this would be much more welcome than ref returns, for example. For the above-mentioned reasons I would not be able to take advantage of this in Unity immediately, but it is something to look forward to in the future.

Summary: I would really like to see this happening.

sirgru avatar Dec 29 '16 18:12 sirgru

There are a lot of 'params' methods in the hot path. String.Trim(), for example, is often used. It does not allocate unless it actually changes the string, yet calling it must heap allocate an array of chars for the params argument.

The situation is a bit more complicated than that. Java doesn't have value types, and as such it needs allocation optimizations more than .NET does.

APIs involving mutable structs (C# doesn't have immutable ones) are incredibly error-prone. There's no clarity about which copy of a struct is being mutated by which method. Structs are therefore eschewed in favor of classes in public APIs.

I think it's an oversimplification to imply .NET doesn't need allocation optimizations as much as Java does.

lilith avatar Dec 31 '16 03:12 lilith

Pardon my jumping in, but what's the status of this?

realvictorprm avatar Aug 21 '17 09:08 realvictorprm

In this example, which I believe illustrates the issue, I get a factor of ~15 between C# and Java. Also, if you change the types from class to struct, C# and Java performance become similar.

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            long t = System.Environment.TickCount;
            int l = 123456;
            Test1[] ts = new Test1[l];
            for (int i = 0; i < l; ++i)
            {
                ts[i] = new Test1(i, i.ToString("X8"), new Test2(i * 123, i * 789), new Test2(i * 7, i * 9));
            }
            for (int n = 0; n < 600; n++)
            {
                for (int i = 0; i < l; ++i)
                {
                    ts[i] = m1(ts[i]);
                }
            }
            System.Console.Out.WriteLine("T1: " + ((System.Environment.TickCount - t) / 1));
        }

        static Test1 m1(Test1 t1)
        {
            Test1 t2;
            Test1 t3;
            t2 = new Test1(t1.p1 + 1, t1.p2, t1.p3, t1.p4);
            if (t1.p1 % 2 == 0)
            {
                t3 = new Test1(t2.p1 * 2, t2.p2, new Test2(t2.p3.p1 + 1, t2.p3.p2 + 2), new Test2(t2.p4.p1 * 2, t2.p4.p2 - 1));
            }
            else
            {
                t3 = t2;
            }
            return new Test1(t3.p1 % 4579, t3.p2, new Test2(t3.p3.p1 % 456789, t3.p3.p2 % 789456), new Test2(t3.p4.p1, t3.p4.p2));
        }
    }

    public class Test1
    {
        public Test1(int p1, string p2, Test2 p3, Test2 p4)
        {
            this.p1 = p1;
            this.p2 = p2;
            this.p3 = p3;
            this.p4 = p4;
        }
        public int p1;
        public string p2;
        public Test2 p3;
        public Test2 p4;
    }

    public class Test2
    {
        public Test2(long p1, long p2)
        {
            this.p1 = p1;
            this.p2 = p2;
        }
        public long p1;
        public long p2;
    }
}

mvanassche avatar Sep 22 '17 08:09 mvanassche

FYI, @echesakovMSFT did some work on this a while back; figured I'd link up the parts here.

dotnet/coreclr#6653 refactored the jit to create a framework where one can plug in an escape analysis. There are placeholders for the escape analysis and for the expansion of an object allocation into a stack allocation.

Egor's fork has some implementations of the analysis and transformation.

AndyAyersMS avatar Jan 24 '18 23:01 AndyAyersMS

@AndyAyersMS If the JIT does escape analysis, would the program slow down? I think it would be better if the compiler could do the escape analysis.

ygc369 avatar Jan 25 '18 03:01 ygc369

There are various tradeoffs involved in determining where and how one might do escape analysis.

The compiler (and here I presume by compiler you mean CSC) has limited visibility into other methods. In particular it often sees reference assemblies and so can't really do any sort of cross-assembly analysis.

The jit sees the actual implementations of all methods and so is able to look across assemblies. But in practice it only does this for methods that it tries to inline, and it's not clear whether it is worth speculatively inlining a method just to see if it helps refine escape analysis. As you point out, the jit may not have the time to do a thorough analysis. That might change with the advent of tiered jitting. However, it seems quite possible that relatively simple escape analyses can get most of the cases that can be gotten, and trying to get too fancy here rapidly hits diminishing returns (see interprocedural alias analysis). And there may be ways for the jit to piece together interprocedural analyses, but quite often (in the absence of tiering) the jit compiles callers before callees, so the benefits may only be seen once tiering is a bit more mature.

There is a spot in between that is occupied by the IL linker and it might be an interesting place to experiment with something like this too.

AndyAyersMS avatar Jan 25 '18 03:01 AndyAyersMS

There is a spot in between that is occupied by the IL linker and it might be an interesting place to experiment with something like this too.

@AndyAyersMS Maybe an IL optimizer could do interprocedural analysis and annotate (probably with an attribute) method parameters that do not escape. When the JIT compiles a method A that calls a method B, it can simply look at the parameter annotations of B, without having to actually analyze B. And to ensure type safety, the JIT could do its own escape analysis when compiling B and fail compilation if it turns out that B escapes parameters that have been annotated by the IL optimizer. That might be the slightly tricky part, as both the JIT and the IL optimizer need to use the exact same escape algorithm; otherwise each may produce different annotations.

I suspect that could also be done only on corelib, not the entire program, and still be useful. For example, string.Format does not escape its parameters but the code is rather complex/large for the JIT to analyze it in advance when a method calls string.Format.
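A sketch of what such an annotation might look like (the attribute is hypothetical; nothing like it exists in corelib):

using System;

// Hypothetical annotation an IL optimizer could emit after whole-program analysis.
[AttributeUsage(AttributeTargets.Parameter)]
sealed class DoesNotEscapeAttribute : Attribute { }

static class Formatter
{
    // A JIT compiling a caller could trust the annotation instead of analyzing
    // this body, and re-verify it with its own analysis when compiling Format.
    public static string Format([DoesNotEscape] object[] args)
        => string.Concat(args);
}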

mikedn avatar Jan 25 '18 06:01 mikedn

@AndyAyersMS How much would simple escape analysis help with F# and Linq allocations?

wanton7 avatar Jan 25 '18 17:01 wanton7

How much would simple escape analysis help with F# and Linq allocations?

This has been mentioned before. It's pretty difficult to deal with things like LINQ, because there you call methods that have to allocate and return object references. Those methods have no way of knowing who their callers are and whether the callers escape the returned object references.

mikedn avatar Jan 25 '18 19:01 mikedn

And what's with F#?


realvictorprm avatar Jan 25 '18 19:01 realvictorprm

And what's with F#?

Well, I'm not really familiar with F#, so I can't comment much on it. But the language doesn't matter too much, except perhaps that some languages use certain coding patterns more than others.

In general, if a method allocates an object and the reference to the object does not escape then the object can be allocated on the stack. One way to escape the reference is to return it from the method.

So if you have a function that returns a tuple (System.Tuple) in F# then stack allocation isn't an option.
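A minimal C# illustration of the distinction (names are illustrative):

class Box { public int Value; }

static int NonEscaping()
{
    var b = new Box { Value = 42 }; // reference never leaves this frame:
    return b.Value;                 // a candidate for stack allocation
}

static Box Escaping()
{
    return new Box { Value = 42 };  // the returned reference escapes,
                                    // so the object must stay on the GC heap
}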

mikedn avatar Jan 25 '18 19:01 mikedn

@AndyAyersMS Do you know whether the .NET Native compiler supports escape analysis? It's an AOT compiler and can see the actual implementations of all methods.

ygc369 avatar Jan 26 '18 02:01 ygc369

I don't believe it does escape analysis. @davidwrighton might know more though.

AndyAyersMS avatar Jan 27 '18 22:01 AndyAyersMS

Is there something that could be done for F# to help ease GC pressure? Like converting small tuples and the Option type to structs. Is that something the JIT or an IL optimizer (if one can be used with F#) could do?

wanton7 avatar Jan 29 '18 08:01 wanton7

Is there something that could be done for F# to help ease GC pressure? Like converting small tuples and the Option type to structs

In F# 4.1 there are:

Struct Tuples

// Creating a new struct tuple.
let origin = struct (0,0)

// Take struct tuples as arguments to a function and generate a new struct tuple.
let getPointFromOffset ((x,y): struct (int*int)) ((dx,dy): struct (int*int)) = 
    struct (x + dx, y + dy)

// Pattern match on a struct tuple.
let doAMatch (input: struct (int*int)) =
    match input with
    | struct (0,0) -> sprintf "The tuple is the origin!"
    | struct (_,_) -> sprintf "The tuple is NOT the origin!"

Struct Records

// Regular record type
type Vector3 = { X: float; Y: float; Z:float }

// Same record type, but now it's a struct
[<Struct>]
type StructVector3 = { X: float; Y: float; Z:float }

Struct Unions (Single Case)

// Regular Single Case Union
type EmailAddress = EmailAddress of string

// Struct version of the above
[<Struct>]
type StructEmailAddress = EmailAddress of string

benaadams avatar Jan 29 '18 09:01 benaadams

@benaadams Maybe automatic conversion of small classes that meet certain criteria to structs is just a pipe dream :) F# Core's Option type is still a class, and it would be a breaking change to change that. Some of the F# compiler's optimizations don't work for struct tuples: the compiler can sometimes optimize normal tuples away, so the suggestion is to use normal tuples and optimize allocation hotspots with the struct keyword. You would also have to use the struct keyword everywhere in your code, and F# uses tuples a lot. Maybe this could be fixed by the F# compiler applying the same optimizations to struct tuples, together with an fsproj setting that defaults the project to struct tuples.

I also think even a simple escape analysis, similar to golang's, would help allocate some of those tuples on the stack when you don't return them.

But escape analysis wouldn't help if the Option type is used in something like record type fields. The Option type's None case is represented by null, so the current situation isn't that bad: an Option is only allocated when it contains a value. Maybe this is something that F# will fix with a breaking change in the future; we shall see.

wanton7 avatar Jan 29 '18 10:01 wanton7

@wanton7 One step towards "automatic" conversion would definitely be escape analysis.

Assuming this would be accepted, what would be the first step towards an implementation? A large spec?

realvictorprm avatar Mar 02 '18 11:03 realvictorprm

Hmm, Egor's fork seems to be dead. Trying to see where the code went. @echesakov ?

AndyAyersMS avatar Apr 27 '18 20:04 AndyAyersMS

@AndyAyersMS This should work now https://github.com/echesakovMSFT/coreclr/tree/StackAllocation

echesakov avatar Apr 27 '18 21:04 echesakov

Thanks!

AndyAyersMS avatar Apr 27 '18 22:04 AndyAyersMS

@AndyAyersMS @echesakovMSFT Do you know when we would have official escape analysis?

ygc369 avatar Apr 28 '18 01:04 ygc369

@echesakovMSFT Your stack allocation branch seems dead; why not continue working on it? So many people want it.

ygc369 avatar Apr 28 '18 01:04 ygc369

Egor restored the branch, and I updated it to build against master here: master...AndyAyersMS:StackAllocation. I also started looking into what it would take to stack allocate a simple delegate that is directly invoked. There is still work to do here, as the error path also causes the delegate to escape.

using System;

class X
{
    public static int Main()
    {
        int a = 100;
        Func<int> f = () => { return a; };
        return f();
    }
}

Note, somewhat paradoxically, that a lambda with no local captures compiles into a more complex sequence, because the C# compiler caches both the closure instance and the delegate in statics. This is a good idea for reducing allocations, but it means that the jit will have a much harder time reasoning about the code.

// Very simple delegate. Static caching makes flow analysis harder
using System;

class X
{
    public static int Main()
    {
        Func<int> f = () => { return 100; };
        return f();
    }
}
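For reference, this is roughly the caching pattern the compiler emits for the non-capturing case, rewritten with readable names (a simplified sketch, not the exact generated code):

using System;

class X
{
    // Simplified stand-in for the compiler-generated display class; the real
    // generated names are unspeakable.
    sealed class Cache
    {
        public static readonly Cache Instance = new Cache();
        public static Func<int> CachedDelegate; // filled in on first use
        public int Lambda() { return 100; }
    }

    public static int Main()
    {
        // The IL consults the static cache before allocating a new delegate;
        // these statics and the extra control flow are what make the jit's
        // analysis harder.
        Func<int> f = Cache.CachedDelegate
            ?? (Cache.CachedDelegate = new Func<int>(Cache.Instance.Lambda));
        return f();
    }
}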

As far as when or if this ability might be added to the jit, I can't say. I would like to sketch out more of the work involved to try and better understand the key problems that need to be solved.

Feel free to pick up this fork and play around with it. I'm not sure how correct it is when it actually does decide to allocate something on the stack.

AndyAyersMS avatar Apr 28 '18 03:04 AndyAyersMS

Note, somewhat paradoxically, that a lambda with no local captures compiles into a more complex sequence, because the C# compiler caches both the closure instance and the delegate in statics.

Would that mean that a method group conversion, which C# still doesn't cache (https://github.com/dotnet/roslyn/issues/5835), actually has an advantage for stack allocation?

using System;

class X
{
    public static int Function() => 100;

    public static int Invoke(Func<int> f) => f();
	
    public static int Main()
    {
        var result = Invoke(Function); // always allocates
        result += Invoke(() => Function()); // doesn't allocate
        return result;
    }
}

benaadams avatar Apr 28 '18 03:04 benaadams

I suppose so, though it is a funny kind of advantage.

There is a natural tension in places between doing optimizations very early, when generating IL, and waiting to do them later in the JIT. Often, doing the former makes the latter more difficult. We see this here for the non-capturing case in various ways -- the use of statics, cctors, and even the use of dup to keep a value live on the stack across control flow.

The .NET ecosystem as a whole is better off with caching closures and delegates where it is viable -- there are many different execution engines with varying capabilities, and caching greatly reduces allocation costs for all of them. But it may mean we need some extra information passed to the jit so it can safely recognize these patterns, if we care to optimize them further.

AndyAyersMS avatar Apr 28 '18 18:04 AndyAyersMS

I've been having a look at where method group conversions occur in coreclr, and the resulting delegates are mostly stored in a heap variable (e.g. passed as a param to a delegate .ctor that then gets passed somewhere else) -- so you are right, caching would probably be the more effective approach, as they wouldn't be viable as stack-only 😢

benaadams avatar Apr 28 '18 23:04 benaadams

Brief update:

  • The importer doesn't know that some temps are inherently non-null, so it can't fold away the null 'this' check for the delegate. This fact eventually surfaces during optimization, but that is too late for the current placement of escape analysis. I have a prototype fix for this, and with it we see that the delegate does not escape.
  • The closure (which is only referenced by the delegate) still escapes. It's not clear how to avoid this unless it is safe to assume that a non-escaping delegate does not let its closure escape, or unless we manage to inline the delegate invoke and see that the closure does not escape.
  • We blow up when generating the block init for the delegate, because of a bug I introduced while updating Egor's branch to latest. Have a fix.
  • With importer null folding and that fix, we generate code like the following:
; Assembly listing for method X:Main():int
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 tmp0         [V00,T01] (  3,  6   )     ref  ->  rax         class-hnd exact non-null
;  V01 tmp1         [V01,T00] (  4,  8   )     ref  ->  rsi         class-hnd exact non-null
;* V02 tmp2         [V02    ] (  0,  0   )    long  ->  zero-ref
;  V03 tmp3         [V03,T03] (  3,  3   )  struct (72) [rsp+0x20]   do-not-enreg[SFB] must-init
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]
;  V05 rat0         [V05,T02] (  2,  4   )     ref  ->  rsi
;
; Lcl frame size = 104

G_M49993_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC68             sub      rsp, 104
       488D7C2420           lea      rdi, [rsp+20H]
       B912000000           mov      ecx, 18
       33C0                 xor      rax, rax
       F3AB                 rep stosd

G_M49993_IG02:
       48B90054C964F97F0000 mov      rcx, 0x7FF964C95400
       E84DFD815F           call     CORINFO_HELP_NEWSFAST     // closure still escapes
       C7400864000000       mov      dword ptr [rax+8], 100

;; setup the delegate on the stack

       33D2                 xor      rdx, rdx
       488D4C2420           lea      rcx, bword ptr [rsp+20H]  // delegate on stack
       488911               mov      qword ptr [rcx], rdx
       48BA403CA3C1F97F0000 mov      rdx, 0x7FF9C1A33C40   // delegate method table
       4889542428           mov      qword ptr [rsp+28H], rdx
       488D542420           lea      rdx, bword ptr [rsp+20H]
       488D7208             lea      rsi, [rdx+8]
       488D4E08             lea      rcx, bword ptr [rsi+8]
       488BD0               mov      rdx, rax
       E8E8236C5F           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       48B93817DB64F97F0000 mov      rcx, 0x7FF964DB1738  // function pointer
       48894E18             mov      qword ptr [rsi+24], rcx

;; delegate now pointed to by RSI .. fetch this and call via the function pointer

       488D4E08             lea      rcx, bword ptr [rsi+8]
       488B09               mov      rcx, gword ptr [rcx]
       FF5618               call     qword ptr [rsi+24]System.Func`1[Int32][System.Int32]:Invoke():int:this
       90                   nop

G_M49993_IG03:
       4883C468             add      rsp, 104
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

It seems like a next step is to run the stack conversion earlier, so that we can promote the delegate struct (this will require some tricks, as it is too big to promote right now) and propagate the field values. Then turn the call from indirect to direct. Then it would be really nice to inline the call... but that will require a bunch more work.

If/when we gain the ability to inline, we will likely discover that some locals previously thought to escape no longer do so. So one thought is to move stack allocation opts early and do them at the same time as inlining. But then we'd lose out on struct promotion and propagation. The other tack would be to revise inlining so we could incrementally inline later on, but we would need to do that before morph. So it feels like there are a bunch of phase ordering problems to sort through here.

AndyAyersMS avatar Apr 30 '18 23:04 AndyAyersMS

Did some prototyping on top of the changes above to get the delegate to be promotable. Lots of things to sort through on the way:

  • Updated the object allocation phase so it can run before morph; required understanding escape behavior of GT_FIELD (which is only partially correct in my branch) and hacking fgMorphIntoHelperCall to provide a variant that can run before morph.
  • Likewise the IR checker doesn't expect GT_FIELD and since the object allocator is a phase I had to provide a way to suppress phase checks.
  • Also, the phase was asserting that dominance was computed but never uses the dominance info, so I commented out this assert.
  • With all that in place, moved the phase to just after inlining. But the delegate type was not promoted.
  • Started working on promotion. Multicast delegate has 6 fields and jit won't promote any struct with more than 4, so upped the limit for delegates.
  • Had to generalize field discovery since ref types "inherit" their parent class fields. This required a bunch of rework in lvaCanPromoteStructType to properly traverse parent class hierarchy.
  • Got that sorted and now the delegate was promotable, but we never registered any of the field accesses so it was not promoted. The field access tracking doesn't handle newobj ref class field accesses (not surprisingly). So added another "is delegate" bypass to allow promotion without evidence of field use.
  • Delegate struct now gets promoted but because it is address exposed the promotions are dependent and (I presume) we don't try and copy prop through the fields.

Current jit dump: (same code as above but with promoted fields):

; Assembly listing for method X:Main():int
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 tmp0         [V00,T01] (  3,  6   )     ref  ->  rax         class-hnd exact non-null
;  V01 tmp1         [V01,T00] (  4,  8   )     ref  ->  rsi         class-hnd exact non-null
;* V02 tmp2         [V02    ] (  0,  0   )    long  ->  zero-ref
;  V03 tmp3         [V03    ] (  3,  3   )  struct (72) [rsp+0x20]   do-not-enreg[XSFB] must-init addr-exposed
;  V04 tmp4         [V04    ] (  3,  3   )     ref  ->  [rsp+0x28]   do-not-enreg[X] addr-exposed V03.??(offs=0x08) P-DEP
;  V05 tmp5         [V05    ] (  3,  3   )     ref  ->  [rsp+0x30]   do-not-enreg[X] addr-exposed V03.??(offs=0x10) P-DEP
;  V06 tmp6         [V06    ] (  3,  3   )    long  ->  [rsp+0x38]   do-not-enreg[X] addr-exposed V03.??(offs=0x18) P-DEP
;  V07 tmp7         [V07    ] (  3,  3   )    long  ->  [rsp+0x40]   do-not-enreg[X] addr-exposed V03.??(offs=0x20) P-DEP
;  V08 tmp8         [V08    ] (  3,  3   )     ref  ->  [rsp+0x48]   do-not-enreg[X] addr-exposed V03.??(offs=0x28) P-DEP
;  V09 tmp9         [V09    ] (  3,  3   )    long  ->  [rsp+0x50]   do-not-enreg[X] addr-exposed V03.??(offs=0x30) P-DEP
;  V10 OutArgs      [V10    ] (  1,  1   )  lclBlk (32) [rsp+0x00]
;  V11 rat0         [V11,T02] (  2,  4   )     ref  ->  rsi
;
; Lcl frame size = 104

G_M49993_IG01:
       57                   push     rdi
       56                   push     rsi
       4883EC68             sub      rsp, 104
       488D7C2420           lea      rdi, [rsp+20H]
       B912000000           mov      ecx, 18
       33C0                 xor      rax, rax
       F3AB                 rep stosd

G_M49993_IG02:
       48B900549E60F97F0000 mov      rcx, 0x7FF9609E5400
       E84DFD825F           call     CORINFO_HELP_NEWSFAST
       C7400864000000       mov      dword ptr [rax+8], 100
       33D2                 xor      rdx, rdx
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       488911               mov      qword ptr [rcx], rdx
       48BA403CAABCF97F0000 mov      rdx, 0x7FF9BCAA3C40
       4889542428           mov      qword ptr [rsp+28H], rdx
       488D542420           lea      rdx, bword ptr [rsp+20H]
       488D7208             lea      rsi, [rdx+8]
       488D4E08             lea      rcx, bword ptr [rsi+8]
       488BD0               mov      rdx, rax
       E8E8236D5F           call     CORINFO_HELP_CHECKED_ASSIGN_REF
       48B93817B060F97F0000 mov      rcx, 0x7FF960B01738
       48894E18             mov      qword ptr [rsi+24], rcx
       488D4E08             lea      rcx, bword ptr [rsi+8]
       488B09               mov      rcx, gword ptr [rcx]
       FF5618               call     qword ptr [rsi+24]System.Func`1[Int32][System.Int32]:Invoke():int:this
       90                   nop

G_M49993_IG03:
       4883C468             add      rsp, 104
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

; Total bytes of code 120, prolog size 20 for method X:Main():int

So... where to go from here.

First, we likely need to inline the call to Invoke earlier. It currently happens in lower, though I do not see any reason why it can't be done sooner. That in turn might let us remove the address exposure of V03, as V01 (the pointer to the delegate) will now be unused -- though we would only see that if we forward substituted the initial value of V01 (which is &V03 + 8) to its use(s). We might get tripped up by non-field stores to V03 (e.g. the vtable is not backed by a field, nor is the preheader). But if we can get that far, then that should allow independent promotion of the fields and hopefully let us remove the unused delegate fields (of which there are many) and propagate the closure pointer down to the call site. Then we might be set to also propagate the method pointer there and turn the call from indirect to direct.

And then, if we can somehow pull off late inlining, we can see the closure is not exposed either...

Note all this talk of "propagation" is premature as this early in the jit we have no way currently of doing anything like this.

AndyAyersMS avatar May 02 '18 01:05 AndyAyersMS

After a good deal more wrangling things are starting to look somewhat respectable.

  • Moved inlining of Invoke into the main inlining phase. Accept calls to Invoke as inline candidates and work around the fact that they don't have IL or method info. Manually expand them and then declare them (for now) as failed inlines.
  • Inject pseudo fields for preheader and method table of the promoted ref class, so that the delegate is tiled by its fields and accesses can be done via fields.
  • Fix overestimating the size of the stack needed; we were double counting the preheader.
  • The reference point shift to bias the new "this" pointer past the preheader was causing grief. Eventually I decided that any object that could be stack allocated probably would not need a preheader. This rules out objects used for locking and some hash / string properties but I think it's a reasonable compromise.
  • Since we are assuming newobj temps are single-def already, we can rewrite their appearances when we stack allocate to be address-of-stack-local. So we keep track of the newobj temp / stack local pairs and do this rewrite if we stack allocate anything
  • Given that rewrite, and making sure to use pseudo-fields for the delegate invoke call's references to the delegate, we now have all touches of the stack-allocated delegate using fields based on the address of the stack local, and no more pointer math. That lets us realize the stack local is not address exposed, which allows per-field optimization.

The resulting method is now:

; Assembly listing for method X:Main():int
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 tmp0         [V00,T00] (  3,  6   )     ref  ->  rax         class-hnd exact non-null
;* V01 tmp1         [V01    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact non-null
;* V02 tmp2         [V02    ] (  0,  0   )    long  ->  zero-ref
;* V03 tmp3         [V03    ] (  0,  0   )  struct (56) zero-ref
;* V04 tmp4         [V04    ] (  0,  0   )    long  ->  zero-ref    V03.??(offs=0x00) P-INDEP
;  V05 tmp5         [V05,T01] (  3,  3   )     ref  ->  rcx         V03.??(offs=0x08) P-INDEP
;* V06 tmp6         [V06    ] (  0,  0   )     ref  ->  zero-ref    V03.??(offs=0x10) P-INDEP
;* V07 tmp7         [V07,T02] (  0,  0   )    long  ->  zero-ref    V03.??(offs=0x18) P-INDEP
;* V08 tmp8         [V08    ] (  0,  0   )    long  ->  zero-ref    V03.??(offs=0x20) P-INDEP
;* V09 tmp9         [V09    ] (  0,  0   )     ref  ->  zero-ref    V03.??(offs=0x28) P-INDEP
;* V10 tmp10        [V10    ] (  0,  0   )    long  ->  zero-ref    V03.??(offs=0x30) P-INDEP
;  V11 OutArgs      [V11    ] (  1,  1   )  lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40

G_M49993_IG01:
       4883EC28             sub      rsp, 40

G_M49993_IG02:
       48B90054DD60F97F0000 mov      rcx, 0x7FF960DD5400
       E85DFD805F           call     CORINFO_HELP_NEWSFAST
       C7400864000000       mov      dword ptr [rax+8], 100
       488BC8               mov      rcx, rax
       48B83817EF60F97F0000 mov      rax, 0x7FF960EF1738
       FFD0                 call     rax
       90                   nop

G_M49993_IG03:
       4883C428             add      rsp, 40
       C3                   ret

; Total bytes of code 47, prolog size 4 for method X:Main():int

So we allocate and initialize the closure, then invoke a method on it (still indirectly).

Plausible next steps:

  • propagate the function address into the call and make it a direct call
  • leave the inline delegate invoke as a residual inline candidate
  • selectively re-invoke inlining (hopefully we can fill in the details) if we do such a conversion
  • (super stretch goal) if arguments to inlinee were escaped local allocations, rerun escape analysis and see if they now no longer escape (will be true above)
  • rerun stack allocation to convert them over (or if unused, just remove?)

Am going to continue on in prototype mode for a while here and then circle back and try to describe what steps we'd actually take if we want to pursue this sort of thing for real.

Prototype bits are here master...AndyAyersMS:NonNullPlusStackAlloc in case anyone's curious what this looks like. Not guaranteed to work on any example other than the above.

AndyAyersMS avatar May 03 '18 01:05 AndyAyersMS

Fixed some issues enabling this by default for crossgen of corelib. Note I don't see any stack allocation conversions in corelib (Egor's notes indicated he was seeing some). Not sure why just yet; either I've introduced a bug or the patterns we look for are no longer there.

We will not see conversions when crossgenning in R2R mode, because the newobj in R2R is more abstract; if we converted, we'd introduce a hard dependence on ref class size and layout and make R2R less resilient.

So perhaps (depending on what is going on in corelib) this is best thought of as a jit-only optimization.

AndyAyersMS avatar May 03 '18 16:05 AndyAyersMS

Also (note to self) jit-diffs must be run in two steps in two repos as this change modifies the jit interface but does not yet modify the GUID -- so running a stock baseline jit will trigger a guard CFG fault on Windows. Had me puzzled there for a while.

Running jit-diffs (once I got past that puzzle) shows the main impact is from removing null checks from delegate invokes. Not sure this is correct so will need to look at that closely.

AndyAyersMS avatar May 03 '18 16:05 AndyAyersMS

Scratch that: it's not a null check, just the folding of an address mode. The jit's current late delegate expansion causes it to miss a folding opportunity.

Somewhere along the line we lose the magic disassembly display. Current lowering for Invoke is a bit odd -- we still claim it is a direct call but have an indirect gtControlExpr. So likely it can use the method handle from the direct call to get the text.

;;; before
       lea      rcx, bword ptr [rax+8]
       mov      rcx, gword ptr [rcx]
       mov      rdx, rdi
       call     gword ptr [rax+24]System.Func`2[__Canon,__Canon][System.__Canon,System.__Canon]:Invoke(ref):ref:this

;;; after
       mov      rcx, gword ptr [rax+8]
       mov      rdx, rdi
       call     gword ptr [rax+24]

AndyAyersMS avatar May 03 '18 18:05 AndyAyersMS

Magic disassembly relies on the method handle, which lives in a union with the call target address. So we'd need a separate (perhaps debug only) field to tunnel this through. Not worth the trouble right now.

One other note -- you might have wondered why we don't tail call the delegate in any of the examples above. The VM won't let the jit make tail calls from Main so that the debugging experience doesn't get too weird. But if we put the code in a noinline helper method invoked from Main, it should tail call.

AndyAyersMS avatar May 03 '18 18:05 AndyAyersMS

Hello all... this has been on my docket for a while, and I am just now finding the time to attend to it. To start, I'd like to send a little thanks to @AndyAyersMS for pointing me to this issue from the comments of this recent great performance article about .NET Core 2.1.

My primary motivator here is simply awareness. As also discussed in the comments of the above article, I have posted a question on StackOverflow regarding .NET Core's performance in this area; it provides my research on this matter and also highlights how performance could be further improved for this particular scenario.

Essentially, the typical pattern I am finding in my code reflects very much what the above StackOverflow question presents. I code much in the way of delegates these days, and to me a delegate represents a method call site, for lack of a better expression. It would seem that if a method body simply consists of a call to another delegate, this could be inlined/optimized in some fashion so that the delegate gets called directly, rather than having to call the owning call site first (if that makes sense).

Also, it is worth mentioning I am seeing this pattern in other ways. In fact, I just now finished this great post by @ploeh who provides another similar example of this scenario here. If you scroll up a bit from the latter link, you can see again that the body of the NaturalNumber implementation method is simply a call to another delegate.

So, as a further example of the case we're describing here, assume that we're calling the NaturalNumber.Nine.Match method. There would be -- by my estimation, which of course could be totally wrong here! -- at least 20 instructions in total: 10 that involve calling the "outer" containing Successor.Match (or Zero.Match) method first, and then 10 that involve invoking the actual "inner" delegate.

The optimization that I am pawing at here would essentially remove the 10 "outer" instructions that call the external (Successor.Match or Zero.Match) method and inline the "inner" delegate, leaving 10 instructions total and essentially improving performance by 50% ... in theory 😆.

Again, this is all based on my limited knowledge in this area, so I could totally be wrong in my understanding and subsequent calculations. I would love feedback here to ensure my understanding is correct, and of course any correction to help bring my understanding better in tune with how everything here actually operates.

To summarize, my intention here is simply to provide a little more context and awareness from my perspective, and to raise my hand in a way to say that there might be a way to further optimize this pattern/scenario. I have been seeing it a lot in my personal coding adventures which led to the StackOverflow question as well as with other examples/prescriptions as noted with the blog post above.

Thank you for any and all consideration/conversation towards this!

Mike-E-angelo avatar May 28 '18 15:05 Mike-E-angelo

@Mike-EEE @AndyAyersMS Since this issue is about stack allocation, shouldn't discussion about delegate inlining continue somewhere else (maybe at https://github.com/dotnet/coreclr/issues/6737)? Though I realize the two problems are connected.

svick avatar May 28 '18 16:05 svick

Sure @svick that sounds good. Not much more to discuss from my side -- as TBH this is way outside of my comfort zone ATM -- but more of just sharing some data points and additional context for added perspective. 👍

Probably a safe bet, though, that everyone here is already maxed out on all the data they have, I'm guessing. 😆

Mike-E-angelo avatar May 28 '18 16:05 Mike-E-angelo

In the notes above I outline a bunch of engineering challenges to sort out before we even get in the neighborhood of being able to inline delegates. Many of those are interesting and valuable jit improvements in their own right (like stack allocation, which is the subject of this issue). So we may spend quite a while working through the prerequisites.

An important thing to emphasize is that even if we do all that, and push forward to actually inline delegates, the scope of the optimization is still going to be limited -- the jit must see delegate creation and invocation end up in the same method before it can optimize. We can enhance that scope via inlining, but there are limits there too.

So the likely outcome is that with all that work, only some subset of delegate invocations will get inlined (sort of like where we are now with devirtualization). We can tell already that a number of popular compositional patterns -- say like the way LINQ uses delegates, and the examples you point out -- are not really amenable to optimization this way.
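A rough sketch of that boundary (illustrative only; the method names are mine):

using System;

static class Demo
{
    static int Visible()
    {
        Func<int> f = () => 42;  // creation and invocation are both in
        return f();              // view here, so in principle inlinable
    }

    static int NotVisible(Func<int> f)
    {
        // Creation happened in some other method; without inlining or
        // feedback this must stay a real (indirect) delegate call.
        return f();
    }
}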

But all hope is not lost. The way to get around these limitations is to start making speculative bets. And that ties back into tiered jitting and taking advantage of feedback from the running application.

So going forward we have two main avenues to explore: improving the ability of the jit to optimize when it can see all the pieces of the puzzle, and relying on feedback to make intelligent guesses when it can't.

AndyAyersMS avatar May 29 '18 07:05 AndyAyersMS

Are we talking about true stack allocation or are we talking about scalar replacement, as the JVM does?

Also, would it be possible to implement a separate thread-local heap allocator that could be used in place of the global GC for object trees determined to be thread-local and non-escaping in a particular scope? The JIT or AOT could then deterministically free those object trees when they go out of scope without adding to GC pressure, and maybe even be able to support additional scenarios that scalar replacement or stack allocation couldn't (for example being able to pass such an object to a function). Assuming it's possible to do the necessary escape analysis through a method call.

JeroMiya avatar Mar 19 '20 13:03 JeroMiya

Are we talking about true stack allocation

yes

or are we talking about scalar replacement

The JIT currently only does this for structs; it's called struct promotion.
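For illustration, roughly the kind of case struct promotion handles today (sketch; the names are mine):

struct Point { public int X; public int Y; }

static class Example
{
    static int LengthSquared(int x, int y)
    {
        // The jit can break p into its two int fields and keep them in
        // registers, so the Point value never needs to live in memory.
        var p = new Point { X = x, Y = y };
        return p.X * p.X + p.Y * p.Y;
    }
}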

Assuming it's possible to do the necessary escape analysis through a method call.

That is the primary problem: https://github.com/dotnet/runtime/issues/1661#issuecomment-573781871

Suchiman avatar Mar 19 '20 13:03 Suchiman

Interesting. What about compiler hinting? Could the C#/F#/VB compilers optionally do more extensive analysis than the JIT is capable of doing, then pass the results of the analysis to the JIT and/or AOT engines?

I hesitate to suggest it, because I'm hoping for a solution that wouldn't require any changes to existing code, but what about a language-level hint that the compiler could enforce? A while ago I toyed with the idea of a "local" keyword for this purpose (disclaimer: not a fully baked idea/proposal, just a random thought):

static MyClass staticObj;

class MyClass {
    public int Field { get; set; }
    [NoEscape]
    public void Mutate() {
        // staticObj = this; // <-- NOT allowed, because of [NoEscape]
        Field = Field * 2;
    }
}

local MyClass MethodWithLocalArgAndReturn(local MyClass param)
{
    // staticObj = param; // <-- NOT allowed
    local var localObj = param; // allowed to assign local to local
    localObj.Mutate(); // allowed, Mutate is marked with [NoEscape]
    return localObj; // allowed because return value is declared as local
}

void Method() {
    local var localObj = new MyClass();
    MethodWithLocalArgAndReturn(localObj); // Allowed because argument is local
    // localObj goes out of scope, AOT/JIT can be sure no references exist after that point
    // can optimize accordingly
}

This is not my preferred solution - I'd rather have the compiler and JIT/AOT do the work for us and optimize accordingly.

Also, worth considering that the types of applications that would need this optimization would be good candidates for AOT or slower-to-compile/faster-to-run style optimizations.

JeroMiya avatar Mar 19 '20 14:03 JeroMiya

I think the C# compiler should do simple escape analysis; the generated IL code would then already be optimized, so the JIT would not need to do this expensive work. Besides, an AOT compiler should do full escape analysis, because it can see all the code it compiles.
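For example, a trivial case such an analysis could catch (sketch; the Pair type is illustrative):

class Pair { public int A; public int B; }

static class Example
{
    static int Sum(int a, int b)
    {
        // The reference is never stored to a field or a static, never
        // passed out, and never returned, so the allocation could safely
        // go on the stack (or be scalar-replaced away entirely).
        var p = new Pair { A = a, B = b };
        return p.A + p.B;
    }
}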

ygc369 avatar Mar 21 '20 08:03 ygc369

@AndyAyersMS @jkotas I heard that the Microsoft teams have stopped the escape analysis project. Is there an opportunity to restart it? If it is so difficult, I wonder how the JVM implements it.

ygc369 avatar Mar 27 '20 13:03 ygc369

There are some important differences between the JVM and the CLR, and these differences influence the relative priorities of optimizations.

I think it's likely we'll resume working on escape analysis someday, but there are some enabling technologies we need to work on first.

AndyAyersMS avatar Mar 27 '20 18:03 AndyAyersMS

@AndyAyersMS Thanks for the comment. But what are these enabling technologies? Could you give an example? Besides, since fully automatic escape analysis is so difficult, I suggest providing some syntax to allow manually allocating objects on the stack. For example,

class Foo
{
    private int[] arr;
    Foo(int n) { arr = new int[n]; }
}
var foo = stackalloc Foo(10); // reuse the syntax "stackalloc" to allocate the object "foo" from stack.
......

Note that the internal array in "foo" should also be stack-allocated (although the keyword "new" is used in the constructor), as long as the array does not exceed the remaining stack space; otherwise it would still be allocated from the heap. The CLR should decide where to allocate the internal objects inside a stackalloc'ed outer object according to the stack size. If the stackalloc'ed object is later referenced from the heap, it should be moved (promoted) to the heap automatically. But this kind of promotion should be reported to programmers in debug mode, to help them find objects that are not suitable for stack allocation.

ygc369 avatar Mar 29 '20 02:03 ygc369

Hmm, not sure if explicit stack allocation of objects is the answer. They would be so limited that I wonder whether they'd even be useful for the scenarios you'd want to use them for.

JeroMiya avatar Mar 29 '20 05:03 JeroMiya

@AndyAyersMS I think that tiered JIT might be a solution for escape analysis. When code is compiled by the JIT compiler for the first time, the JIT compiler could execute an aggressive stack allocation strategy: when it can't quickly be decided whether an object should be stack-allocated, just assume it is safe to allocate the object on the stack, and then, if the stack-allocated object is referenced from the heap, copy (move) it to the heap. When the tiered JIT recompiles the method, the code should be regenerated to directly heap-allocate those objects proven to escape.

ygc369 avatar Aug 23 '20 00:08 ygc369

@ygc369 current thinking is that perhaps we could run escape analysis during an AOT phase where we have more time and larger scope (that is we can do an interprocedural analysis) -- then have the jit rely on this in some manner.

The other thought is to look into partial escape analysis where we "just in time" move objects to the heap if control reaches an escape point; this likely would need some form of profile-based feedback to be done profitably.
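Roughly the kind of conditional escape a partial analysis targets (sketch only; Box and _escaped are names I made up):

class Box { public int Value; }

static class Example
{
    static object _escaped;

    static int Compute(bool rare)
    {
        var box = new Box { Value = 21 };  // optimistically stack-allocated
        if (rare)
            _escaped = box;  // escape point reached: a partial escape
                             // analysis would copy box to the heap here,
                             // paying for the allocation only on this
                             // cold path
        return box.Value * 2;
    }
}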

AndyAyersMS avatar Aug 25 '20 16:08 AndyAyersMS

@ygc369 @AndyAyersMS Couple of questions:

Can the compiler inject hints into the CIL to reduce the performance impact of escape analysis on the JIT engine?

Secondly, however and whenever the escape analysis runs, does the optimization always have to take the form of stack-allocated objects or scalar replacement? Would it be possible for the runtime/JIT/AOT to allocate and automatically free non-escaping heap objects and their child references when they go out of scope, without triggering a GC on either alloc or free? Kind of like a "scope and thread local heap". While this wouldn't be as efficient as stack allocation or scalar replacement, would it be more flexible and comprehensive in terms of where the optimization could be used? Would there be a performance impact in doing this (e.g. inserting an implicit exception handler into that scope to ensure the object is freed)?

JeroMiya avatar Aug 27 '20 13:08 JeroMiya

@AndyAyersMS is there anything already implemented on escape analysis, in the JIT or somewhere else? I'd like to learn more about it. Can you point me at code/docs/discussions?

ldematte avatar Aug 28 '20 07:08 ldematte

It's 2022; in three more years this issue will be 10 years old. Will the .NET team do something like a ZGC for .NET?

We all know that the C++ committee delays and slows down C++ development, but the .NET Foundation is relatively young and shouldn't be like that. And the GC is just an implementation, not a standard; this shouldn't be delayed for so long.

sgf avatar Jul 23 '22 16:07 sgf

It's 2022; in three more years this issue will be 10 years old. Will the .NET team do something like a ZGC for .NET?

We all know that the C++ committee delays and slows down C++ development, but the .NET Foundation is relatively young and shouldn't be like that. And the GC is just an implementation, not a standard; this shouldn't be delayed for so long.

I think it's more along the lines of determining whether this is the right solution to the problem. The JVM's optimizations around allocations have a significant impact, but they are limited in ways that a developer won't necessarily have ready insight into. I'm not an expert on GC by any means, but my limited understanding leads me to believe some form of deterministic deallocation mechanism or a more aggressive object pool for temporary variables would be more feasible. Not least because it would allow you to continue passing "temporary" instances to methods that take them as arguments, including methods on the instance itself. After all, an optimization is no good if all the code you normally write falls outside the prerequisites.
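For comparison, the manual version of the "pool the temporaries" idea already exists for arrays via System.Buffers.ArrayPool<T>; a sketch of typical usage:

using System.Buffers;

static class Example
{
    static int SumFirst(int count)
    {
        // Rent may hand back a larger array than requested; Return puts
        // it back in the pool so later callers can reuse it instead of
        // allocating a fresh array.
        int[] buffer = ArrayPool<int>.Shared.Rent(count);
        try
        {
            for (int i = 0; i < count; i++) buffer[i] = i;
            int sum = 0;
            for (int i = 0; i < count; i++) sum += buffer[i];
            return sum;
        }
        finally
        {
            ArrayPool<int>.Shared.Return(buffer);
        }
    }
}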

JeroMiya avatar Jul 23 '22 16:07 JeroMiya

@sgf I think you might be underestimating how complex this issue is. The GC isn't really the main issue here; it's the escape analysis required to ensure the optimization is safe to perform.

Either way, @AndyAyersMS has indicated in https://github.com/dotnet/runtime/issues/11192#issuecomment-1185855115 that this is potentially on deck to receive more attention after .NET 7.

PathogenDavid avatar Jul 23 '22 16:07 PathogenDavid

@sgf I think you might be underestimating how complex this issue is. The GC isn't really the main issue here; it's the escape analysis required to ensure the optimization is safe to perform.

Either way, @AndyAyersMS has indicated in #11192 (comment) that this is potentially on deck to receive more attention after .NET 7.

If the .NET team does not have the technical ability to solve the problem, then learn from the Go team and hire outside experts to solve it. Times have changed, and .NET's current GC can only handle toy-level applications.

I don't know if this is a typical problem for a huge company like Microsoft, but .NET is moving too slowly and has missed too many opportunities. Although things are getting better with the current .NET, it seems to be too late.

The slow pace of .NET has even affected the development of Unity3d. That is one of the few areas where .NET is still popular. I don't want .NET to miss out again and again like Windows Phone.

My words don't sound good; maybe some people won't like them and will be unhappy. But that's the truth, isn't it?

sgf avatar Jul 23 '22 17:07 sgf

In order to maintain the illusion of .NET prosperity, I star almost every .NET project I see. Although this is trivial, .NET open source projects are also part of the .NET ecosystem, and it at least incentivizes people to keep maintaining them.

sgf avatar Jul 23 '22 17:07 sgf

@sgf

If the .NET team does not have the technical ability to solve the problem, then learn from the Go team and hire outside experts to solve it. Times have changed, and .NET's current GC can only handle toy-level applications.

Riiiight, scroll to the Garbage Collection headline

Suchiman avatar Jul 23 '22 18:07 Suchiman

If the .NET team does not have the technical ability to solve the problem, then learn from the Go team and hire outside experts to solve it. Times have changed, and .NET's current GC can only handle toy-level applications.

I don't know if this is a typical problem for a huge company like Microsoft, but .NET is moving too slowly and has missed too many opportunities. Although things are getting better with the current .NET, it seems to be too late.

The slow pace of .NET has even affected the development of Unity3d. That is one of the few areas where .NET is still popular. I don't want .NET to miss out again and again like Windows Phone.

My words don't sound good; maybe some people won't like them and will be unhappy. But that's the truth, isn't it?

There are many trillions of lines of code written on the .NET platform, in some cases within codebases that are already more than a decade old. The fact that the .NET ecosystem evolves as fast as it does is a testament to the "technical ability" of all the contributors to the .NET ecosystem. I've worked on applications that started out as Windows services running on premises and ended up running in a Kubernetes cluster on a cloud platform. It speaks well of everyone involved that fundamental changes to core pieces, like the GC, are made only after very careful consideration.

Also, when you get a chance, I kindly suggest reading the Code of Conduct.

JeroMiya avatar Jul 23 '22 18:07 JeroMiya

Hope this can be implemented in .NET 8. @AndyAyersMS

ygc369 avatar Jul 25 '22 13:07 ygc369

I understand that allocating objects on the stack may be vastly more complex than it appears on the surface.

Perhaps a more doable idea is to enable a new type of heap allocation, using a new heap which is not managed by the garbage collector and requires manual allocation and freeing of memory.

We can be smarter than the GC about allocating/freeing memory; this new functionality would let us put that knowledge into action.

I'd love to hear people's thoughts on this, any obvious caveats that I've missed?

Zintom avatar Aug 06 '22 22:08 Zintom

@Zintom why do you think a heap allocator would be cheaper? How would the GC detect roots from your heap?

danmoseley avatar Aug 07 '22 06:08 danmoseley

Arena-based allocators ("allocate a temp heap, do whatever you want within it, then remove the heap entirely") have the same problem as object stack allocation: there would need to be escape analysis proving that no object refs "escape" to the actual GC heap.

So I guess you would need (see the sketch after this list):

  1. an escape analysis in Roslyn (prove refs won't escape and fail compilation otherwise, or else such an allocator is meaningless)

  2. another in the JIT (because the JIT can't rely on the IL compiler's opinion alone; it needs to be sure too)

  3. also a means to allocate "real heap" objects to save the result of this "off-heap" computation, plus proof that the data is strictly copied there and not referenced.
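A sketch of the escape hazard that analyses (1)-(3) must rule out. ScopedArena is hypothetical -- nothing like it exists in the BCL today:

using System;

// Stand-in for a real arena, which would hand out memory from a block
// and free the whole block at once in Dispose().
sealed class ScopedArena : IDisposable
{
    public T Allocate<T>() where T : new() => new T();
    public void Dispose() { /* free the whole block here */ }
}

class Node { public int Value; }

static class Demo
{
    static object _leak;

    static void Work()
    {
        using var arena = new ScopedArena();
        var node = arena.Allocate<Node>();
        node.Value = 42;  // fine: used only inside the arena's scope
        // _leak = node;  // the escape the analyses above must prove
                          // impossible -- after Dispose, _leak would
                          // point at freed memory
    }
}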

En3Tho avatar Aug 07 '22 07:08 En3Tho

How would the GC detect roots from your heap?

My idea is that any object referenced from this "unmanaged heap" would be marked as "possibly live, linked to an object in the unmanaged heap", and thus the GC would skip collecting it.

why do you think a heap allocator would be cheaper?

Considering most of the runtime is built around the idea of "the fewer allocations the better" in order to avoid GC pressure, I believe this could have massive potential for the BCL and the wider ecosystem in general. I think most intermediate developers and above know how to manage their allocs/frees. Maybe it wouldn't be used everywhere, but in critical sections of code or in libraries we could see major improvements; the average Joe wouldn't have to know that an unmanaged heap even existed, but would benefit because his allocations wouldn't have such an impact on his program's performance.

Zintom avatar Aug 07 '22 14:08 Zintom

If at all possible, I would prefer GC optimizations that didn't require explicit action by the developer. Ideally they should be transparent to the developer, like strategies where the JIT or AOT engine detects non-escaping allocations and implements optimizations, such as scalar replacement, without the developer needing to explicitly mark a variable as non-escaping or allocating it on a special heap.

There's a lot of existing code out there, including in the BCL, where such optimizations would be beneficial. It's pragmatically a non-starter to require that code be updated or rewritten to explicitly make use of those optimizations.

My question for the experts would be: would it even be an optimization if, hypothetically, the runtime could deterministically deallocate some non-escaping heap allocations when they go out of scope? Scalar replacement aside, suppose the JIT could deterministically deallocate a temp variable when it goes out of scope: that's not free from a performance perspective, is it? I imagine it would negatively affect the throughput of the method.
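Morally, the JIT-inserted cleanup might look like the following (sketch only; Runtime.Free is hypothetical -- the CLR exposes no such operation for managed objects):

class Temp { public int Value; }

static class Example
{
    static int M()
    {
        var tmp = new Temp { Value = 21 };
        try
        {
            return tmp.Value * 2;
        }
        finally
        {
            // Runtime.Free(tmp);  // hypothetical deterministic free; the
                                   // implicit try/finally itself is part
                                   // of the throughput cost mentioned above
        }
    }
}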

I think the question suggests we should step back and analyze what it is we're actually optimizing. GC pressure is "bad" for a specific type of performance. For example, for real-time graphics applications or animated UI frameworks, a GC pause can cause visual stuttering. For high throughput applications like microservices and other backends, high GC pressure might cause throughput issues if the GC can't scale with the demands being placed on it. How can we solve these specific problems? Is deterministic deallocation for temporary objects really the best solution? I'd love to hear some insights from the experts.

Worth noting that both real-time graphics/UI applications and high-throughput microservices tend to do relatively short bursts of work on a particular thread or thread pool and then give control of the thread back to the pool, or else the thread pauses until it's signaled again. Just a random thought: perhaps a more efficient way to deal with temporary allocations is some kind of thread-local "temp variable GC generation". That is to say, rather than allocating all temporary variables into gen 0, when the JIT can prove a variable is non-escaping, it could allocate that variable into a separate "gen T" (for Temporary). This pool ONLY contains objects that are non-escaping for a specific thread. Then, whenever a thread gives control back to a thread pool or goes to sleep, the GC can free everything in that thread's gen T pool without a full scan.

JeroMiya avatar Aug 07 '22 16:08 JeroMiya