NullabilityInference icon indicating copy to clipboard operation
NullabilityInference copied to clipboard

Flow analysis

Open dgrunwald opened this issue 4 years ago • 7 comments

Currently our inference re-uses Roslyn's flow-analysis.

However, this has some fundamental problems.

9:	Dictionary<T, Node> mapping = new Dictionary<T, Node>();

11:	Node? GetNode(T element)
	{
13:		Node? node;
14:		if (!mapping.TryGetValue(element, out node))
15:		{
16:			node = new Node();
17:			mapping.Add(element, node);
18:		}
19:		return node;
	}

GetNode

There's no edges created for line 16/17 because here Roslyn knows that node is non-null. Line 14 creates two edges: one from <nullable> because TryGetValue will assign null when returning false; the other from mapping!1#2 (the mapping field's Node type argument) when TryGetValue returns true.

The return statement creates an edge from the variable's type, because Roslyn's flow analysis can't guarantee us that the variable is non-null -- our Roslyn code analysis runs under the pessimistic assumption that all types-to-be-inferred might end up nullable, so it considers mapping to be Dictionary<T, Node?>, which leaves open the possibility that GetNode returns null.

However, after our inference decides that mapping!1#2 is non-null, it would be correct to also indicate that the GetNode return value is non-null. After all, if no node exists yet, the function will create one.

The issue here is that Roslyn's flow analysis isn't aware of our types-to-be-inferred. It would be better if, instead of using Roslyn's flow analysis, we had our own that keeps track of node's nullability.

The idea would be to create additional "helper" graph nodes for the different flow states of a local variable of reference type. After TryGetValue initializes node, it's flow-state would be (true: mapping!1#2, false: <nullable>). Within the if body, the flow-state would initially be <nullable>, but after line 16 would change to <nonnull>. After the if, the flow-state from both alternatives could be re-combined by creating a new node "node-after-line-18" and edges from the nodes from the two if-branches -- in this case <nonnull> from the then-branch and mapping!1#2 from the else branch. Then the return statement would create an edge from this "node-after-line-18" instead of the node for the variable's declared type. All flow-state nodes associated with a variable would have an edge to the variable's declared type node. We'd end up with a graph somewhat like this: GetNode-cfg

Thus in the end, node would be inferred as nullable, but the GetNode return type would only depend on mapping!1#2 and thus can be inferred depending on whether there's a mapping.Add(x, null) access somewhere else in the program.

dgrunwald avatar Jun 08 '20 23:06 dgrunwald

If I've understood correctly, I think a simple solution might be to run multiple passes I.e. Write in current inferred state to the trees, then create a new semantic which now uses the inferred annotations. The naive approach (rerun from scratch except the initial rewrite) would likely be slow. But you can retain the graph, and restart from the first null edges known potentially?

GetSpeculativeSemanticModel could potentially help where only a single referenced related file had changed since the last pass - I'm assuming it's faster but haven't checked. While I'm thinking about it, AnalzeDataFlow is another method I keep meaning to check if would provide anything extra that's useful.

GrahamTheCoder avatar Jun 09 '20 23:06 GrahamTheCoder

That would work in a system that starts with everything as nullable, infers some non-null types, then run multiple passes (until fixed point) finding more non-null types.

It doesn't really work in the current "minimize compiler warnings" system -- Roslyn flow analysis seeing something as non-null causes us to omit edges, which could then cause us to switch a non-null type back to nullable. There's no guarantee the passes would converge to anything.

dgrunwald avatar Jun 10 '20 10:06 dgrunwald

Of course we could keep notes on which types were marked non-null by an earlier run, and leave those fixed in the following passes. That should make the iterations converge. But each iteration computing a local minimum would not necessarily result in a global minimum.

Also, with a custom flow analysis I have an idea that I think will let me infer [NotNullWhen(bool)] attributes on out parameters:

  • in addition to the primary node for the parameter's type, have two additional nodes onTrueReturn/onFalseReturn for out parameters.
  • both onTrueReturn/onFalseReturn have an edge to the parameter's primary node
  • on return true|false;, connect the current flow-state of the parameter with the onTrueReturn/onFalseReturn node.
  • on return complexCondition, connect the current flow-state with both special nodes
  • at the call side, if the method is called in an if condition, use the onTrueReturn/onFalseReturn nodes as flow-states for the assigned variable on the two if branches
  • after nullabilities are inferred for nodes, if the primary node is nullable and exactly one of onTrueReturn/onFalseReturn is non-nullable, emit the [NotNullWhen(..)] attribute

dgrunwald avatar Jun 10 '20 10:06 dgrunwald

There's a downside to adding more complex constraints to the graph: having constraints that don't directly correspond to possible compiler warnings, means that we end up minimizing something other than the number of warnings.

But I'm not sure the idea of repeatedly using Roslyn's flow-analysis works out either. The following is a valid warning-free program:

class Program
{
    public static string Test(string input)
    {
5:        string? a = null;
6:        a = input;
7:        return a;
8:    }
    public static int Main()
    {
11:        string x = string.Empty;
12:        for (int i = 0; i < 10; i++)
13:        {
14:            x = Test(x);
15:        }
16:        return x.Length;
    }
}

Test is the most simple case for flow analysis possible: the return type Test should only depend on the parameter type, never on the local variable. But the current approach without our own flow analysis produces this graph: SimpleFlow The minimum cut is marked with the red edge -- in this case we cut as close to the dereference as possible. This effectively marks all variables as nullable. In a second run, Roslyn's flow analysis won't do any better than on the first -- unless we tell Roslyn that input is non-nullable, it won't tell us that the return a; is non-nullable via flow analysis. But due to the function being fed its own output in Main, we won't mark the parameter as non-nullable unless Roslyn tells us the return can be non-nullable.

Iterating doesn't work here. We can't solve this case with Roslyn's flow analysis unless we randomly decide to try marking input as non-nullable, but doing that for every variable would be extremely slow. On the other hand, a custom flow analysis would notice that the a = null; assignment is irrelevant for the function's return type when constructing the graph, thus allowing everything to be inferred correctly in just a single pass.

dgrunwald avatar Jun 11 '20 17:06 dgrunwald

That would work in a system that starts with everything as nullable, infers some non-null types, then run multiple passes (until fixed point) finding more non-null types.

Ah yes, I was assuming the former at the time of writing despite the extensive documentation indicating the latter!

GrahamTheCoder avatar Jun 11 '20 19:06 GrahamTheCoder

I just pushed a very simple flow-analysis that is barely sufficient to handle the case from my previous comment: SimpleWithFlow The only difference to the previous graph is that the return edge starts at input rather than a. But that's sufficient to allow a to be nullable without making the whole cycle nullable -> inference now can find a warning-free solution for this case.

I'll still need to add support for the return value of a condition implying a particular flow-state (as with TryGetValue) to handle the original GetNode function that started this issue.

dgrunwald avatar Jun 11 '20 20:06 dgrunwald

And now with 3e8ca2b0d05ab0d01bde0391eb1e2e089ed0f2e4, the original example from this issue:

9:	Dictionary<T, Node> mapping = new Dictionary<T, Node>();

11:	Node GetNode(T element)
	{
13:		Node? node;
14:		if (!mapping.TryGetValue(element, out node))
15:		{
16:			node = new Node();
17:			mapping.Add(element, node);
18:		}
19:		return node;
	}

results in this graph: GetNode-WithFlow And thus the GetNode return type can be non-nullable despite the local variable node being nullable. Note that unlike the mockup graph in the issue description, I ended up not creating helper nodes for most flow-states, instead directly using the original node. At the end of the if, TypeSystem.Join(<mapping!1#2>, <nonnull>) is used to combine the flow states from the two branches. While in general this can result in the creation of a helper node, in this case Join directly returns <mapping!1#2> as the "more nullable" of the two flow-states.

dgrunwald avatar Jun 11 '20 22:06 dgrunwald