IR step for optimizing initialization of aggregates
Motivation
Initialization of aggregates currently produces a series of field-by-field value assignments. In the case of nested aggregates, the field-by-field initialized nested aggregate is additionally stored in an intermediate local and then memcopied into the parent aggregate.
There is a lot of room for improvement here. E.g., the following initialization:
A {
a: 0,
b: [0; 10],
c: C { x: 0, y: 0, z: 0 },
d: (0, 0, 0),
}
will currently result in one mem_clear_val, one mem_copy_val, and seven stores.
Obviously, it can be optimized down to just a single mem_clear_val.
The only isolated optimization we have done so far was introducing mcli/mem_clear_val for zero-initializing array repeats in #7299. However, this can be seen as just a special case of zero-initializing aggregates in general, which is in turn just a special case of initializing aggregates.
#7299 was about improving the encoding and decoding of arrays, and the optimization had only that scope in mind.
E.g., while [0; N] will be reduced to a single mem_clear_val, [0, 0, 0, 0, ...] will still be initialized either via element-by-element stores or via a loop, depending on the number of elements.
Also, [S { a: 0 }; 5] will be initialized element by element, producing five stores, while [S { a: 0 }; 10] will produce a loop.
Having, e.g., [S { a: 0, b: 0 }; <5 or 10>] will double the number of stores.
It is obvious that in all these cases, we can still have just a single mem_clear_val.
The goal of this issue is to provide a way to produce an optimized initialization for an arbitrary aggregate initialization, not only for zero-initialized aggregates. More on that in the chapter "Expected Results" below.
Expected Results
We expect aggregates to always be initialized optimally in terms of gas and bytecode cost.
Zero-initialized Aggregates
Initializing zero-initialized aggregates must always be optimized down to a single mem_clear_val, no matter how complex the aggregate structure is.
A simple measurement shows that even for small structs, e.g., S { a: u64, b: u64 }, mem_clear_val brings small wins compared to field-by-field initialization. The wins grow linearly with the size of the aggregate.
Non Zero-initialized Aggregates
Initializing non-zero-initialized aggregates must be optimized down to a minimum/optimal number of mem_clear_vals and stores, and zero mem_copy(_val)s.
E.g.:
A {
a: 0,
b: [0; 10],
c: C { x: 0, y: 0, z: 1_000 },
d: (0, 0, 0),
}
should be optimized down to a single mem_clear_val followed by a store of 1_000. There must be no temporary intermediates created for the array stored in b or the C stored in c.
Note that the values of individual fields are not necessarily constants. E.g., instead of 1_000 above, we could have the result of a runtime function call. Still, the initialization should compile down to a single mem_clear_val followed by a store of the function call result.
It is important that the optimization works well with SROA, which must still be able to eventually replace the aggregate with scalars even if mem_clear_val was used. The repeat array optimization introduced in #7299 works perfectly well with SROA.
We have to come up with good heuristics here. E.g., in this case:
S {
a: 0,
b: 42,
}
having two stores is more efficient than a mem_clear_val and a store for 42.
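The trade-off can be sketched with a toy cost model. Everything below is hypothetical: the constants, the function name, and the signature are illustrative stand-ins for real measured gas/bytecode costs, not part of the actual compiler:

```rust
// Toy cost model; the constants are illustrative, not measured gas costs.
const STORE_COST: u64 = 1;
const MEM_CLEAR_COST: u64 = 2;

/// Decide whether zeroing the whole aggregate and then patching the
/// non-zero fields beats emitting one store per scalar slot.
fn use_mem_clear(num_fields: u64, num_zero_fields: u64) -> bool {
    let field_by_field = num_fields * STORE_COST;
    let clear_then_patch = MEM_CLEAR_COST + (num_fields - num_zero_fields) * STORE_COST;
    clear_then_patch < field_by_field
}
```

Under this model, S { a: 0, b: 42 } (two slots, one zero) picks two stores, while a large, mostly-zero aggregate picks mem_clear_val plus the few patching stores.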
Implementation Proposal
Note that the current optimization for repeat arrays is effectively done in the AST-to-IR phase, in the initial IR. This is neither practical nor the right place for the general optimizations of aggregate initialization described above. We want to have those optimizations expressed in the IR as an optimization step.
However, implementing the optimizations requested above is difficult in the current IR. E.g., for arrays it would require inspecting branches and blocks and trying to recognize patterns for array initialization that can be arbitrarily complex and include other loops in the case of, e.g., arrays nested in arrays, etc. But even for "regular" aggregates, analyzing arbitrary mem_copy_vals and stores to figure out if they represent an initialization of an aggregate can be extremely difficult, even for simple cases of nested aggregates.
Note that this is a kind of reverse engineering. We have the aggregate-initialization concept in the AST; it gets lost in the IR, and then we try to reconstruct it in the IR in order to optimize it.
Therefore, the proposal is to introduce a higher level IR instruction, init_aggr, that represents the high-level concept of an aggregate initialization. This will:
- make the AST-to-IR transformation trivial. Every aggregate initialization in the AST maps to a single init_aggr instruction.
- preserve the semantics of the initialization in the IR. No need to reverse engineer it back.
- simplify writing analyses and optimizations of such initializations. It will be possible to have a general analysis that works recursively on an arbitrarily complex aggregate structure without reverse-engineering complex IR.
A dedicated optimization step "Optimizations related to initialization of aggregates" would then analyze instances of init_aggr and lower them down to optimal combinations of mem_clear_vals, stores, or loops.
The disadvantage of this approach is that we would have an IR instruction that cannot be lowered down to ASM directly. In other words, the optimization step would be mandatory.
Note that this introduces a kind of MLIR approach, in which we have a higher, close-to-language-level init_aggr instruction that lowers down to other, general IR instructions.
What could such an init_aggr instruction look like? Below is a sketch of the instruction, given for the case of initializing the following aggregate. The instruction returns the same pointer it gets as input: the pointer to the aggregate to initialize.
A {
a: 0,
b: [0; 10],
c: C { x: 0, y: 0, z: some_fn() },
d: (0, 0, 0),
}
The above example would initially compile to something like:
v_c0 = const u64 0
v_c10 = const u64 10
v_fn_result = call some_fn
v_ptr_arr_1 = get_local __ptr [u64; 10], __anon_0
v_ptr_struct_c_1 = get_local __ptr { u64, u64, u64 }, __anon_1
v_ptr_tuple_1 = get_local __ptr { u64, u64, u64 }, __anon_2
v_ptr_arr_2 = init_aggr v_ptr_arr { u64 v_c0 x v_c10 } // Or maybe: init_aggr v_arr [u64 v_c0; v_c10]. Or even: init_arr v_arr [u64 v_c0; v_c10].
v_ptr_struct_c_2 = init_aggr v_ptr_struct_c { u64 v_c0, u64 v_c0, u64 v_fn_result }
v_ptr_tuple_2 = init_aggr v_ptr_tuple { u64 v_c0, u64 v_c0, u64 v_c0 }
v_val_arr = load v_ptr_arr_2
v_val_struct_c = load v_ptr_struct_c_2
v_val_tuple = load v_ptr_tuple_2
v_ptr_var_a = get_local __ptr { u64, [u64; 10], { u64, u64, u64 }, {u64, u64, u64 } }, var_a
init_aggr v_ptr_var_a { u64 v_c0, [u64; 10] v_val_arr, { u64, u64, u64 } v_val_struct_c, { u64, u64, u64 } v_val_tuple }
The analysis would search for the outermost init_aggr and walk upwards from it, inspecting the initial values and whether they are constants, results of other init_aggrs, regular values, etc.
Open Questions
- Do we want to have init_aggr? I don't see now how a general analysis could be written on the current low-level initial IR, accurately and simply, for an arbitrary case of initialization.
- What should the instruction look like exactly? Do we have a single init_aggr or different ones for arrays, structs/tuples, and enums?
- One more possible optimization to think of is when all parts are constants but not necessarily zeros. Do we want to emit a constant in the data section and just memcopy it? In the case of larger aggregates this could pay off. How does this overlap with the const fn we want to have?
Initial feedback from @vaivaswatha and my remarks:
the proposal is to introduce a higher level IR instruction, init_aggr, that represents the high-level concept of an aggregate initialization
Just an initial comment, without a lot of contemplation: LLVM achieves this via constants that can be aggregates (among other valid types). So there is ConstantAggregate, and then ConstantAggregateZero etc.
The disadvantage of this approach is that we would have an IR instruction that cannot be lowered down to ASM directly. In other words, the optimization step would be mandatory
While I agree with what you state, I don't see it that way. We could "simply lower" the instruction to ASM, and that lowering can actually be "optimal". i.e., we don't really need to "optimize" anything. Just translate this instruction right during asm gen.
LLVM achieves this via constants that can be aggregates (among other valid types). So there is ConstantAggregate, and then ConstantAggregateZero etc.
init_aggr is not limited to constants only. The original examples contained only constants, which was indeed misleading. I've updated the description and examples to emphasize that the values passed to init_aggr can be any registers, not only constants.
Regardless of that, it makes sense to take a closer look at ConstantAggregate/Zero and see if we can borrow from that.
We could "simply lower" the instruction to ASM, and that lowering can actually be "optimal". i.e., we don't really need to "optimize" anything. Just translate this instruction right during asm gen.
I also thought about this but wasn't much in favor of it, because it would essentially mean moving what we now have in AST-to-IR into IR-to-ASM. Unless we do it really naively, we would still need to translate array inits as loops, which is much more difficult to express in ASM. I thought of "cheating" by having init_aggr first translated to today's lower IR during ASM generation, but this is all messy and mixes concerns. Besides, there is the question of whether not having init_aggr optimized away is actually a compilation error, if we see it as a higher-level, quasi-MLIR instruction that must be lowered.
Do we want to have init_aggr? I don't see now how a general analysis could be written on current low-level initial IR accurately and simply and for an arbitrary case of initialization.
I think we should. It ensures that the initialization details aren't lost in the IR.
What should the instruction look like exactly? Do we have a single init_aggr or different ones for arrays, structs/tuples, and enums?
I think a single instruction is sufficient. An accompanying type can describe the shape. But, going into the details of your example of initializing
A {
a: 0,
b: [0; 10],
c: C { x: 0, y: 0, z: some_fn() },
d: (0, 0, 0),
}
It would be better if the entire AST initializer were translated to a single init_aggr instruction that contains all of the data here, rather than splitting it into multiple init_aggrs and then having an analyser scan backwards to gather the details.
init_aggr v_ptr_var_a [0, [0, ... 0], [0, 0, v_fn_result], [0, 0, 0]]
The initializer itself contains arbitrarily nested arrays of values (constants / SSA registers). The shape of that can be verified in the verifier against the type of the aggregate being initialized.
So in InstOp, we can have a new variant entry as AggrInit(Value, Initializer) where the first element is a pointer to the local being initialized and Initializer can be:
enum Initializer {
Value(Value),
Aggregate(Vec<Initializer>)
}
We can run our heuristics / analysis on just Initializer to generate the best code to initialize the aggregate. If unoptimized, it'll reach ASM gen, and ASM gen will just generate a flattened sequence of stores.
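As a rough illustration of such an analysis, here is a hypothetical Rust sketch of an all-zero check over a simplified Initializer. The Value type is reduced here to a constant-or-register distinction; in the real IR it would be an SSA value handle resolved through the IR context:

```rust
// Hypothetical sketch; `Value` is simplified to constant-or-register.
enum Value {
    Const(u64),
    Register, // any non-constant SSA value
}

enum Initializer {
    Value(Value),
    Aggregate(Vec<Initializer>),
}

/// True if every leaf is a known zero constant, in which case the whole
/// `init_aggr` can lower to a single `mem_clear_val`.
fn is_all_zero(init: &Initializer) -> bool {
    match init {
        Initializer::Value(Value::Const(c)) => *c == 0,
        Initializer::Value(Value::Register) => false,
        Initializer::Aggregate(elems) => elems.iter().all(is_all_zero),
    }
}
```

The recursion mirrors the nesting of the initializer itself, so arbitrarily deep aggregates need no special handling.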
It would be better if the entire AST initializer is translated to a single init_aggr instruction, that contains all of the data here, rather than splitting into multiple init_aggr and then having an analyser scan backwards to gather the details.
I don't see benefits here; on the contrary, it will just bring more complexity and provide only a "visual niceness", and that only in the less common case of nested aggregates. Let me explain.
Firstly, let's consider this case:
let s = S {
a: 0,
b: if some_fact {
some_function(x)
} else if some_other_fact {
(1, 2, 3)
}
};
init_aggr for s must have Initializer::Value for the second argument, which represents the phi of the if:
init_aggr v_ptr_s [0, v_phi_of_if]
Still, when we optimize, we want to write (1, 2, 3) into the enclosing aggregate (likely as individual stores, or as a memcpy from a global constant if we decide to go with open question 3) and not into a temporary that will be memcopied into the main aggregate. memcpyopt will not do that because of aliasing/overlapping. This is the advantage of init_aggr: we know its semantics and can safely optimize more.
Since we want to optimize, we will anyhow have to follow the Initializer::Value v_phi_of_if to see if there is an initializer behind it. In that case, the distinction between Initializer::Value and Initializer::Aggregate just makes the traversal more complex, because we need to match on both variants but still traverse both.
Secondly, the AST-to-IR translation suddenly becomes unnecessarily complicated. We must implement special cases over several different expressions (structs, tuples, arrays, enums). Moreover, every expression must pass along the information of whether it is embedded in an aggregate initialization expression, to know if it should emit an IR Value or return the Initializer to the parent expression. The complexity, and for sure the "ugliness", of this code will be considerable, especially compared with the alternative of a straightforward, local emitting of an IR Value like we have now, with zero changes anywhere except in the translation of aggregates to IR, where the code will again always be localized.
Last but not least, analyzing init_aggrs and constructing the initialization closures is a fairly trivial bottom-up instruction traversal, especially compared to the change in the AST-to-IR translation required above. (Actually, I am thinking now, it might be that it takes less time to write that traversal than this long argument for having it 😄 Just kidding!)
One important thing, as indicated in this line in the description, we will need to support expressing repeated arrays:
v_ptr_arr_2 = init_aggr v_ptr_arr { u64 v_c0 x v_c10 }
// Or maybe: init_aggr v_arr [u64 v_c0 x v_c10]. Or even: init_arr v_arr [u64 v_c0; v_c10].
Regardless how we print it, we will very likely have something like:
enum Initializer {
Value(Value),
Repeat(Value, u64)
}
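Under the same simplified constant-or-register assumption as before, the Repeat variant folds into a zero analysis naturally: a repeat is all-zero iff its repeated value is the zero constant, regardless of the element count. A hypothetical sketch:

```rust
// Hypothetical sketch of a zero check over the `Repeat` variant;
// `Value` is simplified, and in the real IR it would be an SSA handle.
enum Value {
    Const(u64),
    Register, // any non-constant SSA value
}

enum Initializer {
    Value(Value),
    Repeat(Value, u64), // the repeated value and the element count
}

/// A repeat of a zero constant is all-zero regardless of the count,
/// so it can lower to a single `mem_clear_val`.
fn is_zero_init(init: &Initializer) -> bool {
    match init {
        Initializer::Value(Value::Const(c)) => *c == 0,
        Initializer::Repeat(Value::Const(c), _count) => *c == 0,
        _ => false,
    }
}
```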
If unoptimized, it'll reach ASM gen, and ASM gen will just generate a flattened sequence of stores.
I am indecisive on this one. We can do it that way for sure. The benefit is, the complete IR instruction set is translatable to ASM, no special cases.
The conceptual counterargument is that if we semantically treat init_aggr as a "higher level IR", we must lower it in the IR first. Entering philosophical waters now, I know 😄
The pragmatic counterargument is that we would be investing effort into writing lowering code that we don't expect to use.
Perhaps starting with a todo! or InternalCompilerError in ASM lowering would be just fine.
Since we want to optimize, we will anyhow have to follow the Initializer::Value v_phi_of_if to see if there is an initializer behind it. In which case the distinction between Initializer::Value and Initializer::Aggregate just makes the traversal more complex because we need to match but still traverse both variants.
We cannot optimize the initialization in this example. It'll just be init_aggr v_ptr_s [0, v_phi_of_if], with no traversing. It'll result in two stores for the initialization, whichever representation we choose. In what I propose, we will never traverse backwards from either of the variants.
Secondly, AST to IR translation suddenly becomes unnecessary very complicated.
I haven't looked into this, but I assumed that all this information is already part of the AST. So if the initializer is an aggregate (at any nesting) we retain that information, and if it's not, we emit instructions for it and use the SSA value as the initializer.
analyzing init_aggrs and constructing the initialization closures is a fairly trivial bottom-up instruction traversal.
I can definitely give it a try, but this is again "losing information and then computing it back".
we will need to support expressing repeated arrays:
The Initializer I suggested captures this naturally. More importantly, when we do the analysis over sequences of init_aggrs as you suggest, we will ultimately build the same data structure that I suggested.
So it all comes down to: will the translation from AST->IR to the Initializer that I suggested complicate the translation? If it does, then we're better off not doing that and instead recomputing that information from the IR. If it doesn't, then we should just retain all that info.
this is again "losing information and then computing it back"
This is a great point. I understand better now why you intended originally for the unified representation in the IR.
The paradox is that we actually don't have this information in the AST ready 😉 It is in the AST tree for sure, but not ready at hand; we have to calculate it. It sounds counter-intuitive, I know, but the changes needed in all the AST-to-IR lowering mentioned above would actually do exactly that: give us the complete information about aggregate initialization, so that we can store it in a single structure.
The information is not there in the AST, it needs to be created based on the information in the AST. And looking at both sides, AST and IR, it is easier and cleaner to build that information in the IR.
Below I am sketching this in more detail. Essentially, AggrInit(Value, Vec<Initializer>) already contains the structural information. The only thing that the initial analysis actually needs to do is find the root initializations, those not contained in any others, which is a fairly trivial traversal. Once those root AggrInits are identified, the structural information is there.
We cannot optimize the initialization in this example. It'll just be init_aggr v_ptr_s [0, v_phi_of_if], and no traversing. It'll result in two stores for the initialization, whichever representation.
I am not sure I understand why we cannot optimize in this case. In the optimization algorithm, where we just follow the AggrInit initializers' values back, there is no difference between:
init_aggr v_ptr_s [0, [1, 2, 3]]
and
v_phi_of_if = phi <fn call> or (1, 2, 3)
init_aggr v_ptr_s [0, v_phi_of_if]
In both cases, when 1, 2, 3 is encountered and the optimization decides to lower it into three stores directly into the enclosing aggregate, the lowering algorithm does not know at all that the phi exists. It is only the traversal that knows it has to visit predecessors on block arguments.
I am not sure if I am conveying my thought clearly enough. If needed, I can sketch in more detail what I mean. But roughly, we have two passes: the first is a simple bottom-up search for initialization roots, AggrInits that are not enclosed in any other AggrInits. Once a root is found, the remaining structure is already fully there in the root itself, since it is AggrInit(Value, Vec<Initializer>). The optimization then just follows the root's initializers recursively, navigating the values, drawing various conclusions, and doing the lowering based on them: all zeros -> a single mem_clear_val, etc.
What I mean is that AggrInit(Value, Vec<Initializer>) already has all the structural information. Building it is essentially just a bottom-up traversal that finds the roots, which is much simpler than making the changes all over the AST.
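The root search can be sketched on a simplified model. AggrInit, Leaf, and the index-based value tracking below are all hypothetical simplifications of the real instruction set, used only to show the shape of the traversal:

```rust
use std::collections::HashSet;

// Simplified model of the bottom-up root search. Each `AggrInit`
// initializes one local; a leaf value may be the loaded result of
// another `AggrInit` (tracked here by index for simplicity).
enum Leaf {
    Scalar,
    FromInit(usize), // value produced by another AggrInit
}

struct AggrInit {
    leaves: Vec<Leaf>,
}

/// A root is an AggrInit whose result is not consumed by any other AggrInit.
fn find_roots(inits: &[AggrInit]) -> Vec<usize> {
    let consumed: HashSet<usize> = inits
        .iter()
        .flat_map(|init| init.leaves.iter())
        .filter_map(|leaf| match leaf {
            Leaf::FromInit(idx) => Some(*idx),
            Leaf::Scalar => None,
        })
        .collect();
    (0..inits.len()).filter(|i| !consumed.contains(i)).collect()
}
```

Once the roots are found, each root already carries the full nested structure, so the optimization can recurse on it directly.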
In what I propose, we will never traverse backwards, from either of the variants.
I am not sure I fully understand this. I assume it's just how we perceive "backwards". In the proposed:
enum Initializer {
Value(Value),
Aggregate(Vec<Initializer>)
}
when we encounter Aggregate, we have to traverse its Vec<Initializer>. We can see this as going "inwards", but it is the same as traversing "backwards" through the expressions that must be evaluated before the current initialization.
Still, when we optimize, we want to write (1, 2, 3) into the enclosing aggregate
In both cases, when 1, 2, 3 is encountered and the optimization decides to lower it into three stores directly into the enclosing aggregate, the lowering algorithm
I don't understand this. The value to be initialized is "either the result of the fn call, or (1, 2, 3)". So how can we just store (1, 2, 3) into the aggregate when initializing it?
enum Initializer {
Value(Value),
Repeat(Value, u64)
}
Do we need the Repeat? Initializing an array with all zeros is the same as initializing a struct with all-zero members. We can just see (after the initial analysis) that each individual initializer is zero and then decide what to do. Is there a need to special-case array initialization over struct initialization?
The information is not there in the AST, it needs to be created based on the information in the AST. And looking at both sides, AST and IR, it is easier and cleaner to build that information in the IR.
If it isn't there "ready" in the AST, then I agree: we can lower it as you suggest and then build back, by scanning through the IR, a structure similar to the Initializer that I suggested, for further analysis.
The value to be initialized is "either the result of the fn call, or (1, 2, 3)". So how can we just store (1, 2, 3) into the aggregate when initializing it?
To summarize our discussion based on the outcome of the PoC: cases like the one below can be handled in the general case:
S {
field: if some_fact {
some_function(x)
} else if some_other_fact {
(1, 2, 3)
} else {
<some other expression>
}
}
But after #7443, the only actual performance benefit would be in cases where all the branches are init_aggrs. E.g.:
S {
field: if some_fact {
(1, 1, 1)
} else if some_other_fact {
(1, 2, 3)
} else {
(3, 3, 3)
}
}
On the other hand, the solution involves phi elimination, which is in general a tricky thing to get right. Although in this case it might look like a trivial elimination, the experience we have had in the past with debugging and fixing phi-elimination-related issues suggests that the effort and risk are not worth the benefits, which could be gained only in one particular corner case.
The decision is therefore not to go the way of phi elimination and not to follow block arguments.
For the record, the PoC code became much simpler once we abandoned supporting this corner use case.
Big thanks to @vaivaswatha for helping clarify this!