
[GlobalOpt] Add pass with basic global data layout propagation analysis

Max191 opened this issue 11 months ago • 14 comments

This adds a new pass that propagates data layouts all the way to mutable GlobalOps. For now it is just the basic analysis flow, only capable of handling some trivial cases, but it will be extended.

The pass analyzes the full program and finds subgraphs of values that can all share the same layout. The subgraphs are seeded on loads and stores of GlobalOps, and transformations are computed from the source layout to the layouts at propagation barriers. Then, if one of the barrier layouts is better than the original source layout, the globals are rewritten to that barrier layout and the new layout is propagated through the subgraph.
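For illustration, the kind of rewrite this enables looks roughly like the following (a hand-written sketch with made-up global names, shapes, and tile sizes, not actual pass output): a `tensor.pack` sitting on every load of a mutable global gets absorbed into the global's own storage layout.

```mlir
// Before: every load of the mutable global is immediately packed.
util.global private mutable @weight : tensor<128x256xf32>
func.func @forward() {
  %w = util.global.load @weight : tensor<128x256xf32>
  %init = tensor.empty() : tensor<16x32x8x8xf32>
  %packed = tensor.pack %w inner_dims_pos = [0, 1] inner_tiles = [8, 8]
      into %init : tensor<128x256xf32> -> tensor<16x32x8x8xf32>
  // ... consumers of %packed ...
  return
}

// After: the global itself is stored in the packed layout and the pack
// at the load site disappears.
util.global private mutable @weight.packed : tensor<16x32x8x8xf32>
func.func @forward() {
  %w = util.global.load @weight.packed : tensor<16x32x8x8xf32>
  // ... consumers of %w ...
  return
}
```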

This PR puts the full-program analysis in place, but does not implement any propagation or transformation logic for ops other than PackOp. Future PRs will extend this by adding propagation patterns and logic for computing transformations through other ops.

Max191 · Feb 27 '24 20:02

There is still a lot of code to add to actually implement the propagation piece. My plan is to make a new PR for each op that gets propagation support. For now, all I have locally is tensor.insert_slice and tensor.extract_slice, so I will make 2 more PRs on top of this one.
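As a reference point for what that propagation looks like, here is a hand-sketched example of walking a packed layout through `tensor.extract_slice` when the slice is aligned to the inner tiles (made-up shapes; this is not code from this PR or its follow-ups):

```mlir
func.func @slice_in_both_layouts(%value: tensor<128x256xf32>,
                                 %packed_value: tensor<16x32x8x8xf32>) {
  // Slice taken in the original layout.
  %slice = tensor.extract_slice %value[0, 0] [64, 256] [1, 1]
      : tensor<128x256xf32> to tensor<64x256xf32>
  // The equivalent tile-aligned slice in the packed layout (inner tiles
  // [8, 8]): the first 64 rows are the first 8 outer row-tiles.
  %packed_slice = tensor.extract_slice %packed_value[0, 0, 0, 0] [8, 32, 8, 8] [1, 1, 1, 1]
      : tensor<16x32x8x8xf32> to tensor<8x32x8x8xf32>
  return
}
```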

I also have some preprocessing patterns to add to this, and a pass to fold unit dims on global ops. I will make separate PRs for each of those as well. The global unit dim folding review can happen in parallel with this, so I'll open that now.

Max191 · Feb 27 '24 20:02

Love what it's doing, but terrified by the PackOp and static data layout coupling. We really are going to need to stop using PackOp at the global opt level - it's not good that it ever landed there, and I was promised we wouldn't be building out more stuff around it being there without a proper symbolic solution in place. If this is something where the PackOp use can be replaced with an interface and the data layout can be symbolic that's ok (stuff we can clean up without a rewrite) but otherwise it's tech debt and any code tied to it is going to end up getting thrown away or effectively forking the compiler. The litmus test is disabling MaterializeHomogeneousEncodings - if something doesn't work then it's tech debt. There's shades of tech debt and pragmatism but the more we build on this path the more we're forking and the longer it's going to take to recover. /cc @MaheshRavishankar

benvanik · Feb 27 '24 20:02

> Love what it's doing, but terrified by the PackOp and static data layout coupling. We really are going to need to stop using PackOp at the global opt level - it's not good that it ever landed there, and I was promised we wouldn't be building out more stuff around it being there without a proper symbolic solution in place. If this is something where the PackOp use can be replaced with an interface and the data layout can be symbolic that's ok (stuff we can clean up without a rewrite) but otherwise it's tech debt and any code tied to it is going to end up getting thrown away or effectively forking the compiler. The litmus test is disabling MaterializeHomogeneousEncodings - if something doesn't work then it's tech debt. There's shades of tech debt and pragmatism but the more we build on this path the more we're forking and the longer it's going to take to recover. /cc @MaheshRavishankar

One of the things I had in mind when writing this was to eventually extend it to propagation of encodings. I have left most implementation details semi-coupled to the PackOp, but I needed something concrete to go off of to start. Ultimately, the analysis flow should still work if we swap out the DataLayoutTransformation implementations in this PR with something less coupled to PackOp. The propagation implementations (which aren't part of this PR) are a little more tightly coupled specifically to PackOp, but I wasn't exactly sure how to get around that, since they have to rewrite the IR.

It would be great to get feedback on the analysis about core pieces that seem a little too coupled to static packing layouts, though, because I 100% agree that we should try not to dig this hole deeper.

Max191 · Feb 27 '24 20:02

> Love what it's doing, but terrified by the PackOp and static data layout coupling. We really are going to need to stop using PackOp at the global opt level - it's not good that it ever landed there, and I was promised we wouldn't be building out more stuff around it being there without a proper symbolic solution in place. If this is something where the PackOp use can be replaced with an interface and the data layout can be symbolic that's ok (stuff we can clean up without a rewrite) but otherwise it's tech debt and any code tied to it is going to end up getting thrown away or effectively forking the compiler. The litmus test is disabling MaterializeHomogeneousEncodings - if something doesn't work then it's tech debt. There's shades of tech debt and pragmatism but the more we build on this path the more we're forking and the longer it's going to take to recover. /cc @MaheshRavishankar

So, a couple of comments from my side:

  1. I don't know what MaterializeHomogeneousEncodings is doing. My sense is that it was a solution implemented to try to accommodate different feedback, and some of the ideas might have been lost in translation. I think it is only needed for the VMVX backend.
  2. Discussed this offline with Ben a bit. I have no idea how to do propagation on something as abstract as encodings (not saying it can't be done, just saying I don't know how to do it). It would essentially mean implementing the propagation logic here, but using a more abstract representation that I don't fully know how to work out. (Just FTR: I think such propagations only work on a class of models anyway, but that is an important class and we need to support it.) So in my mind this propagation is not the final state of things anyway. It is a semi-exploratory step for a class of models, to verify our understanding of what is missing. I fully expect this to need revisiting. (Not saying this is not a great step, just noting that I don't think it is the final state here; it is more of a path-finding exercise.)

MaheshRavishankar · Feb 29 '24 20:02

> I don't know what MaterializeHomogeneousEncodings is doing. My sense is that it was a solution implemented to try to accommodate different feedback, and some of the ideas might have been lost in translation. I think it is only needed for the VMVX backend.

IIUC, currently the most important thing MaterializeHomogeneousEncodings does is drop all encodings when we statically know that no target is going to set anything other than a default encoding. Not saying this is the end state we want, but if we drop the pass now there will be noticeable regressions on other (non-dt) backends, because there is no propagation of encodings and fusion of encodings isn't what it needs to be.
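In IR terms that means roughly the following (a schematic sketch in the same shorthand used later in this thread, with encoding attributes and operand types omitted): when no target will request anything beyond the default encoding, the set/unset pair folds away and non-data-tiling backends see a plain matmul.

```mlir
// With encodings set for data tiling (schematic; encoding attrs omitted):
%lhs_e = set_encoding %lhs
%rhs_e = set_encoding %rhs
%acc_e = set_encoding %acc
%mm_e  = linalg.matmul ins(%lhs_e, %rhs_e) outs(%acc_e)
%mm    = unset_encoding %mm_e

// After dropping the (default-only) encodings:
%mm = linalg.matmul ins(%lhs, %rhs) outs(%acc)
```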

I think the litmus test for whether we can drop that pass is: do we generate (almost) exactly the same code with and without it for a target that does not use encodings?

qedawkins · Feb 29 '24 20:02

>> I don't know what MaterializeHomogeneousEncodings is doing. My sense is that it was a solution implemented to try to accommodate different feedback, and some of the ideas might have been lost in translation. I think it is only needed for the VMVX backend.

> IIUC, currently the most important thing MaterializeHomogeneousEncodings does is drop all encodings when we statically know that no target is going to set anything other than a default encoding. Not saying this is the end state we want, but if we drop the pass now there will be noticeable regressions on other (non-dt) backends, because there is no propagation of encodings and fusion of encodings isn't what it needs to be.
>
> I think the litmus test for whether we can drop that pass is: do we generate (almost) exactly the same code with and without it for a target that does not use encodings?

Ok, my bad, I was thinking of a different pass (I think MaterializeUpperBoundTileSizePass). This makes more sense now. So this was added so that you could use const-eval to fuse some packs with globals. In my mind, to make this all work without too fragile a pass, const-eval and this pack propagation pass need to move to Stream, potentially after things have been mapped to devices. I understand how that makes things difficult: you have already formed dispatches, etc., and it makes pack propagation effectively a global data-flow pass. It is a much more well-defined path, but that is in the "just need to write code" category. Propagation on encodings seems harder and much more fragile.

MaheshRavishankar · Feb 29 '24 21:02

> In my mind, to make this all work without too fragile a pass, const-eval and this pack propagation pass need to move to Stream, potentially after things have been mapped to devices. I understand how that makes things difficult: you have already formed dispatches, etc., and it makes pack propagation effectively a global data-flow pass. It is a much more well-defined path, but that is in the "just need to write code" category. Propagation on encodings seems harder and much more fragile.

Ok I'm definitely missing details around how that would work on Stream (especially w/ padding), but that does sound like a nice forcing solution for all of the fusion concerns.

qedawkins · Feb 29 '24 21:02

Yeah, I've got some ideas - TLDR is that encodings should be SSA values associated with tensors, just like dynamic dimensions: just as we propagate/fold/inline dynamic dimensions across the program we can propagate encodings, and only after that's all done bake them out in codegen and synthesize builtins for repacking if needed. That's why I think what's being done here is not durable, but I don't think that should block it landing to get us to a better state with the current non-durable MaterializeHomogeneousEncodings pass - they'll likely both go away together.

benvanik · Feb 29 '24 21:02

> Yeah, I've got some ideas - TLDR is that encodings should be SSA values associated with tensors, just like dynamic dimensions: just as we propagate/fold/inline dynamic dimensions across the program we can propagate encodings, and only after that's all done bake them out in codegen and synthesize builtins for repacking if needed. That's why I think what's being done here is not durable, but I don't think that should block it landing to get us to a better state with the current non-durable MaterializeHomogeneousEncodings pass - they'll likely both go away together.

Would love to have a chat about that or see a proposal for how that works, because I'm not seeing how it connects without looking inside the contents of the executables (and potentially replicating executables with different encodings if there is a conflict). (Unless that's what you mean)

qedawkins · Feb 29 '24 21:02

Definitely don't need to peer inside executables - this is all feed-forward (symbolic encodings are assigned and kept consistent all the way until we know what devices they are used on). Think of it just like a Tensor type in a normal eager runtime: it stores shape, type, and encoding/strides, and it gets passed around so that different code acting on it can query those attributes and handle them like `if tensor_a.encoding == tensor_b.encoding: nop else: convert`. We can do the exact same thing in the compiler: knowing that `%tensor_a_encoding == %tensor_b_encoding` lets us elide repacking, and if they're not equal we know how to synthesize the repacking (moving across devices/etc). And just like at runtime you would have a `get_padding_for_encoding(tensor_a.encoding)`, we can have an op that does the same, perform the padding, and then fold it to a constant when we know what it is (or don't, and it's just a dynamic shape like any other).
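A purely hypothetical sketch of that shape, just to make the analogy concrete (every `encoding.*` op and the `!encoding` type below are invented for illustration; they are not existing IREE ops):

```mlir
// Encodings as SSA values, handled like dynamic dimensions.
%enc_a = encoding.of %tensor_a : !encoding        // invented op/type
%enc_b = encoding.of %tensor_b : !encoding        // invented op/type
%same  = encoding.equal %enc_a, %enc_b : i1       // invented op

// "if a.encoding == b.encoding: nop, else: convert", resolved
// symbolically in the compiler instead of at runtime.
%b_as_a = scf.if %same -> (tensor<?x?xf32>) {
  scf.yield %tensor_b : tensor<?x?xf32>
} else {
  %repacked = encoding.convert %tensor_b to %enc_a : tensor<?x?xf32>  // invented op
  scf.yield %repacked : tensor<?x?xf32>
}

// Padding queried like get_padding_for_encoding() at runtime, folding
// to constants once the target/device is known.
%pad0, %pad1 = encoding.padding %enc_a : index, index  // invented op
```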

I don't think this will end up being difficult, and for codegen it'll be the same (this will all happen before we run translation). Anything relying on packs and static shapes at the whole-program level, though, will have issues, and that's why we need to make sure we quarantine anything added that does so, for easy removal.

benvanik · Feb 29 '24 21:02

I'm either not understanding what "propagate encoding" means then, or I'm not understanding what we're expected to produce in Flow. Say I do something like this in Flow/GlobalOptimization:

```mlir
%0 = linalg.generic ... { arith.divf }
%1 = set_encoding %0
%2 = linalg.matmul %1
%3 = unset_encoding %2
```
which then forms dispatches approximately like the following:

```mlir
%1 = flow.dispatch {
  %0 = linalg.generic ... { arith.divf }
}
%2 = encoding_ssa_value
flow.dispatch %2 {
  %3 = linalg.matmul %1
}
```

Now I want to propagate the encoding on the matmul's input to its producer (after conversion to Stream and such); however, I can't do that, because the only valid padding value for the matmul is 0, and propagating would introduce a divide by 0 in the first dispatch.
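To spell the hazard out (my reading of the concern, in a minimal sketch): the matmul's padded region must hold 0, but materializing that padding inside the producer means the padded lanes run through the `divf`.

```mlir
func.func @padded_lane(%x: f32) -> f32 {
  // If the matmul's pad value (0.0) were pushed into the divf dispatch,
  // each padded lane would compute x / 0.0 = inf (or 0.0 / 0.0 = NaN)
  // rather than the 0.0 the matmul's padded region is required to hold.
  %pad = arith.constant 0.0 : f32
  %poisoned = arith.divf %x, %pad : f32
  return %poisoned : f32
}
```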

qedawkins · Feb 29 '24 22:02

The encoding should be propagated before then. Think of encoding changes like casts or reshapes. W.r.t. this PR, I think that's the interesting thing it's doing that we may be able to preserve.

benvanik · Feb 29 '24 22:02

Ok that makes sense then, but I'm still a little lost on details...

(not looking for a response now, can cross that bridge once we come to it, but the progress here is good!)

qedawkins · Feb 29 '24 22:02

Yep - details are what we have to work through - it won't be possible to fully articulate them without actually solving the problem :)

benvanik · Feb 29 '24 22:02