
Add support for encoding-specialized tensor dimensions.

Open benvanik opened this issue 7 months ago • 5 comments

As part of the split-k work @MaheshRavishankar is cooking (https://hackmd.io/@nrClt5OeTCODL2pAtlBVvg/BJkmLyrzxl), the concrete need for target-specific tensor dimensions has finally come forward. It sounds like there are other potential uses for this (@qedawkins mentioned stream-k). With the recent encoding attribute interface work from @hanhanW we have the ability to specialize storage size based on logical tensor dimensions, but this work entails specializing the logical tensor dimensions as well.

A (probably) representative example of this extracted from the split-k design doc:

%splitk_empty = iree_linalg_ext.empty(%dim)
    : tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>
%splitk_val = tensor.dim %splitk_empty, %c1 : tensor<?x?xf32>
%splitk_init = linalg.fill ins(%cst : f32)
    outs(%splitk_empty : tensor<?x?xf32>) -> tensor<?x?xf32>
%dispatch0 = flow.dispatch.region[%d0, %d1, %splitk_val]
    -> (tensor<?x?xf32>{%d0, %splitk_val}) {

Here the %splitk_val logical tensor dimension is based on whatever target specializes the #iree_encoding.splitk attribute during IREE::Stream::SpecializeEncodingsPass. Since the synthesized dimension is captured and propagated through workloads and into dispatch regions we need an SSA value at the flow level (or potentially higher) that must be tied to where the dispatch actually executes later on (its assigned affinities).

There are 3 pipeline stages we have to go through to handle the proposed iree_linalg_ext.empty op:

  1. Convert to flow.tensor.empty with captured SSA values for each dynamic dimension
  2. Convert to stream.tensor.sizeof + stream.tensor.empty with the same captured SSA values for each dynamic dimension
  3. Specialize both stream ops using the encoding attribute in the SpecializeEncodingsPass based on target affinities

We must have the SSA values for the dynamic dimensions to satisfy the verifiers, canonicalizers/folders, and passes that use the IREE::Util::ShapeAwareOpInterface. Making them support missing dynamic dimensions (those that cannot resolve to an SSA value without a tensor.dim) isn't something that we can really do.

If we can make (2) happy (the flow->stream conversion) then (3) solves itself. My initial thought is to add a new stream.tensor.shapeof op that given an encoding + available dynamic dimensions returns all the dynamic dimensions. Since we assign affinities during the flow->stream conversion already we can have a consistent assignment between the decomposed ops:

%dim0_again, %splitk_val = stream.tensor.shapeof on(#some.device) tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>{%dim0}
%resource_size = stream.tensor.sizeof on(#some.device) tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>{%dim0_again, %splitk_val}
%resource = stream.tensor.empty on(#some.device) : tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>{%dim0_again, %splitk_val} in !stream.resource<*>{%resource_size}

The only op that needs to be aware of partially-available dynamic dimensions is the stream.tensor.shapeof op, and it does not need to implement the IREE::Util::ShapeAwareOpInterface (since it's a shape producer, explicitly about resolving dimensions). We'll need something on some encoding dialect interface that allows for a "tell me how many dynamic dimensions you need vs. how many you produce" query. The #iree_encoding.splitk attribute (and ones like it) would implement the interface and, based on its dim value and which dimensions of the tensor type are dynamic, return the appropriate values. In this example, since dim 1 is the split dimension, dimension 0 is required (%dim0) and 1 additional dimension is produced. This would let us verify the shapeof op and support arbitrary mappings (if we wanted targets to be able to transpose/etc). There's a chance of such generic behavior pessimizing allocation packing and such, but it's fun to think about.
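For instance (purely illustrative - an encoding like this doesn't exist, it's only named here to show the shape of the query), an encoding that transposed the two dimensions would declare that it requires both original dynamic dims and produces both, just reordered:

// Hypothetical transposing encoding: requires 2 dynamic dims as inputs and
// produces 2 dynamic dims as results, in swapped order.
%d1_t, %d0_t = stream.tensor.shapeof on(#some.device)
    tensor<?x?xf32, #iree_encoding.transpose_example<permutation = [1, 0]>>{%d0, %d1}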

The shapeof op would return all dynamic dimensions of the ShapedType. This avoids leaking the logic as to which dimensions are pass-through and which are synthesized from the encoding: the result dimension count and order are always consistent with ShapedType::getNumDynamicDims/ShapedType::getDynamicDimIndex. Folders/canonicalizers can make that better for consumers by passing through known values and ensuring we have SSA value equality as early as possible (because so much relies on that).
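As a sketch of that folding (assuming a straightforward pass-through fold on the shapeof op), the dimension that was already provided collapses back to the original SSA value and only the synthesized dimension stays new:

// Before folding: both dynamic dims are returned by shapeof.
%dim0_again, %splitk_val = stream.tensor.shapeof on(#some.device)
    tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>{%dim0}
// After folding: uses of %dim0_again are rewritten to %dim0, restoring SSA
// value equality for the pass-through dimension; %splitk_val remains the
// only newly synthesized value.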

We could probably get way more complex logic on the dimension mapping (and likely will), and having the shapeof op be the only place we handle the various ways of synthesizing dynamic dimensions lets us contain that. We don't want stream.tensor.sizeof to support this behavior: those two ops are how one returns information from targets/encodings for shape and size respectively. From an efficiency standpoint, having shapeof compute information that sizeof then uses could lead to better CSE and avoids the encoding attr implementation having to repeat IR emission; if we had sizeof also emit IR for the synthesized dimension as part of the size calculation we'd have to trust that it generates IR in both routines that perfectly CSEs, and we'd be paying 2x the cost for the IR emission. shapeof also gives a good singular place to return a new dimension that we'll use for all ops that require the dimension - if shapeof and sizeof each derived dimensions separately then ops may end up with a mix of encoded (from shapeof) and unencoded (from the sizeof operand dims) dimensions, and that prevents us from doing the more complex remappings/adjustments to dimensions.

That solves the lowering into stream and specialization, so only (1) remains: iree_linalg_ext.empty -> flow.tensor.fancy_partial_dim_encoded_empty. @MaheshRavishankar's design elegantly solves the tied shape issue in the linalg-ext op: the tensor returned from the op is always valid and tensor.dim can always be performed on it to get the SSA value. The dim op is always dependent on the tensor, and wherever that tensor-producing empty op goes it follows. To lower into stream we need to perform an affinity analysis that tracks that tensor value's consumption - tensor.dim ops are not tracked (they just return integers) and won't be associated with the producer dispatch affinity.

We could either make flow.tensor.empty/flow.tensor.alloca support the encoding interface à la stream.tensor.shapeof or add a new op that does the same thing. The tricky part about making flow.tensor.empty behave differently is that we need to return the new SSA values for each synthesized dimension (%empty_tensor, %splitk_val = flow.tensor.empty : tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>{%dim0}). I don't recall how much I coded assuming that wouldn't be the case, though. It's probably the cleanest solution if it doesn't require a bunch of special casing everywhere. An alternative is to make a new op that does the same thing and add an IREE::Flow::TensorAllocaOpInterface to handle most things for empty/alloca/fancy_empty. It may mostly depend on the naming: what would we call the flow.tensor.empty_tensor_and_its_synthesized_dims? Some more examples of where we'd use this may help figure that out - not that names can't be changed, but not being able to name something is a yellow flag for not understanding the problem :).
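For concreteness, a sketch of what the multi-result option could look like for the split-k example above (the exact flow-level op name/shape is still open, this is just one of the two choices being discussed):

// Input form, from the split-k design doc:
%splitk_empty = iree_linalg_ext.empty(%dim)
    : tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>
%splitk_val = tensor.dim %splitk_empty, %c1 : tensor<?x?xf32>
// After conversion: the flow-level op returns the synthesized dimension as a
// second result so it can be captured by workloads/dispatch regions, and the
// tensor.dim is replaced with that result.
%empty, %splitk_val_flow = flow.tensor.empty
    : tensor<?x?xf32, #iree_encoding.splitk<dim : 1>>{%dim}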

Tasks:

  • [ ] Enumerate possible uses for the feature (split-k, stream-k, etc)
  • [ ] Sketch iree_encoding attr interface for synthesized dimensions
  • [ ] Resolve flow.tensor.? op choice
  • [ ] Implement!

benvanik - Jun 05 '25 02:06

Enumeration of the algorithms I can see us needing this for:

  1. Split-k

Described above

  2. Stream-k

Stream-k has the same need as split-k for extra memory for storing intermediate accumulators; however, instead of splitting reduction dimensions based on a constant factor, splits are computed dynamically as a function of the hardware parallelism. So while split-k on a matmul needs tensor<M x N x (K ceildiv SplitFactor)> scratch space, stream-k needs tensor<Mtile x Ntile x f(device)> (typically f(device) ~ # workgroup processors), where Mtile and Ntile are the tile sizes processed by a workgroup.

  3. Avoiding recompute of expensive operations on certain architectures

#17469 is a deep dive on the tradeoffs of recomputing exponentials for softmax (i.e. deciding to fuse vs not). In the same vein as stream-k, we could cache the intermediate computation in a per-workgroup scratch of size tensor<(tile size) x (# logical workgroups)>. By picking a number of workgroups based on the hardware parallelism (something like 2x the number of cores) and having the compiler give each workgroup (total # tiles / # logical workgroups) tiles, we can get some benefit out of the softmax fusion even for targets that really don't want to compute exponentials twice.

If I'm reading this right, this is similar to --task_worker_local_memory, but the amount allocated is set in the compiler based on the number of logical workgroups rather than the number of physical ones.
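A very rough sketch of what that scratch allocation could look like (the #iree_encoding.workgroup_scratch attribute and its parameter are made up purely for illustration):

// Hypothetical: the second dimension is synthesized by the target as the
// logical workgroup count (e.g. 2x the number of cores); the tile size is the
// only dynamic dim the caller has to provide.
%softmax_scratch = iree_linalg_ext.empty(%tile_size)
    : tensor<?x?xf32, #iree_encoding.workgroup_scratch<multiplier = 2>>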

qedawkins - Jun 05 '25 14:06

Do you think you can sketch out what an iree_linalg_ext.empty with an encoding would look like? The Mtile/Ntile thing is interesting and I think I understand it to be the case I mention above where all the dimensions change, which means we definitely want to go the route of shapeof returning all dynamic dimensions (not just substituting the ones not provided originally). e.g. %Mtile, %Ntile, %CUs = stream.tensor.shapeof on(#some.device) tensor<?x?x?xf32, #iree_encoding.streamk<...>>{%M, %N} or something?

benvanik - Jun 05 '25 14:06

Still thinking it through, but it depends on how much we want to specialize tile sizes. For example, stream-k almost always wants to pick the same tile sizes, in which case we don't even need anything other than the device:

%streamk_empty = iree_linalg_ext.empty() : tensor<?x?x?xf32, #iree_encoding.streamk<element types>>

We could also choose to go maximally verbose and pass the full iteration space to specialize the tile sizes.

%matmul = linalg.matmul
%unfolded_iteration_space:n = linalg_ext.get_iteration_space %matmul
%streamk_empty = iree_linalg_ext.empty(%unfolded_iteration_space:n) : tensor<?x?x?xf32, #iree_encoding.streamk<element types>>

Note that the rank of the encoded tensor is always 3-d, regardless of the rank of the matmul (i.e. outer dimensions don't matter, they just contribute to the total tile count). Also note that it's possible for this tensor to resolve to size 0 (if the tile count == #CUs).

qedawkins - Jun 05 '25 15:06

You'd always need to pass something if there are dynamic dimensions (we need the inputs to whatever interface function returns the dimensions), as the returned tensor has to have a shape. I don't have the right names for things, but similar to dispatches, where we pass a workload that later gets transformed into a workgroup count, this kind of does the same: the SSA values passed to the empty op are just inputs to the encoding interface function that produces the dimensions for the type it's returning. If padding, for example, the operands would be the original dimensions and the results would be the padded ones. The encoding attribute has to carry enough information to know how many operands it needs and how many results it produces. The operands probably aren't associated at all with the base ShapedType and are instead whatever the encoding carries - tensor<?x?x?x?x?xf32, #some_encoding_derived_from<tensor<?x?xf32>>> would take two operands (dim0/dim1 of the original tensor) and produce 5 result dimensions (one per ShapedType dimension) when queried.
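Concretely, a sketch of that in IR (reusing the hypothetical #some_encoding_derived_from attribute from above):

// At the flow/linalg-ext level: the empty op takes the two original dims as
// operands.
%enc_empty = iree_linalg_ext.empty(%d0, %d1)
    : tensor<?x?x?x?x?xf32, #some_encoding_derived_from<tensor<?x?xf32>>>
// At the stream level: the same interface query expands them into all five
// dynamic dims of the encoded type.
%r0, %r1, %r2, %r3, %r4 = stream.tensor.shapeof on(#some.device)
    tensor<?x?x?x?x?xf32, #some_encoding_derived_from<tensor<?x?xf32>>>{%d0, %d1}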

So all that to say, if we reason about it like that then we can support both your examples: it's up to the encoding attribute to decide what it takes as operands (and importantly, how to help whoever creates these empty ops get those required values... somehow :)

benvanik - Jun 05 '25 16:06

> The encoding attribute has to carry enough information to know how many operands it needs and how many results it produces. The operands probably aren't associated at all with the base ShapedType and are instead whatever the encoding carries

Exactly. Maybe I got us off track, but

> My initial thought is to add a new stream.tensor.shapeof op that given an encoding + available dynamic dimensions returns all the dynamic dimensions.

This sounds sufficient for all cases to me (dynamic dims map surjectively, bijectively, or injectively); the encoding will be responsible for the mapping.

qedawkins - Jun 05 '25 16:06