Halide
Halide copied to clipboard
Add storage folding support to buffers
In another project, I found it pretty useful and interesting to support fold factors on buffers directly, as opposed to adding modulo operations around coordinates in the code itself. Instead of dimensions of buffers being {min, extent, stride}, they would become {min, extent, stride, fold_factor}. The address computation would go from something like:
addr_of_x = base + sum((x_d - min_d) * stride_d)
to:
addr_of_x = base + sum(((x_d - min_d) % fold_factor_d) * stride_d)
Of course, some care would need to be taken for performance. I think this would need something similar to strides: the default stride of dimension 0 is 1, so vectorization doesn't turn into a gather by default. Similarly, the fold factor of dimension 0 should probably default to some value indicating "no folding".
However, for most dimensions, folding is nearly free, due to being computed in outer loops.
@abadams and @steven-johnson both noted that halide_dimension_t
already has a flags field that is currently unused that could be used to support this parameter: https://github.com/halide/Halide/blob/22f9bb9247b3e384bbd9d8e7ff96501a29b49265/src/runtime/HalideRuntime.h#L1470-L1471
It might be useful and worth adding this to Halide. It would make it easier to plug Halide pipelines into other systems that work with storage folded buffers. Currently, the only option to do this would be to add a new parameter to the pipeline, and fold the dimension manually, by modding the coordinates in the code explicitly. However, doing this makes the bounds of the pipeline harder to reason about.
The use case for this is suppose you have two Halide pipelines A and B, where B is a stencil, and consumes A. You want to run the pipeline A -> B on a large buffer of size W x H. To do this, you can either do something like this:
allocate temp(W, H)
A(input, temp)
B(temp, output)
But this has poor locality and uses a lot of memory. You can instead do something like this, just like if this was all one pipeline, and used sliding window and storage folding in the schedule:
allocate temp(W, {H, 3}) // 3 is the fold factor of this dimension.
for y in [-2, H]:
temp_y = crop_dim(temp, 1, [y + 1, y + 1]) // make a buffer referencing the row y + 1 only
A(input, temp_y);
if (y >= 0) {
output_y = crop_dim(output, 1, [y, y])
B(temp, output_y)
}
}
This way, the intermediate buffer between the two pipelines is only 3 lines instead of H. Building storage folding into the buffer can hide all of the necessary address arithmetic needed to support this.
Worth keeping in mind that on many platforms you can fake this with memory mapping tricks (mapping multiple addresses to the same page). The buffer allocator needs to do the fancy mapping, but the buffer consumer doesn't need to know about folding or do any modulos.
Yes, although that approach is quite limiting (you can only fold on page aligned boundaries).
I raised it because it's the only method I know of that doesn't require any modifications on the consumer side, so we could potentially use it for existing libraries where we have no ability to modify the source.