Halide icon indicating copy to clipboard operation
Halide copied to clipboard

Add storage folding support to buffers

Open dsharlet opened this issue 1 year ago • 3 comments

In another project, I found it pretty useful and interesting to support fold factors on buffers directly, as opposed to adding modulo operations around coordinates in the code itself. Instead of dimensions of buffers being {min, extent, stride}, they would become {min, extent, stride, fold_factor}. The address computation would go from something like:

addr_of_x = base + sum((x_d - min_d) * stride_d)

to:

addr_of_x = base + sum(((x_d - min_d) % fold_factor_d) * stride_d)

Of course, some care would need to be taken for performance. I think this would need something similar to strides: the default stride of dimension 0 is 1, so vectorization doesn't turn into a gather by default. Similarly, the fold factor of dimension 0 should probably default to some value indicating "no folding".

However, for most dimensions, folding is nearly free, due to being computed in outer loops.

@abadams and @steven-johnson both noted that halide_dimension_t already has a flags field that is currently unused that could be used to support this parameter: https://github.com/halide/Halide/blob/22f9bb9247b3e384bbd9d8e7ff96501a29b49265/src/runtime/HalideRuntime.h#L1470-L1471

It might be useful and worth adding this to Halide. It would make it easier to plug Halide pipelines into other systems that work with storage folded buffers. Currently, the only option to do this would be to add a new parameter to the pipeline, and fold the dimension manually, by modding the coordinates in the code explicitly. However, doing this makes the bounds of the pipeline harder to reason about.

The use case for this is suppose you have two Halide pipelines A and B, where B is a stencil, and consumes A. You want to run the pipeline A -> B on a large buffer of size W x H. To do this, you can either do something like this:

allocate temp(W, H)
A(input, temp)
B(temp, output)

But this has poor locality and uses a lot of memory. You can instead do something like this, just like if this was all one pipeline, and used sliding window and storage folding in the schedule:

allocate temp(W, {H, 3})  // 3 is the fold factor of this dimension.
for y in [-2, H]:
  temp_y = crop_dim(temp, 1, [y + 1, y + 1]) // make a buffer referencing the row y + 1 only
  A(input, temp_y);
  if (y >= 0) {
    output_y = crop_dim(output, 1, [y, y])
    B(temp, output_y)
  }
}

This way, the intermediate buffer between the two pipelines is only 3 lines instead of H. Building storage folding into the buffer can hide all of the necessary address arithmetic needed to support this.

dsharlet avatar Jan 18 '24 20:01 dsharlet

Worth keeping in mind that on many platforms you can fake this with memory mapping tricks (mapping multiple addresses to the same page). The buffer allocator needs to do the fancy mapping, but the buffer consumer doesn't need to know about folding or do any modulos.

abadams avatar Jan 19 '24 20:01 abadams

Yes, although that approach is quite limiting (you can only fold on page aligned boundaries).

dsharlet avatar Jan 19 '24 22:01 dsharlet

I raised it because it's the only method I know of that doesn't require any modifications on the consumer side, so we could potentially use it for existing libraries where we have no ability to modify the source.

abadams avatar Jan 19 '24 23:01 abadams