
Applying .split() to .fuse()'d vars produces strange results

Open • steven-johnson opened this issue 3 years ago • 1 comment

I suspect this is a case of "I'm using it wrong", so this may be more of a request for assistance than a bug report.

Let's say I have a densely packed interleaved RGB image (no padding) for which I'm computing a histogram. To maximize use of SIMD when loading the input pixels, I thought I could just fuse the vars together into a single one, then re-split that fused var into SIMD-register-sized chunks to process one at a time (with GuardWithIf in place to handle the tail). Here's a trivial example:

    Var x, y, c, z;

    ImageParam input(UInt(8), 3);

    Func input_pixel;
    input_pixel(x, y, c) = input(x, y, c);

    constexpr int hist_dims = 8;

    RDom r(0, input.width(), 0, input.height(), "r");

    Expr x_bin = cast<int>(input_pixel(r.x, r.y, 0)) * hist_dims / 256;
    Expr y_bin = cast<int>(input_pixel(r.x, r.y, 1)) * hist_dims / 256;
    Expr z_bin = cast<int>(input_pixel(r.x, r.y, 2)) * hist_dims / 256;

    Func hist_3d;
    hist_3d(x, y, z) = cast<int>(0);
    hist_3d(x_bin, y_bin, z_bin) += 1;

    // Schedule

    // Input is interleaved RGB
    input.dim(0).set_stride(3);
    input.dim(2).set_extent(3).set_stride(1);

    constexpr int vec_size = 16;

    RVar rxy, rxo, rxi;
    hist_3d.update(0)
        .fuse(r.x, r.y, rxy)
        .split(rxy, rxo, rxi, vec_size, TailStrategy::GuardWithIf);

    // The goal here is to load 3 SIMD registers' worth of input
    // at a time; the input image is assumed to be densely packed
    // with no padding, so we should be able to just process
    // nice chunks at a time (plus a tail handler).
    Var cx, cxy, cxyo, cxyi;
    input_pixel
        .store_in(MemoryType::Register)
        .bound(c, 0, 3).bound_extent(c, 3)
        .reorder(c, x, y).reorder_storage(c, x, y)
        .fuse(c, x, cx).fuse(cx, y, cxy)
        .split(cxy, cxyo, cxyi, vec_size * 3, TailStrategy::GuardWithIf)
        .vectorize(cxyi)
        .compute_at(hist_3d, rxo);

    // Note that the width here is *not* an exact multiple of 16
    Buffer<uint8_t, 3> input_buf = Buffer<uint8_t, 3>::make_interleaved(106, 120, 3);
    // contents unimportant
    input_buf.fill(0);

    input.set(input_buf);
    auto buf = hist_3d.realize({hist_dims, hist_dims, hist_dims});

Weirdness #1: This looks plausible to me, but running it fails with "Was unable to infer constant upper bound on extent of realization input_pixel. Use Func::bound_extent to specify it manually." This doesn't make much sense to me, because it seems the realized extent of input_pixel should unambiguously be 3 * vec_size, but OK, I'm missing something.

Weirdness #2: If you change input_pixel to use store_in(MemoryType::Stack) and run, it now fails with "Input buffer input is accessed at 238, which is beyond the max (119) in dimension 1." This is also kinda baffling: all the splits I see use GuardWithIf for the tail, and a cursory inspection of the pseudocode doesn't show any obvious (to my eye) way for such an overread to occur.
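For reference, the iteration that the fuse-then-split schedule above is meant to produce can be modeled in plain C++. This is only a hand-written sketch of the intended index arithmetic (fused index = x + y * width, then outer/inner chunks with an if-guard on the tail); it is not Halide's actual lowering, and the names simulate_fuse_split and FusedSplitStats are hypothetical helpers invented for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Summary of one simulated traversal of the fused-and-split loop nest.
struct FusedSplitStats {
    int64_t outer_extent;  // number of outer iterations: ceil(total / factor)
    int64_t visited;       // iterations that pass the tail guard
    int64_t guarded_out;   // iterations rejected by the tail guard
    int max_x;             // largest x recovered from the fused index
    int max_y;             // largest y recovered from the fused index
};

// Models fuse(x, y, xy) followed by split(xy, xo, xi, factor) with an
// if-guarded tail, written as ordinary loops. Hypothetical helper: an
// assumption about the intended behavior, not code generated by Halide.
inline FusedSplitStats simulate_fuse_split(int width, int height, int factor) {
    assert(width > 0 && height > 0 && factor > 0);
    const int64_t total = int64_t(width) * height;  // extent of the fused var
    FusedSplitStats s{(total + factor - 1) / factor, 0, 0, -1, -1};
    for (int64_t xo = 0; xo < s.outer_extent; xo++) {
        for (int64_t xi = 0; xi < factor; xi++) {
            const int64_t xy = xo * factor + xi;  // reconstruct the fused index
            if (xy >= total) {                    // tail guard
                s.guarded_out++;
                continue;
            }
            const int x = int(xy % width);  // inner of the two fused vars
            const int y = int(xy / width);  // outer of the two fused vars
            if (x > s.max_x) s.max_x = x;
            if (y > s.max_y) s.max_y = y;
            s.visited++;
        }
    }
    return s;
}
```

Under this model, a 106 x 120 image with a factor of 16 has a fused extent of 12720 = 795 * 16, so the guard never fires and y never exceeds 119; the same holds for the three-channel case with a factor of 48 (38160 = 795 * 48).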

steven-johnson, Mar 18 '22 00:03

Note: I accidentally omitted the stride specification for input that indicates it is dense:

    input.dim(1).set_stride(input.dim(2).extent() * 3);

...but adding that doesn't change either observed weirdness.
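To make the "dense interleaved" assumption concrete, here is a small plain-C++ sketch of the address arithmetic for such a buffer. It assumes the layout that Buffer<>::make_interleaved produces: a stride of channels along x, width * channels along y, and 1 along c. The InterleavedLayout struct is a hypothetical helper for illustration only, not part of Halide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper describing a densely packed interleaved image
// (no row padding). Strides are in elements, not bytes.
struct InterleavedLayout {
    int width;
    int height;
    int channels;

    int64_t stride_x() const { return channels; }                   // next pixel
    int64_t stride_y() const { return int64_t(width) * channels; }  // next row
    int64_t stride_c() const { return 1; }                          // next channel

    // Total number of elements in the buffer.
    int64_t size() const { return int64_t(width) * height * channels; }

    // Flat element offset of (x, y, c).
    int64_t offset(int x, int y, int c) const {
        assert(x >= 0 && x < width);
        assert(y >= 0 && y < height);
        assert(c >= 0 && c < channels);
        return x * stride_x() + y * stride_y() + c * stride_c();
    }
};
```

With width 106, height 120, and 3 channels, the row stride is 318 and the last element (105, 119, 2) sits at offset 38159, one less than the total size of 38160, so every flat index in [0, 38160) maps to exactly one in-bounds (x, y, c).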

steven-johnson, Mar 18 '22 01:03