Applying .split() to .fuse()'d vars produces strange results
I suspect this is a case of "I'm using it wrong", so this may be more of a request for assistance than a bug report.
Let's say I have a densely-packed interleaved RGB image (no padding) for which I'm computing a histogram. To maximize use of SIMD when loading the input pixels, I thought I could fuse the vars into a single one, then re-split that fused var into SIMD-register-sized chunks and process one chunk at a time (with GuardWithIf handling the tail). Here's a trivial example:
Var x, y, c, z;
ImageParam input(UInt(8), 3);
Func input_pixel;
input_pixel(x, y, c) = input(x, y, c);
constexpr int hist_dims = 8;
RDom r(0, input.width(), 0, input.height(), "r");
Expr x_bin = cast<int>(input_pixel(r.x, r.y, 0)) * hist_dims / 256;
Expr y_bin = cast<int>(input_pixel(r.x, r.y, 1)) * hist_dims / 256;
Expr z_bin = cast<int>(input_pixel(r.x, r.y, 2)) * hist_dims / 256;
Func hist_3d;
hist_3d(x, y, z) = cast<int>(0);
hist_3d(x_bin, y_bin, z_bin) += 1;
// Schedule
// Input is interleaved RGB
input.dim(0).set_stride(3);
input.dim(2).set_extent(3).set_stride(1);
constexpr int vec_size = 16;
RVar rxy, rxo, rxi;
hist_3d.update(0)
    .fuse(r.x, r.y, rxy)
    .split(rxy, rxo, rxi, vec_size, TailStrategy::GuardWithIf);
// The goal here is to load 3 simd registers worth of input
// at a time; the input image is assumed to be densely packed
// with no padding, so we should be able to just process
// nice chunks at a time (plus a tail handler)
Var cx, cxy, cxyo, cxyi;
input_pixel
    .store_in(MemoryType::Register)
    .bound(c, 0, 3).bound_extent(c, 3)
    .reorder(c, x, y).reorder_storage(c, x, y)
    .fuse(c, x, cx).fuse(cx, y, cxy)
    .split(cxy, cxyo, cxyi, vec_size * 3, TailStrategy::GuardWithIf)
    .vectorize(cxyi)
    .compute_at(hist_3d, rxo);
// Note that the width here is *not* an exact multiple of 16
Buffer<uint8_t, 3> input_buf = Buffer<uint8_t, 3>::make_interleaved(106, 120, 3);
// contents unimportant
input_buf.fill(0);
input.set(input_buf);
auto buf = hist_3d.realize({hist_dims, hist_dims, hist_dims});
Weirdness #1: This looks plausible to me, but running it fails with "Was unable to infer constant upper bound on extent of realization input_pixel. Use Func::bound_extent to specify it manually." This doesn't make much sense to me, because it seems the realized extent of input_pixel should unambiguously be 3 * vec_size, but OK, I'm presumably missing something.
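One possible explanation, offered as a guess rather than a confirmed diagnosis: if the region required of input_pixel is tracked as a per-dimension bounding box over (x, y, c) rather than as a range of the fused index, then a chunk of 16 fused indices that straddles the end of a row needs a box as wide as the whole row. The plain C++ sketch below (no Halide involved; `Box`, `chunk_bounding_box`, and `box_area` are hypothetical names) computes that box for one chunk, assuming the usual fuse convention of x = rxy % width and y = rxy / width.

```cpp
#include <algorithm>
#include <cassert>

// Inclusive, axis-aligned bounds in (x, y). Hypothetical helper type.
struct Box {
    int x_min, x_max, y_min, y_max;
};

// Bounding box of the (x, y) coordinates covered by chunk `rxo` of the
// fused index, i.e. fused indices [rxo * factor, rxo * factor + factor),
// clipped to the fused extent as a GuardWithIf tail would do.
Box chunk_bounding_box(int width, int height, int factor, int rxo) {
    assert(width > 0 && height > 0 && factor > 0 && rxo >= 0);
    const int first = rxo * factor;
    const int last = std::min(first + factor, width * height) - 1;
    assert(first <= last);  // the chunk must contain at least one index

    Box b{width, -1, height, -1};
    for (int rxy = first; rxy <= last; rxy++) {
        const int x = rxy % width;  // inner fused dimension
        const int y = rxy / width;  // outer fused dimension
        b.x_min = std::min(b.x_min, x);
        b.x_max = std::max(b.x_max, x);
        b.y_min = std::min(b.y_min, y);
        b.y_max = std::max(b.y_max, y);
    }
    return b;
}

// Number of (x, y) points inside the box.
int box_area(const Box &b) {
    return (b.x_max - b.x_min + 1) * (b.y_max - b.y_min + 1);
}
```

With width 106 and factor 16, chunk 0 stays inside row 0 and its box holds 16 points, but chunk 6 covers fused indices 96 through 111, which wrap from row 0 into row 1; its box spans x from 0 to 105 and y from 0 to 1, so its size depends on the image width rather than on vec_size alone. Whether this is what actually triggers the error here is an assumption.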
Weirdness #2: If you change input_pixel to use store_in(MemoryType::Stack) and run it, it now fails with "Input buffer input is accessed at 238, which is beyond the max (119) in dimension 1." This is also baffling: every split I see uses a GuardWithIf tail, and a cursory inspection of the pseudocode doesn't show any obvious (to my eye) way that such an overread would occur.
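For reference, here is a minimal plain C++ model of the reduction loops as I read them after the fuse and the guarded split. It is a sketch of the loop nest only, under the assumption that fuse maps the fused index back with x = rxy % width and y = rxy / width; it says nothing about the region Halide infers for input_pixel or for the input buffer. `LoopSummary` and `run_fused_split_loops` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>

// What the modelled loop nest touched. Hypothetical helper type.
struct LoopSummary {
    int iterations;  // guarded iterations that actually ran
    int max_x;       // largest x coordinate visited
    int max_y;       // largest y coordinate visited
};

// Models fuse(r.x, r.y, rxy) followed by
// split(rxy, rxo, rxi, factor, TailStrategy::GuardWithIf).
LoopSummary run_fused_split_loops(int width, int height, int factor) {
    assert(width > 0 && height > 0 && factor > 0);
    const int fused_extent = width * height;
    const int outer_extent = (fused_extent + factor - 1) / factor;  // round up

    LoopSummary s{0, -1, -1};
    for (int rxo = 0; rxo < outer_extent; rxo++) {
        for (int rxi = 0; rxi < factor; rxi++) {
            const int rxy = rxo * factor + rxi;
            if (rxy >= fused_extent) continue;  // the GuardWithIf tail guard
            const int x = rxy % width;
            const int y = rxy / width;
            s.iterations++;
            s.max_x = std::max(s.max_x, x);
            s.max_y = std::max(s.max_y, y);
        }
    }
    return s;
}
```

For the 106 x 120 image above with a factor of 16, this model runs 12720 iterations and never goes past y = 119, which is why an access at 238 in dimension 1 is surprising if only the loop nest is considered.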
Note: I accidentally omitted the stride specification for input that indicates it is dense:
input.dim(1).set_stride(input.dim(0).extent() * 3);  // dense rows: row stride = width * 3 channels
...but adding that doesn't change either observed weirdness.
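As a sanity check on what "dense" means for this layout, the sketch below computes element offsets for an interleaved image with no row padding: the x stride is the channel count, the row stride is width times the channel count, and the channel stride is 1. It is plain C++ with hypothetical names (`InterleavedStrides`, `dense_interleaved_strides`, `element_offset`), not Halide API.

```cpp
#include <cassert>
#include <cstddef>

// Strides, in elements, for each dimension. Hypothetical helper type.
struct InterleavedStrides {
    int x, y, c;
};

// Strides of a densely packed interleaved image (no padding between rows).
InterleavedStrides dense_interleaved_strides(int width, int channels) {
    assert(width > 0 && channels > 0);
    return {channels, width * channels, 1};
}

// Offset, in elements, of pixel (x, y), channel c from the start of the image.
std::ptrdiff_t element_offset(const InterleavedStrides &s, int x, int y, int c) {
    return static_cast<std::ptrdiff_t>(x) * s.x +
           static_cast<std::ptrdiff_t>(y) * s.y +
           static_cast<std::ptrdiff_t>(c) * s.c;
}
```

For a width of 106 and 3 channels the row stride comes out as 318, and the last channel of the last pixel in row 0 sits immediately before the first channel of the first pixel in row 1, which is the property the fused-index schedule relies on.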