Halide LLVM ERROR from Halide 18.0.0

I got the error below when using Halide-18.0.0-x86-64-windows-41bc134ae9a8fa32d968867ac1aeeac6f63a142e, which I downloaded from https://buildbot.halide-lang.org/:

LLVM ERROR: Cannot select: t37: ch = masked_store<(store unknown-size into %ir.sum15, align 64, !tbaa !45)> t0, t28, FrameIndex:i64<0>, undef:i64, t35
  t28: v4f64 = BUILD_VECTOR ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>, ConstantFP:f64<0.000000e+00>
    t13: f64 = ConstantFP<0.000000e+00>
    t13: f64 = ConstantFP<0.000000e+00>
    t13: f64 = ConstantFP<0.000000e+00>
    t13: f64 = ConstantFP<0.000000e+00>
  t12: i64 = FrameIndex<0>
  t15: i64 = undef
  t35: v4i1 = setcc t30, t33, setle:ch
    t30: v4i32 = extract_subvector t2, Constant:i64<0>
      t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %23
        t1: v8i32 = Register %23
      t29: i64 = Constant<0>
    t33: v4i32 = extract_subvector t4, Constant:i64<0>
      t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %24
        t3: v8i32 = Register %24
      t29: i64 = Constant<0>
In function: Convolve

My Halide Generator class is attached: myHalideGenerator.txt

My command to run my Halide Generator class is: myHalideGenerator.exe -g Convolve -f Convolve input.type=float64 kernel.type=float64 output.type=float64 target=x86-64-windows-large_buffers-enable_llvm_loop_opt-avx512-avx2-avx-sse41-no_runtime-no_asserts -o ./

I found that this error also happens with the x86-64-osx package.

Jul 11 '24 19:07 jxl1080

Can confirm this is due to the feature flag avx512.

❯ DYLD_LIBRARY_PATH=../../../distrib/lib ./Convolve -g Convolve -f Convolve input.type=float64 kernel.type=float64 output.type=float64  target=host-avx512-no_runtime-no_bounds_query -o ./
LLVM ERROR: Cannot select: 0x7fd8ca04c2d0: ch = masked_store<(store unknown-size into %ir.lsr.iv21, align 8, !tbaa !51)> 0x7fd8ca041890, 0x7fd8ca046910, 0x7fd8ca045a60, undef:i64, 0x7fd8ca04a810
  0x7fd8ca046910: v4f64,ch = load<(dereferenceable load (s256) from %ir.sum38, align 64, !tbaa !35)> 0x7fd8ca046400, FrameIndex:i64<0>, undef:i64
    0x7fd8ca046240: i64 = FrameIndex<0>
    0x7fd8ca04c8f0: i64 = undef
  0x7fd8ca045a60: i64,ch = CopyFromReg 0x7fd8c9909f60, Register:i64 %89
    0x7fd8ca0461d0: i64 = Register %89
  0x7fd8ca04c8f0: i64 = undef
  0x7fd8ca04a810: v4i1 = setcc 0x7fd8ca04a180, 0x7fd8ca04bfc0, setle:ch
    0x7fd8ca04a180: v4i32 = extract_subvector 0x7fd8ca0a5b70, Constant:i64<0>
      0x7fd8ca0a5b70: v8i32,ch = CopyFromReg 0x7fd8c9909f60, Register:v8i32 %44
        0x7fd8ca045b40: v8i32 = Register %44
      0x7fd8ca04be70: i64 = Constant<0>
    0x7fd8ca04bfc0: v4i32 = extract_subvector 0x7fd8ca046f30, Constant:i64<0>
      0x7fd8ca046f30: v8i32,ch = CopyFromReg 0x7fd8c9909f60, Register:v8i32 %45
        0x7fd8ca0ac720: v8i32 = Register %45
      0x7fd8ca04be70: i64 = Constant<0>
In function: Convolve

Pipeline compiles fine without avx512. @jxl1080 I updated your generator to this:

#include "Halide.h"

class Convolve : public Halide::Generator<Convolve> {
public:
    // We declare the Inputs to the Halide pipeline as public
    // member variables. They'll appear in the signature of our generated
    // function in the same order as we declare them.

    Input<Buffer<>> input{"input", 2};
    Input<Buffer<>> kernel{ "kernel", 1 };
    Input<uint32_t> outputDim{"inputLen"};

    Output<Buffer<>> output{ "output", 2 };

private:
    Var x{"x"},c{"c"};
    Expr filterLen;
public:
    // We then define a method that constructs and returns the Halide
    // algorithm pipeline:
    void generate() {
        filterLen = kernel.dim(0).extent();
        Halide::RDom rk(0, filterLen);
        output(x,c) = Halide::sum(kernel(rk.x) * input(x + rk.x,c));
    }
    // scheduling pipeline:
    void schedule() {
        Expr vectorSize = natural_vector_size(output.type());
        output.vectorize(x, vectorSize, TailStrategy::GuardWithIf);
    }
};
HALIDE_REGISTER_GENERATOR(Convolve, Convolve)

Jul 12 '24 07:07 mcourteaux

I tried mcourteaux's modified generator class, it still failed with avx512. Thus a fix for this bug is still needed.

Jul 15 '24 14:07 jxl1080

LLVM ERROR: Cannot select: t37: ch = masked_store<(store unknown-size into %ir.sum15, align 64, !tbaa !45)> t0, t28, FrameIndex:i64<0>, undef:i64, t35

This may well be a bug in LLVM 18 (rather than Halide itself). Can you try with top-of-tree LLVM + top-of-tree Halide and see if it still repros?

Jul 15 '24 16:07 steven-johnson

I tried mcourteaux's modified generator class, it still failed with avx512. Thus a fix for this bug is still needed.

I was just trying to give some feedback; it was by no means meant as a fix. I was showing you that you can access buffer extents directly, so you don't have to pass them explicitly as extra arguments.

Jul 16 '24 11:07 mcourteaux

I am also encountering the same error when compiling with the avx512 flag in Halide v19. However, the compilation works fine when targeting other avx512 variants, such as avx512_cannonlake, avx512_skylake, avx512_zen4, etc.

Additionally, I do not face this issue when using the avx512 flag with Halide v16.

Any advice on how to resolve this?

Feb 19 '25 05:02 lokesh-0706

@abadams could you take a look at this? I often see you working with the avx512 codegen. Please see my comment above, that might save you some time.

Feb 19 '25 07:02 mcourteaux

That's an LLVM bug, so there's possibly not much we can do about it. I wouldn't ever use the avx512 flag by itself. It asks for the lowest-common-denominator avx512, which is the intersection of the instructions supported by both avx512 CPUs and those Xeon Phi accelerators from a few years ago. This amounts to the AVX512 F and CD extensions. I'd use at least avx512_skylake, which is the F, CD, BW, VL, and DQ extensions. I'm trying to figure out whether there is any appreciable number of CPUs out there that have avx512f but not avx512bw. Wikipedia claims some of the early Xeon Skylake-SP processors didn't have it, but any specific processor in that category that I check on WikiChip claims to have it, so I'm not sure who's wrong here.
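
As a rough illustration of that advice, here is a minimal sketch of rewriting a target string so that a bare avx512 feature becomes avx512_skylake before it is passed as target=... to the generator. The function prefer_skylake_avx512 is a hypothetical helper, not part of Halide, and it assumes features are separated by '-' as in the commands earlier in this thread. Whether this avoids the error in a given setup has to be checked by recompiling.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not a Halide API): swap the bare "avx512" feature
// token in a dash-separated target string for "avx512_skylake".
// Whole tokens are compared, so "avx512_zen4", "avx512_cannonlake", etc.
// are left untouched.
std::string prefer_skylake_avx512(const std::string &target) {
    std::istringstream in(target);
    std::string token;
    std::string out;
    bool first = true;
    while (std::getline(in, token, '-')) {
        if (token == "avx512") {
            token = "avx512_skylake";
        }
        if (!first) {
            out += '-';  // re-join tokens with the original separator
        }
        out += token;
        first = false;
    }
    return out;
}
```

Applied to the target from the original report, this leaves every other feature (avx2, avx, sse41, no_runtime, and so on) in place and changes only the avx512 token.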

Feb 19 '25 18:02 abadams

Given all the issues, we should just deprecate/remove that flag, since it's both buggy and apparently-not-useful for real world hardware.

Feb 19 '25 21:02 steven-johnson