
[FEATURE REQUEST] Clang vectorization on ARM: `warning: loop not vectorized`

Open mikekgfb opened this issue 1 year ago • 7 comments

(py311) mikekg@mikekg-mbp torchchat % python torchchat.py export --output-dso s.so  --quant '{"embedding": {"bitwidth":8, "groupsize": 32}}' --checkpoint-path ${MODEL_PATH} --temperature 0
Using device=cpu
Loading model...
Time to load model: 0.04 seconds
Quantizing the model with: {'embedding': {'bitwidth': 8, 'groupsize': 32}}
Time to quantize model: 0.05 seconds
Exporting model using AOT Inductor to /Users/mikekg/memory/x/z/a/b/torchchat/s.so
/Users/mikekg/memory/x/z/a/b/torchchat/cjks6zm6fxtuhqcxm7zrxesso4ksap62pjzfrfjhak7h5djxutyu.cpp:523:17: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
extern "C" void cpp_fused_index_put_stack_1(const float* in_ptr0,
                ^
/Users/mikekg/memory/x/z/a/b/torchchat/cjks6zm6fxtuhqcxm7zrxesso4ksap62pjzfrfjhak7h5djxutyu.cpp:1112:17: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
extern "C" void cpp_fused_index_put_stack_6(const float* in_ptr0,
                ^
/Users/mikekg/memory/x/z/a/b/torchchat/cjks6zm6fxtuhqcxm7zrxesso4ksap62pjzfrfjhak7h5djxutyu.cpp:1645:17: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
extern "C" void cpp_fused_index_put_stack_11(const float* in_ptr0,
                ^
/Users/mikekg/memory/x/z/a/b/torchchat/cjks6zm6fxtuhqcxm7zrxesso4ksap62pjzfrfjhak7h5djxutyu.cpp:2197:17: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
extern "C" void cpp_fused_index_put_stack_16(const float* in_ptr0,
                ^
/Users/mikekg/memory/x/z/a/b/torchchat/cjks6zm6fxtuhqcxm7zrxesso4ksap62pjzfrfjhak7h5djxutyu.cpp:2758:17: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
extern "C" void cpp_fused_index_put_stack_21(const float* in_ptr0,
                ^
/Users/mikekg/memory/x/z/a/b/torchchat/cjks6zm6fxtuhqcxm7zrxesso4ksap62pjzfrfjhak7h5djxutyu.cpp:3310:17: warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
extern "C" void cpp_fused_index_put_stack_26(const float* in_ptr0,
                ^
6 warnings generated.
The generated DSO model can be found at: /Users/mikekg/memory/x/z/a/b/torchchat/s.so
(py311) mikekg@mikekg-mbp torchchat % 

cc: @manuelcandales @malfet @swolchok

mikekgfb avatar Apr 27 '24 04:04 mikekgfb

@mikekgfb could you please upload the generated .cpp file? If the content is confidential, you may be able to reduce the code with `creduce`.

nadavrot avatar Apr 30 '24 03:04 nadavrot

Yes, having the generated .cpp file would help us get right into the investigation (as we're not super familiar with all the setup yet).

WenleiHe avatar Apr 30 '24 04:04 WenleiHe

I was able to reproduce the warning with the toy model stories15M.pt (though not with the exact same cpp source). The whole cpp file is a bit too large to share; one of the loops looks like this. The warning is about the line `#pragma omp simd simdlen(4)`. Just scanning the code without looking at the compile log, there are a few places worth checking. For example, the pragma is applied to the outer loop, not the inner loop. Also, I'm not sure `simdlen(4)` can always be satisfied by the compiler. In addition, it may be worth comparing the compiler's autovectorization vs. OpenMP SIMD (e.g., just let autovec do all the unrolling and vectorization without the OpenMP pragma). I will share the compile log once it's ready.

        #pragma omp simd simdlen(4) 
        for(long x0=static_cast<long>(8L*(c10::div_floor_integer(ks0, 8L))); x0<static_cast<long>(ks0); x0+=static_cast<long>(1L))
        {
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(6L); x1+=static_cast<long>(1L))
            {
                for(long x2=static_cast<long>(0L); x2<static_cast<long>(24L); x2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>((2L*x2) + (48L*x1) + (288L*x0))];
                    auto tmp1 = in_ptr1[static_cast<long>(x0)];
                    auto tmp7 = in_ptr0[static_cast<long>(1L + (2L*x2) + (48L*x1) + (288L*x0))];
                    auto tmp14 = in_ptr3[static_cast<long>((2L*x2) + (48L*x1) + (288L*x0))];
                    auto tmp16 = in_ptr3[static_cast<long>(1L + (2L*x2) + (48L*x1) + (288L*x0))];
                    auto tmp2 = decltype(tmp1)(tmp1 + 4096);
                    auto tmp3 = tmp1 < 0;
                    auto tmp4 = tmp3 ? tmp2 : tmp1;
                    AOTI_TORCH_CHECK((0 <= tmp4) & (tmp4 < 4096L), "index out of bounds: 0 <= tmp4 < 4096L")
                    auto tmp5 = in_ptr2[static_cast<long>((2L*x2) + (48L*tmp4))];
                    auto tmp6 = decltype(tmp0)(tmp0 * tmp5);
                    auto tmp8 = in_ptr2[static_cast<long>(1L + (2L*x2) + (48L*tmp4))];
                    auto tmp9 = decltype(tmp7)(tmp7 * tmp8);
                    auto tmp10 = decltype(tmp6)(tmp6 - tmp9);
                    auto tmp11 = decltype(tmp7)(tmp7 * tmp5);
                    auto tmp12 = decltype(tmp0)(tmp0 * tmp8);
                    auto tmp13 = decltype(tmp11)(tmp11 + tmp12);
                    auto tmp15 = decltype(tmp14)(tmp14 * tmp5);
                    auto tmp17 = decltype(tmp16)(tmp16 * tmp8);
                    auto tmp18 = decltype(tmp15)(tmp15 - tmp17);
                    auto tmp19 = decltype(tmp16)(tmp16 * tmp5);
                    auto tmp20 = decltype(tmp14)(tmp14 * tmp8);
                    auto tmp21 = decltype(tmp19)(tmp19 + tmp20);
                    out_ptr0[static_cast<long>((2L*x2) + (48L*x1) + (288L*x0))] = tmp10;
                    out_ptr1[static_cast<long>((2L*x2) + (48L*x1) + (288L*x0))] = tmp13;
                    out_ptr2[static_cast<long>((2L*x2) + (48L*x1) + (288L*x0))] = tmp18;
                    out_ptr3[static_cast<long>((2L*x2) + (48L*x1) + (288L*x0))] = tmp21;
                }
            }
        }

helloguo avatar Apr 30 '24 06:04 helloguo

Good catch @helloguo: the LLVM loop vectorizer only vectorizes innermost loops. Mystery solved.

nadavrot avatar Apr 30 '24 17:04 nadavrot

@Jack-Khuu do you know the context?

cccclai avatar Feb 07 '25 21:02 cccclai

This looks like a feature request for clang/LLVM, not torchchat.

swolchok avatar Feb 07 '25 21:02 swolchok

FYI @dcci @WenleiHe @helloguo

nadavrot avatar Feb 08 '25 01:02 nadavrot