clspv Support for "long vector" types

Elements not covered by the initial support we intend to upstream:

Constant expressions, cf https://github.com/google/clspv/pull/655/files#r510348777
Additional LLVM intrinsics, cf #666
Phi nodes, cf #686
More complex forms of GEP, cf #694
Additional bitcasts, cf #732

Original description below.

Because "long vector" types (i.e. any vector type of arity 8 or 16) cannot be directly represented in Vulkan SPRI-V, their usages are currently rejected by the frontend.

We (arm) have been working on a prototype that lowers these vector types into equivalent aggregate of scalars and massages OpenCL C builtin function calls and other LLVM instructions to produce valid, equivalent Vulkan SPIR-V. The core functionality is implemented as an LLVM module pass. It is, in some ways, similar to the LLVM scalariser but it selectively scalarises instructions and handles further builtin functions as well as function parameters.

The benefit of this pass is to increase support for arm compute library (#349). Before providing our implementation, we wanted to give a high level description of how it works and make sure that the proposed approach looks sensible to you.

Here is an example of what the prototype currently supports:

static size_t getId() { return get_global_id(0); }
static float8 load(global float *ptr) { return vload8(getId(), ptr); }
static void process(float8 a, private float8 *b) { *b = fma(a, 4, 7); }
static void store(global float *ptr, float8 value) { vstore8(value, getId(), ptr); }

kernel void k(global float *in, global float *out) {
  float8 x = load(in);
  process(x * 2, &x);
  store(out, x);
}

Long vector types are supported as variable types and parameters types for user-defined functions but not for kernel parameters. It might be possible to support the latter case, too, but we left it out for now and an appropriate error message is provided to the user. Pointers to these types are supported as well. The pass will rewrite user-defined functions, when needed. E.g. the load function shown above is rewritten as follows in LLVM IR:

define internal spir_func { float, float, float, float, float, float, float, float } @load.1(float addrspace(1)* %0) #1 {
entry:
  %call.i = call spir_func i32 @getId() #5
  %1 = shl i32 %call.i, 3
  %2 = getelementptr float, float addrspace(1)* %0, i32 %1
  %3 = load float, float addrspace(1)* %2, align 4
  %4 = or i32 %1, 1
  %5 = getelementptr float, float addrspace(1)* %0, i32 %4
  %6 = load float, float addrspace(1)* %5, align 4
  %7 = or i32 %1, 2
  %8 = getelementptr float, float addrspace(1)* %0, i32 %7
  %9 = load float, float addrspace(1)* %8, align 4
  %10 = or i32 %1, 3
  %11 = getelementptr float, float addrspace(1)* %0, i32 %10
  %12 = load float, float addrspace(1)* %11, align 4
  %13 = or i32 %1, 4
  %14 = getelementptr float, float addrspace(1)* %0, i32 %13
  %15 = load float, float addrspace(1)* %14, align 4
  %16 = or i32 %1, 5
  %17 = getelementptr float, float addrspace(1)* %0, i32 %16
  %18 = load float, float addrspace(1)* %17, align 4
  %19 = or i32 %1, 6
  %20 = getelementptr float, float addrspace(1)* %0, i32 %19
  %21 = load float, float addrspace(1)* %20, align 4
  %22 = or i32 %1, 7
  %23 = getelementptr float, float addrspace(1)* %0, i32 %22
  %24 = load float, float addrspace(1)* %23, align 4
  %25 = insertvalue { float, float, float, float, float, float, float, float } undef, float %3, 0
  %26 = insertvalue { float, float, float, float, float, float, float, float } %25, float %6, 1
  %27 = insertvalue { float, float, float, float, float, float, float, float } %26, float %9, 2
  %28 = insertvalue { float, float, float, float, float, float, float, float } %27, float %12, 3
  %29 = insertvalue { float, float, float, float, float, float, float, float } %28, float %15, 4
  %30 = insertvalue { float, float, float, float, float, float, float, float } %29, float %18, 5
  %31 = insertvalue { float, float, float, float, float, float, float, float } %30, float %21, 6
  %32 = insertvalue { float, float, float, float, float, float, float, float } %31, float %24, 7
  ret { float, float, float, float, float, float, float, float } %32
}

(Technically, this is the output after the new pass and InstCombine. The lowering of long vectors introduces many instructions that are trivial to remove and do not impact, AFAIK, the quality of the generated Vulkan SPIR-V.)

As you can already see from the above snippet, vload8 was lowered into 8 getelementptr + load pairs of instructions and the resulting float8 object is represented using an aggregate of 8 floats.

Similarly, the multiplication x * 2 and the call to fma are replaced by 8 fmul instructions and 8 calls to @_Z3fmafff, respectively. We currently have support for a limited set of builtin functions using a simple mapping between mangled names. We appreciate this solution has some serious shortcoming but we feel confident that it can be replaced later on with a better alternative that scales to support all builtin functions without impacting the design of the overall lowering pass.

If this approach looks okay to you we will produce PRs over the next couple of weeks to share our code, after we have finished some minor cleanups. We can either do it in one go with a larger PR or in small chunks, as you prefer. For reference, the current bulk of the implementation is roughly 1K LOC, not counting the tests.

Jul 31 '20 09:07 mantognini

Why are kernel parameters not supported? I would have thought that would be necessary to handle most code using long vectors.

Does parameter handling include support for pointers to long vectors? Is that the case that is preventing kernel parameters from being supported? You don't have an example here of that.

I'm not sure I understand the problem with the builtins? Which calls are problematic to identify? The builtin demangler should be able to recognize all the functions.

Aug 05 '20 20:08 alan-baker

Thanks for your reply.

Why are kernel parameters not supported? I would have thought that would be necessary to handle most code using long vectors.

I personally believe they could be supported, but it implies lying to the runtime and I don't have proof that it will work with any runtime. As a first step, I suggest we leave this as unsupported for now. It can be revisited in the future without major redesign I think.

Supporting these is not strictly necessary because long vectors can be loaded from memory using vload8/16, but I agree that it would be a nice feature.

Does parameter handling include support for pointers to long vectors? Is that the case that is preventing kernel parameters from being supported? You don't have an example here of that.

Currently, both long vector types and pointer to long vector types are rejected when used as kernel parameters using a check in the frontend for the reason mentioned above.

However, both are supported as user-defined functions. For example, process above is rewritten as follows:

define internal spir_func void @process.2(
    { float, float, float, float, float, float, float, float } %0,
    { float, float, float, float, float, float, float, float }* %1
) #1 {
...

(Functions are renamed when they need rewriting but I'm hopeful this won't be the case in the future although it doesn't matter when they are internal functions.)

I'm not sure I understand the problem with the builtins?

There isn't really a problem. To clarify what I meant earlier, we took the simplest and cheapest solution we could find for this prototype and focused our attention to trickier areas. We took a test-driven approach to ensure the builtin functions were converted to the expected SPIR-V and therefore decided to limit the support of these builtins to the ones we have tests for. We identified this as a sub-problem that can be progressed independently once the core functionality is implemented.

Aug 06 '20 09:08 mantognini

The overall direction sounds reasonable. I agree it's good to implement this incrementally. We can improve the conversion later if necessary. Smaller PRs are generally preferable to single large ones.

Aug 06 '20 13:08 alan-baker

I've created #638 to introduce the first elements to support long-vectors. Please do let me know if at any point you feel the PRs are too small/too big. I'll submit sometimes multiple commits within the same PR but I'm happy to provide them separately instead if you prefer.

I would like that acknowledge and thank two colleagues at Arm, Pedro Olsen Ferreira and Anna-Mariia Koltsova, for their help and collaboration on this project.

Sep 09 '20 13:09 mantognini

With #732 merged, this completes our (Arm) prototype for long vector support. Some elements are known to be missing (cf non-exhaustive list at the top of the description + TODO in tests/code), but they can be addressed independently by anyone we expect.

If desired, I can open one issue per element on the list in the description and close this issue. Let me know your preference.

A personal thanks for all the good review comments!

Apr 16 '21 08:04 mantognini

We can leave this issue open at least until someone starts tackling the remaining work. Thanks for all the contributions!

Apr 16 '21 15:04 alan-baker

I think we can now close this issue.

Mar 29 '23 09:03 rjodinchr