
[avx2] design/issues with avx2 prototyping

Open carlushuang opened this issue 2 years ago • 0 comments

This issue tracks problems encountered while prototyping the AVX2 backend of CK.

  1. CPU-only compile. Many headers include hip_runtime.h and use the __device__ / __host__ qualifiers to mark host/device code. Device-related code should be decoupled so that a CPU-only build is possible.

  2. DynamicBuffer contains GPU intrinsics for its memory operations. For the CPU path we may need to utilize AVX-related intrinsics instead.

  3. __attribute__((ext_vector_type(N))) does not seem to map cleanly onto the 64/128/256-bit registers: ext_vector_type(8) generates 2 xmm registers instead of a single ymm, ext_vector_type(4) generates a single xmm (this is what we want), but ext_vector_type(2) still generates a single xmm. This makes it difficult to implement the vector type on CPU.

  4. Also, a ymm/xmm register cannot be iterated over lane by lane (8 floats / 4 floats) to apply an elementwise operation; the register must be treated as a whole. Hence StaticallyIndexedArray cannot be utilized.

  5. Registers available for front-end programming are limited, so we prefer not to implement the thread-local buffer by holding data in registers. Every micro kernel therefore needs to write its result back to memory (cache) before the next iteration.

  6. For task distribution, we design the following multi-level gemm: a) thread-wise gemm: the micro kernel, with the A/B matrices expected to reside in L1 cache. b) block-wise gemm: A/B matrices expected to reside in L2/L3 cache, also called a cache block. Despite the name, this still runs on a single thread. c) grid-wise gemm: the whole task size; multi-threading is applied at this level.

  7. NUMA binding / thread binding. In a multi-threaded environment, binding threads to different cores makes a big performance difference, especially on current Zen chiplet designs.
    TODO: Zen optimization guide

  8. Tile blocking must support blocks that do not evenly divide the problem: a) grid-wise/block-wise/thread-wise gemm need to calculate the current block size at runtime instead of compile time. b) thread-wise gemm dispatches to different kernels. c) thread-wise transfer needs to deal with unevenly divided sizes and packing (or not packing).

  9. Transpose while reading/writing through AVX registers (how to describe this with a tensor transform?)

  10. DimAccessOrder with OpenMP, e.g. the order is <0, 1, 2> and we need to merge dim 1 and dim 2 to utilize OpenMP for multi-thread distribution.

  11. Dynamic thread-wise copy. Today the GPU path uses static_ford over the copy dimensions, but on CPU the number of iterations can be thousands or tens of thousands, which is too many to statically expand into code.

carlushuang, Apr 07 '22 06:04