
[SYCL][CUDA] Accumulator layout is specified at load/store.

Open · JackAKirk opened this pull request 3 years ago · 1 comment

This is a move towards the future-looking joint_matrix, joint_matrix_load, and joint_matrix_store APIs. The aim is for the CUDA and Intel implementations of the joint_matrix extension to use matching interfaces, while still exposing the full functionality of both backends.
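For reference, here is a minimal sketch of what a kernel using this unified, layout-at-load/store style could look like, loosely following the usage in the linked test-suite changes. The template parameters and function signatures were still in flux at this point, so treat everything below as illustrative rather than the final API.

```cpp
// Illustrative only: based on the draft unified joint_matrix API; the exact
// template parameters and signatures may differ from what is finally merged.
#include <sycl/sycl.hpp>
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

constexpr size_t M = 16, N = 16, K = 16;

void mad_once(queue &q, buffer<half, 1> &bufA, buffer<half, 1> &bufB,
              buffer<float, 1> &bufC) {
  q.submit([&](handler &cgh) {
    auto accA = bufA.get_access<access::mode::read_write>(cgh);
    auto accB = bufB.get_access<access::mode::read_write>(cgh);
    auto accC = bufC.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for(nd_range<2>({1, 32}, {1, 32}), [=](nd_item<2> item) {
      auto sg = item.get_sub_group();
      // A and B carry their layout as a template parameter...
      joint_matrix<sub_group, half, use::a, M, K, layout::row_major> mA;
      joint_matrix<sub_group, half, use::b, K, N, layout::row_major> mB;
      // ...but the accumulator does not: its layout is given at load/store.
      joint_matrix<sub_group, float, use::accumulator, M, N> mC;

      joint_matrix_load(sg, mC, accC.get_pointer(), N, layout::row_major);
      joint_matrix_load(sg, mA, accA.get_pointer(), K);
      joint_matrix_load(sg, mB, accB.get_pointer(), N);
      mC = joint_matrix_mad(sg, mA, mB, mC);
      joint_matrix_store(sg, mC, accC.get_pointer(), N, layout::row_major);
    });
  });
}
```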

Signed-off-by: JackAKirk [email protected]

JackAKirk avatar Aug 29 '22 15:08 JackAKirk

Updated usage of joint_matrix can be seen in the changes here: https://github.com/intel/llvm-test-suite/pull/1183.

JackAKirk avatar Aug 29 '22 15:08 JackAKirk

@dkhaldi @yubingex007-a11y @gmlueck

matrix-unified.hpp contains the agreed interfaces for joint_matrix_load, joint_matrix_store, and joint_matrix_mad. These functions dispatch to backend implementations depending on the compiler macros __NVPTX__ and __SPIR__ (later we can also add AMD macros). I've added the CUDA backend implementations in the matrix-tensor-cores.hpp file. This is just a draft aimed at finding any technical issues with the unified approach; once https://github.com/intel/llvm/pull/6957 is merged I will pull in those changes and update the macro usage.
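To make the dispatch pattern concrete, here is a standalone, simplified sketch (not the actual headers; the detail:: function names and parameter lists are placeholders): the unified entry point forwards to whichever backend the current compiler pass defines a macro for.

```cpp
// Standalone illustration of the compile-time dispatch described above.
// Types and names are simplified placeholders, not the real symbols.
#include <cstddef>

namespace detail {
// Placeholder backend entry points; in the PR the CUDA ones live in
// matrix-tensor-cores.hpp and the Intel ones in the Intel backend headers.
template <typename MatT, typename PtrT>
void load_cuda(MatT &, PtrT, std::size_t) { /* Tensor Core builtins */ }
template <typename MatT, typename PtrT>
void load_intel(MatT &, PtrT, std::size_t) { /* SPIR-V matrix builtins */ }
} // namespace detail

template <typename MatT, typename PtrT>
void joint_matrix_load(MatT &res, PtrT src, std::size_t stride) {
#if defined(__NVPTX__)
  detail::load_cuda(res, src, stride);  // CUDA device pass
#elif defined(__SPIR__)
  detail::load_intel(res, src, stride); // Intel device pass
#else
  // Host pass / other targets: no device work to forward.
  (void)res; (void)src; (void)stride;
#endif
}
```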

The main implementation issue I think we will face is the redefinition of partial specializations of the joint_matrix struct across the AMX/CUDA backends. These backends use completely different definitions of joint_matrix but overlapping template parameters. Ideally we would select the correct definitions depending on the backend, unless you can see another solution? Here you can see that I have also separated the unified joint_matrix struct in joint-matrix.hpp from the CUDA backend partial specializations of joint_matrix in joint-matrix-cuda-impl.hpp.
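A standalone illustration of the overlap (simplified; the data members are stand-ins, not the real backend representations): both backends want to partially specialize joint_matrix over the same template parameters but with different bodies, so at most one backend's definitions can be visible in a given compilation, e.g. by guarding them with the backend macros.

```cpp
// Simplified illustration of the redefinition problem and of guarding the
// specializations per backend; not the actual library definitions.
#include <cstddef>

enum class use { a, b, accumulator };
enum class layout { row_major, col_major, dynamic };

// Primary template declared once in the unified header (joint-matrix.hpp).
template <typename T, use U, std::size_t Rows, std::size_t Cols,
          layout L = layout::dynamic>
struct joint_matrix;

#if defined(__NVPTX__)
// CUDA backend (joint-matrix-cuda-impl.hpp): the fragment is split across
// the 32 work-items of the sub-group.
template <std::size_t Rows, std::size_t Cols, layout L>
struct joint_matrix<float, use::accumulator, Rows, Cols, L> {
  float frag[Rows * Cols / 32];
};
#elif defined(__SPIR__)
// Intel backend: same template parameters, a completely different body, so
// the two specializations cannot both be visible in one translation unit.
template <std::size_t Rows, std::size_t Cols, layout L>
struct joint_matrix<float, use::accumulator, Rows, Cols, L> {
  void *spirv_handle; // stand-in for the SPIR-V matrix object
};
#endif
```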

Do you think that we could use the driver to select the correct partial specializations in a similar manner to how https://github.com/intel/llvm/pull/6524/files#diff-f8c64e36dfe3828a6f816c4550e78bb0305769ace1be53207e86ac9a3280ac9e selects the correct bfloat16 native library?

Also, you might want to verify that the Intel implementations can be called from matrix-unified.hpp, replacing the CUDA partial specializations of joint_matrix with the Intel ones, so that we uncover any other technical issues sooner rather than later as we unify.

Tests using the unified interface in the CUDA backend: https://github.com/intel/llvm-test-suite/pull/1183

JackAKirk avatar Oct 11 '22 09:10 JackAKirk