
[SYCL][CUDA] Accumulator layout is specified at load/store.

Open · JackAKirk opened this pull request 3 years ago · 1 comment

This is a move towards the future-looking joint_matrix, joint_matrix_load, and joint_matrix_store APIs. The aim is for the CUDA and Intel implementations of the joint_matrix extension to use matching interfaces, while still exposing the full functionality of both backends.
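For reference, here is a minimal sketch of what a kernel using this unified, layout-at-load/store style could look like, loosely following the usage in the linked test-suite changes. The template parameters and function signatures were still in flux at this point, so treat everything below as illustrative rather than the final API.

```cpp
// Illustrative only: based on the draft unified joint_matrix API; the exact
// template parameters and signatures may differ from what is finally merged.
#include <sycl/sycl.hpp>
using namespace sycl;
using namespace sycl::ext::oneapi::experimental::matrix;

constexpr size_t M = 16, N = 16, K = 16;

void mad_once(queue &q, buffer<half, 1> &bufA, buffer<half, 1> &bufB,
              buffer<float, 1> &bufC) {
  q.submit([&](handler &cgh) {
    auto accA = bufA.get_access<access::mode::read_write>(cgh);
    auto accB = bufB.get_access<access::mode::read_write>(cgh);
    auto accC = bufC.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for(nd_range<2>({1, 32}, {1, 32}), [=](nd_item<2> item) {
      auto sg = item.get_sub_group();
      // A and B carry their layout as a template parameter...
      joint_matrix<sub_group, half, use::a, M, K, layout::row_major> mA;
      joint_matrix<sub_group, half, use::b, K, N, layout::row_major> mB;
      // ...but the accumulator does not: its layout is given at load/store.
      joint_matrix<sub_group, float, use::accumulator, M, N> mC;

      joint_matrix_load(sg, mC, accC.get_pointer(), N, layout::row_major);
      joint_matrix_load(sg, mA, accA.get_pointer(), K);
      joint_matrix_load(sg, mB, accB.get_pointer(), N);
      mC = joint_matrix_mad(sg, mA, mB, mC);
      joint_matrix_store(sg, mC, accC.get_pointer(), N, layout::row_major);
    });
  });
}
```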

Signed-off-by: JackAKirk [email protected]

JackAKirk avatar Aug 29 '22 15:08 JackAKirk

Updated usage of joint_matrix can be seen in the changes here: https://github.com/intel/llvm-test-suite/pull/1183.

JackAKirk avatar Aug 29 '22 15:08 JackAKirk

@dkhaldi @yubingex007-a11y @gmlueck

matrix-unified.hpp contains the agreed interfaces for joint_matrix_load, joint_matrix_store, and joint_matrix_mad. These functions dispatch to backend implementations depending on the compiler macros __NVPTX__ and __SPIR__ (later we can also add AMD macros). I've added the CUDA backend implementations in the matrix-tensor-cores.hpp file. This is just a draft aimed at finding any technical issues with the unified approach; once https://github.com/intel/llvm/pull/6957 is merged I will pull in those changes and update the macro usage.
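To make the dispatch pattern concrete, here is a standalone, simplified sketch (not the actual headers; the detail:: function names and parameter lists are placeholders): the unified entry point forwards to whichever backend the current compiler pass defines a macro for.

```cpp
// Standalone illustration of the compile-time dispatch described above.
// Types and names are simplified placeholders, not the real symbols.
#include <cstddef>

namespace detail {
// Placeholder backend entry points; in the PR the CUDA ones live in
// matrix-tensor-cores.hpp and the Intel ones in the Intel backend headers.
template <typename MatT, typename PtrT>
void load_cuda(MatT &, PtrT, std::size_t) { /* Tensor Core builtins */ }
template <typename MatT, typename PtrT>
void load_intel(MatT &, PtrT, std::size_t) { /* SPIR-V matrix builtins */ }
} // namespace detail

template <typename MatT, typename PtrT>
void joint_matrix_load(MatT &res, PtrT src, std::size_t stride) {
#if defined(__NVPTX__)
  detail::load_cuda(res, src, stride);  // CUDA device pass
#elif defined(__SPIR__)
  detail::load_intel(res, src, stride); // Intel device pass
#else
  // Host pass / other targets: no device work to forward.
  (void)res; (void)src; (void)stride;
#endif
}
```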

The main implementation issue I think we will face is the redefinition of partial specializations of the joint_matrix struct across the AMX/CUDA backends. These backends use completely different definitions of joint_matrix but overlapping template parameters. Ideally we would select the correct definitions depending on the backend, unless you can see another solution? Here you can see that I have also separated the unified joint_matrix struct in joint-matrix.hpp from the CUDA backend partial specializations of joint_matrix in joint-matrix-cuda-impl.hpp.
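A standalone illustration of the overlap (simplified; the data members are stand-ins, not the real backend representations): both backends want to partially specialize joint_matrix over the same template parameters but with different bodies, so at most one backend's definitions can be visible in a given compilation, e.g. by guarding them with the backend macros.

```cpp
// Simplified illustration of the redefinition problem and of guarding the
// specializations per backend; not the actual library definitions.
#include <cstddef>

enum class use { a, b, accumulator };
enum class layout { row_major, col_major, dynamic };

// Primary template declared once in the unified header (joint-matrix.hpp).
template <typename T, use U, std::size_t Rows, std::size_t Cols,
          layout L = layout::dynamic>
struct joint_matrix;

#if defined(__NVPTX__)
// CUDA backend (joint-matrix-cuda-impl.hpp): the fragment is split across
// the 32 work-items of the sub-group.
template <std::size_t Rows, std::size_t Cols, layout L>
struct joint_matrix<float, use::accumulator, Rows, Cols, L> {
  float frag[Rows * Cols / 32];
};
#elif defined(__SPIR__)
// Intel backend: same template parameters, a completely different body, so
// the two specializations cannot both be visible in one translation unit.
template <std::size_t Rows, std::size_t Cols, layout L>
struct joint_matrix<float, use::accumulator, Rows, Cols, L> {
  void *spirv_handle; // stand-in for the SPIR-V matrix object
};
#endif
```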

Do you think that we could use the driver to select the correct partial specializations in a similar manner to how https://github.com/intel/llvm/pull/6524/files#diff-f8c64e36dfe3828a6f816c4550e78bb0305769ace1be53207e86ac9a3280ac9e selects the correct bfloat16 native library?

Also, you might want to verify that the Intel implementations can be called from matrix-unified.hpp, replacing the CUDA partial specializations of joint_matrix with the Intel ones, so that we uncover any other technical issues sooner rather than later as we unify.

Tests using the unified interface in the CUDA backend: https://github.com/intel/llvm-test-suite/pull/1183

JackAKirk avatar Oct 11 '22 09:10 JackAKirk