Corbin Robeck
This PR adds a basic functionality test that uses the matrix cores on AMD gfx908 and gfx90a hardware for dense matrix products.
Compiling with gcc 8.1.0 + nvcc 11.2.67 gives the following build error: /RAJAPerf/src/basic/MAT_MAT_SHARED-Cuda.cpp: In instantiation of 'void rajaperf::basic::MAT_MAT_SHARED::runCudaVariantImpl(rajaperf::VariantID) [with long unsigned int block_size = 256]': /tmp/tmpxft_002d5158_00000000-6_MAT_MAT_SHARED-Cuda.cudafe1.stub.c:12:193: required from here RAJAPerf/src/basic/MAT_MAT_SHARED-Cuda.cpp:234:4:...
The AMD gfx90a architecture has built-in hardware support (matrix cores) for dense matrix operations. The two builtins of interest to RAJA are __builtin_amdgcn_mfma_f64_4x4x4f64 and __builtin_amdgcn_mfma_f64_16x16x4f64. Done criteria: - Add a...
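A functionality test for the matrix cores needs a known-good answer to compare against. The following is a minimal host-side sketch of one, assuming the 16x16x4 builtin computes D = A * B + C for a 16x4 tile A, a 4x16 tile B, and a 16x16 accumulator C, all in double precision. The function name reference_mfma_16x16x4 and the row-major layout are hypothetical choices made for illustration; they are not part of RAJA or of the compiler builtins, and the per-lane register layout used by the hardware is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tile shape suggested by the builtin name mfma_f64_16x16x4f64.
constexpr std::size_t TILE_M = 16;
constexpr std::size_t TILE_N = 16;
constexpr std::size_t TILE_K = 4;

// Hypothetical host reference: returns D = A * B + C.
// a is TILE_M x TILE_K, b is TILE_K x TILE_N, c is TILE_M x TILE_N,
// all stored row-major (an assumption of this sketch).
std::vector<double> reference_mfma_16x16x4(const std::vector<double>& a,
                                           const std::vector<double>& b,
                                           const std::vector<double>& c) {
  assert(a.size() == TILE_M * TILE_K);
  assert(b.size() == TILE_K * TILE_N);
  assert(c.size() == TILE_M * TILE_N);

  std::vector<double> d = c;  // start from the accumulator tile
  for (std::size_t i = 0; i < TILE_M; ++i) {
    for (std::size_t j = 0; j < TILE_N; ++j) {
      for (std::size_t k = 0; k < TILE_K; ++k) {
        d[i * TILE_N + j] += a[i * TILE_K + k] * b[k * TILE_N + j];
      }
    }
  }
  return d;
}
```

A device test could fill the tiles with small integer-valued doubles, so that the products and sums are exact, and then compare the GPU result element by element against this reference.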
1. Cover the full variety of different reducers in RAJA 2. Add a test that does one min reduction 3. Add a test that does one min-loc reduction 4. Add...
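The min and min-loc tests in the list above each need an expected value to check the reducer against. A sequential reference is sketched below in plain C++ without any RAJA types. The names MinLoc, reference_min, and reference_minloc are hypothetical, and returning the first index on ties is an assumption of this sketch rather than a documented RAJA guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for a min-loc reduction: the value and where it was found.
struct MinLoc {
  double value;
  std::size_t index;
};

// Sequential reference for a min reduction over a non-empty array.
double reference_min(const std::vector<double>& data) {
  assert(!data.empty());
  double best = data[0];
  for (std::size_t i = 1; i < data.size(); ++i) {
    if (data[i] < best) best = data[i];
  }
  return best;
}

// Sequential reference for a min-loc reduction over a non-empty array.
// The strict '<' keeps the first index when the minimum occurs more than once.
MinLoc reference_minloc(const std::vector<double>& data) {
  assert(!data.empty());
  MinLoc best{data[0], 0};
  for (std::size_t i = 1; i < data.size(); ++i) {
    if (data[i] < best.value) best = MinLoc{data[i], i};
  }
  return best;
}
```

A test can run the reducer under test on the same input and compare its value, and for min-loc its index, with these results.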
We are interested in kernels for solving the advection equation: M[du/dt] = K u with DG-FEM. We can break this up into two parts. 1. y = inv(M) x Since...
Currently sizeof(long double) returns different values in the AMDGPU backend depending on whether it is called from the CPU or the GPU. This could cause storage or alignment issues if not handled correctly. Add a test...
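One way such a test could be structured is to record the size and alignment on each side and compare them. The sketch below covers only the host half and the comparison; TypeLayout, host_layout, and layouts_match are hypothetical names, and the device-side values are assumed to come from a kernel that writes sizeof(long double) and alignof(long double) into a buffer, which is not shown.

```cpp
#include <cassert>  // used by callers that assert on the comparison
#include <cstddef>

// Hypothetical record of how a type is laid out on one side (host or device).
struct TypeLayout {
  std::size_t size;
  std::size_t alignment;
};

// Layout of T as seen by the host compiler.
template <typename T>
constexpr TypeLayout host_layout() {
  return TypeLayout{sizeof(T), alignof(T)};
}

// True when both sides agree on size and alignment, so a buffer of T
// written by one side can be read with the same stride by the other.
constexpr bool layouts_match(const TypeLayout& host, const TypeLayout& device) {
  return host.size == device.size && host.alignment == device.alignment;
}
```

With a device-reported TypeLayout in hand, the test would reduce to asserting layouts_match(host_layout<long double>(), device_layout).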
This is technically an architecture-agnostic patch, but AMDGPU is currently the only backend that would make use of it.
Follow on to: https://github.com/triton-lang/triton/pull/4137/
Add the init and allocation of the Proton dialect device buffer that can be used in place of the shared memory buffer. The device buffer is just a module local,...