Initial work to add Tensor Cores support in Halide
Tensor Cores are programmable units that perform warp-level matrix-multiply-and-accumulate operations. They can greatly improve the performance of matrix-multiply expressions written in Halide when the hardware is available.
This works by pattern-matching an expression of the form

```cpp
RDom k(0, matrix_size);
C(x, y) += f32(A(k, y)) * f32(B(x, k));
```

and generating code that will realize the function using the GPU.
The current support is quite limited. The only data types currently supported are float16_t for A and B and float for C. The dimensions of the input matrices must be a multiple of 16, since the only shape currently supported is m16n16k16. Also note that the generated PTX code is not the most efficient way to use Tensor Cores, but it gives good results with a simple schedule.
The following table shows the performance comparison of the modified cuda_mat_mul app for a matrix size of 2048
| Configuration | Time (ms) | GFlops |
|---|---|---|
| CUDA 5.0 | 6.01 | 2854.78 |
| CUDA 7.0 (Tensor Cores) | 5.20 | 3300.91 |
The performance difference is not massive, since the current code does not use shared memory.
CPU: Intel Core i7-8700, GPU: NVIDIA RTX 2060
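For context, the GFlops column follows directly from the 2 * N^3 floating-point operations of an N x N matrix multiply; the small discrepancies vs. the table presumably come from rounded timings. A quick sanity check in plain C++:

```cpp
// An N x N x N matrix multiply performs 2 * N^3 floating-point
// operations (one multiply and one add per inner-loop iteration).
double gflops(int n, double time_ms) {
    double flop = 2.0 * double(n) * n * n;
    return flop / (time_ms * 1e-3) / 1e9;
}

// gflops(2048, 6.01) is roughly 2858; gflops(2048, 5.20) is roughly 3304.
```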
Re-Syncing to master should fix the spurious Windows failure.
Looks like there are performance regressions on some of the buildbots with this PR in place.
I made a mistake in my previous commit. It's fixed in 7f9719c749625df8a27ca258fa455d982cd6ea31
Looks like we're getting real failures in some tests.
Are there any other issues to address to have this PR merged?
It would be good to have some more test coverage for this in test/correctness to at least make sure the intrinsics are actually being generated (e.g. in the style of simd_op_check).
Also, I'm a little concerned that this seems to require a root-level loop that's just a matrix multiply. Is there a path to being able to schedule tensor-core-using ops at the blocks level within a larger kernel? E.g. how do I do a matrix multiply fused with a relu in a single kernel? A tensor-core-using stage should be able to be scheduled compute_at the gpu blocks level of some other Func.
Finally, this seems to be based on pattern matching the source, and if the code changes slightly it'll fall off a performance cliff with no error, which is generally not how Halide does things. I think it should be explicitly scheduled. Here was my original proposal for how this should look from the scheduling language: https://github.com/halide/Halide/issues/4481
Where does this PR stand? Is it going to get more attention?
Our plan is to address the issues in this PR and the AMX support PR in September.
Any update on this PR?
Is this PR still active? Should it be closed?
Sorry about the late reply; yes, it's still active.
Hi guys, any updates?
The work on this PR has been paused for now, as it hasn't been a priority lately. I'm not sure when it will be resumed.
LLVM 11 is no longer supported by top-of-tree Halide. You should probably move to LLVM 14 instead.
On Thu, Jul 14, 2022, Jin Yue wrote:
@frengels I made a test with this PR; the Halide IR seems to be correct. But down at the LLVM IR level, LLVM optimizes the kernel to an empty function:

```
// .globl _kernel_matrix_mul_s1_y_blockY___block_id_y
// -- Begin function _kernel_matrix_mul_s1_y_blockY___block_id_y
// @_kernel_matrix_mul_s1_y_blockY___block_id_y
.visible .entry _kernel_matrix_mul_s1_y_blockY___block_id_y(
    .param .u64 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_0,
    .param .u64 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_1,
    .param .u64 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_2,
    .param .u32 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_3,
    .param .u32 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_4,
    .param .u32 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_5
)
{
// %bb.0:                               // %entry
    ret;
// -- End function
}
```

Any ideas on this? I'm using LLVM 11.1. I checked the LLVM passes, and it's the InstCombine pass that optimizes the function body away. Thanks.