Initial work to add Tensor Cores support in Halide
Tensor Cores are programmable units that perform warp-level matrix-multiply-and-accumulate operations. They can greatly improve the performance of matrix-multiply expressions written in Halide when the hardware is available.
This works by pattern-matching an expression of the form

```cpp
RDom k(0, matrix_size);
C(x, y) += f32(A(k, y)) * f32(B(x, k));
```

and generating code that will realize the function using the GPU.
The current support is quite limited. The only data types currently supported are float16_t for A and B and float for C. The dimensions of the input matrices must be a multiple of 16, since the only shape currently supported is m16n16k16. Also note that the generated PTX code is not the most efficient way to use Tensor Cores, but it gives good results with a simple schedule.
The following table shows the performance comparison of the modified cuda_mat_mul app for a matrix size of 2048
| Configuration | Time (ms) | GFlops |
|---|---|---|
| CUDA 5.0 | 6.01 | 2854.78 |
| CUDA 7.0 (Tensor Cores) | 5.20 | 3300.91 |
The performance difference is not massive, since the current code does not use shared memory.
CPU: Intel Core i7-8700, GPU: NVIDIA RTX 2060
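For context, the GFlops column follows directly from the 2 * N^3 floating-point operations of an N x N matrix multiply; the small discrepancies vs. the table presumably come from rounded timings. A quick sanity check in plain C++:

```cpp
// An N x N x N matrix multiply performs 2 * N^3 floating-point
// operations (one multiply and one add per inner-loop iteration).
double gflops(int n, double time_ms) {
    double flop = 2.0 * double(n) * n * n;
    return flop / (time_ms * 1e-3) / 1e9;
}

// gflops(2048, 6.01) is roughly 2858; gflops(2048, 5.20) is roughly 3304.
```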
Re-Syncing to master should fix the spurious Windows failure.
Looks like there are performance regressions on some of the buildbots with this PR in place.
I made a mistake in my previous commit. It's fixed in 7f9719c749625df8a27ca258fa455d982cd6ea31
Looks like we're getting real failures in some tests.
Are there any other issues to address to have this PR merged?
It would be good to have some more test coverage for this in test/correctness to at least make sure the intrinsics are actually being generated (e.g. in the style of simd_op_check).
Also, I'm a little concerned that this seems to require a root-level loop that's just a matrix multiply. Is there a path to being able to schedule tensor-core-using ops at the blocks level within a larger kernel? E.g. how do I do a matrix multiply fused with a relu in a single kernel? A tensor-core-using stage should be able to be scheduled compute_at the gpu blocks level of some other Func.
Finally, this seems to be based on pattern matching the source, and if the code changes slightly it'll fall off a performance cliff with no error, which is generally not how Halide does things. I think it should be explicitly scheduled. Here was my original proposal for how this should look from the scheduling language: https://github.com/halide/Halide/issues/4481
Where does this PR stand? Is it going to get more attention?
Our plan is to address the issues in this PR and the AMX support PR in September.
Any update on this PR?
Is this PR still active? Should it be closed?
Sorry about the late reply; yes, it's still active.
Hi guys, any updates?
The work on this PR has been paused for now, as it hasn't been a priority lately. I'm not sure when it will be resumed.
LLVM 11 is no longer supported by top-of-tree Halide. You should probably move to LLVM 14 instead.
On Thu, Jul 14, 2022, Jin Yue wrote:
@frengels I made a test with this PR; the Halide IR seems to be correct. But down at the LLVM IR level, LLVM optimizes the kernel to an empty function:

```
// .globl _kernel_matrix_mul_s1_y_blockY___block_id_y
// -- Begin function _kernel_matrix_mul_s1_y_blockY___block_id_y
// @_kernel_matrix_mul_s1_y_blockY___block_id_y
.visible .entry _kernel_matrix_mul_s1_y_blockY___block_id_y(
    .param .u64 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_0,
    .param .u64 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_1,
    .param .u64 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_2,
    .param .u32 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_3,
    .param .u32 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_4,
    .param .u32 _kernel_matrix_mul_s1_y_blockY___block_id_y_param_5
)
{
// %bb.0:                               // %entry
    ret;
// -- End function
}
```

Any ideas on this? I'm using LLVM 11.1. I checked the LLVM passes, and it's the InstCombine pass that optimizes the function body away. Thanks.