
[TIR] Enhance Lower cross thread Pass

Open LeiWang1999 opened this issue 1 year ago • 4 comments

We currently support lowering cross-thread reductions only under several constraints. For example, lower_cross_thread applies only when the thread-bound reduction axis is the innermost loop, and the block must have an init block. This can be limiting in some cases.

For example, when tensorizing the reduction block (e.g., dp4a or mma), it becomes difficult to tensorize the init statement as well:

with T.block("block"):
    vi = T.axis.spatial(2, i_0 * 16 + i_1)
    vk = T.axis.reduce(32, k_0 * 64 + k_1)
    T.where(i_0 * 16 + i_1 < 2 and k_0 * 64 + k_1 < 32)
    T.reads(A[vi, vk])
    T.writes(B[vi])
    with T.init():
        B[vi] = T.float32(0)
    B[vi] = B[vi] + A[vi, vk]

Moreover, certain cases, such as small GEMMs, prefer block reduction in shared memory to increase parallelism and make better use of hardware resources.

This pull request improves the lower_cross_thread pass. It can now lower thread-block reductions that have separate init and reduce blocks, and it removes the constraint that the reduced axis must be the innermost loop, which is needed to support TensorCore with block reduction.
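To make the idea of separate init and reduce steps concrete, here is a minimal sequential CPU sketch in plain C++. It is not the code the pass generates and uses no TVM API. The function name CrossThreadSum, the strided work split, and the power-of-two tree step are all assumptions chosen for illustration, with a std::vector standing in for the shared-memory staging buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential simulation of a cross-thread sum over one row.
// num_threads plays the role of the threads bound to the reduction axis
// and must be a power of two for the tree step below.
float CrossThreadSum(const std::vector<float>& row, std::size_t num_threads) {
  assert(num_threads > 0 && (num_threads & (num_threads - 1)) == 0);

  // Init step, kept separate from the reduction: every per-thread
  // accumulator starts at zero.
  std::vector<float> partial(num_threads, 0.0f);

  // Reduce step: "thread" t accumulates elements t, t + num_threads, ...
  for (std::size_t t = 0; t < num_threads; ++t) {
    for (std::size_t k = t; k < row.size(); k += num_threads) {
      partial[t] += row[k];
    }
  }

  // Cross-thread step: tree reduction over the staging buffer
  // (a stand-in for shared memory).
  for (std::size_t stride = num_threads / 2; stride > 0; stride /= 2) {
    for (std::size_t t = 0; t < stride; ++t) {
      partial[t] += partial[t + stride];
    }
  }
  return partial[0];
}
```

Because the init step only touches the accumulators, it can be scheduled independently of the reduce step, which is the property the description above relies on when the reduce step is tensorized.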

Relevant test cases can be found in tests/python/tir-transform/test_tir_transform_lower_cross_thread_reduction.py.

Please CC @MasterJH5574 .

LeiWang1999 • Jul 03 '24 08:07

@LeiWang1999 please fix the lint and the test case. @wrongtest-intellif, do you mind helping review the PR?

tqchen • Jul 31 '24 19:07

I’m attempting to remove the LoopVar with ForNode, and I’ve encountered unexpected behavior.

For new_for = For(ax_lane_id, Integer(0), warp_size, ForKind::kThreadBinding, n->body);
const ForNode* new_for_node = new_for.get();
LOG(INFO) << "new_for->min " << new_for_node->min;
// Output: 0

This snippet works fine and logs the expected output of 0 for new_for->min.

However, when I try to instantiate new_for_node directly, like this:

const ForNode* new_for_node = For(ax_lane_id, Integer(0), warp_size, ForKind::kThreadBinding, n->body).get();
LOG(INFO) << "new_for->min " << new_for_node->min;
// Check failed: (tindex < type_table_.size() && type_table_[tindex].allocated_slots != 0) is false: Unknown type index 76391888

@wrongtest-intellif, do you have any thoughts?

LeiWang1999 • Aug 30 '24 17:08

> However, when I try to instantiate new_for_node directly, like this:

Node pointers do not take ownership of objects. It seems the object referenced by your new_for_node was already destructed in the previous line, leaving a dangling pointer.

wrongtest-intellif • Sep 24 '24 05:09

> However, when I try to instantiate new_for_node directly, like this:
>
> Node pointers do not take ownership of objects. It seems the object referenced by your new_for_node was already destructed in the previous line, leaving a dangling pointer.

Thanks, exactly: it was an ownership issue. I now hold a For instead of a ForNode* to take advantage of TVM's ObjectRef management.

LeiWang1999 • Sep 27 '24 07:09
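The ownership problem discussed above can be reproduced without TVM. The sketch below uses std::shared_ptr as a rough stand-in for a reference-counted handle such as For. FakeForNode, FakeFor, MakeFor, and the two helper functions are hypothetical names introduced only for this illustration, and the analogy to ObjectRef is an assumption rather than a description of TVM internals. A live-instance counter shows that the node is already gone once the statement that created the temporary handle finishes, so the raw pointer is never dereferenced.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a node type; counts live instances.
struct FakeForNode {
  static int live;
  int min;
  explicit FakeForNode(int m) : min(m) { ++live; }
  ~FakeForNode() { --live; }
};
int FakeForNode::live = 0;

// Hypothetical stand-in for a reference-counted handle such as For.
using FakeFor = std::shared_ptr<FakeForNode>;

FakeFor MakeFor(int min) { return std::make_shared<FakeForNode>(min); }

// Problematic pattern: the temporary handle is destroyed at the end of the
// full expression, so the node is freed before the next statement runs.
int LiveNodesAfterTemporary() {
  const FakeForNode* raw = MakeFor(0).get();
  (void)raw;  // raw dangles here; reading raw->min would be undefined behavior
  return FakeForNode::live;
}

// Safe pattern: keep the handle in a named variable for as long as the raw
// pointer is used.
int MinViaNamedHandle() {
  FakeFor handle = MakeFor(0);
  const FakeForNode* raw = handle.get();
  assert(FakeForNode::live == 1);  // the handle keeps the node alive
  return raw->min;
}
```

In this sketch LiveNodesAfterTemporary() returns 0 because no handle outlives the statement, while MinViaNamedHandle() can read the field safely. That mirrors the fix adopted in the thread: hold the owning handle rather than only the raw node pointer.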