tvm
[TIR] Enhance Lower cross thread Pass
We currently support lowering cross-thread reduction only under several constraints. For example, lower_cross_thread applies only when the thread-bound reduction axis is the innermost loop, and the block must have an init block. This can be limiting in some cases.
For example, when tensorizing the reduction block (e.g., dp4a or mma), it becomes difficult to tensorize the init statement as well:
with T.block("block"):
    vi = T.axis.spatial(2, i_0 * 16 + i_1)
    vk = T.axis.reduce(32, k_0 * 64 + k_1)
    T.where(i_0 * 16 + i_1 < 2 and k_0 * 64 + k_1 < 32)
    T.reads(A[vi, vk])
    T.writes(B[vi])
    with T.init():
        B[vi] = T.float32(0)
    B[vi] = B[vi] + A[vi, vk]
Moreover, some workloads, such as small GEMMs, prefer block reduction in shared memory to increase parallelism and make better use of hardware resources.
This pull request improves the lower_cross_thread pass. It can now lower thread-block reductions that use separate init and reduce blocks, and it removes the constraint that the reduced axis must be the innermost loop, which is needed to support TensorCore with block reduction.
Relevant test cases can be found at tests/python/tir-transform/test_tir_transform_lower_cross_thread_reduction.py.
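To illustrate what "separate init and reduce blocks" means for a reduction like the one above, here is a minimal plain C++ sketch. It is not the output of the pass and does not use any TVM API; the function names, the `Matrix` alias, and the `num_threads` parameter are hypothetical, and the cross-thread combine step is modeled serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;

// Form 1: the init is fused into the reduction block and runs on the first
// iteration of the reduction axis (the role T.init() plays in the block above).
std::vector<float> ReduceFusedInit(const Matrix& a) {
  std::vector<float> b(a.size());
  for (std::size_t vi = 0; vi < a.size(); ++vi) {
    for (std::size_t vk = 0; vk < a[vi].size(); ++vk) {
      if (vk == 0) b[vi] = 0.0f;
      b[vi] += a[vi][vk];
    }
  }
  return b;
}

// Form 2: a separate init block, then an update-only block. The reduction
// axis is split across num_threads hypothetical threads; each accumulates a
// strided slice into its own slot, and the partial sums are combined after.
std::vector<float> ReduceSeparateInit(const Matrix& a, std::size_t num_threads) {
  assert(num_threads >= 1);
  std::vector<float> b(a.size());
  for (std::size_t vi = 0; vi < a.size(); ++vi) b[vi] = 0.0f;  // init block
  for (std::size_t vi = 0; vi < a.size(); ++vi) {
    std::vector<float> partial(num_threads, 0.0f);  // one slot per thread
    for (std::size_t tx = 0; tx < num_threads; ++tx) {
      for (std::size_t vk = tx; vk < a[vi].size(); vk += num_threads) {
        partial[tx] += a[vi][vk];
      }
    }
    for (float p : partial) b[vi] += p;  // cross-thread combine, serial here
  }
  return b;
}
```

Both forms compute the same row sums whenever the floating-point additions are exact (for example, small integer inputs). In general the summation order differs, so results may differ by rounding.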
Please CC @MasterJH5574 .
@LeiWang1999 please fix the lint errors and the test case. @wrongtest-intellif, would you mind helping review the PR?
I’m attempting to remove the LoopVar with ForNode, and I’ve encountered unexpected behavior.
For new_for = For(ax_lane_id, Integer(0), warp_size, ForKind::kThreadBinding, n->body);
const ForNode* new_for_node = new_for.get();
LOG(INFO) << "new_for->min " << new_for_node->min;
// Output: 0
This snippet works fine and logs the expected output of 0 for new_for->min.
However, when I try to instantiate new_for_node directly, like this:
const ForNode* new_for_node = For(ax_lane_id, Integer(0), warp_size, ForKind::kThreadBinding, n->body).get();
LOG(INFO) << "new_for->min " << new_for_node->min;
// Check failed: (tindex < type_table_.size() && type_table_[tindex].allocated_slots != 0) is false: Unknown type index 76391888
@wrongtest-intellif, do you have any thoughts?
However, when I try to instantiate new_for_node directly, like this:
Node pointers do not take ownership of objects. It seems the object referenced by your new_for_node was already destructed at the end of the previous line, leaving a dangling pointer.
Thanks, exactly: it was an ownership issue. I now use For instead of ForNode* to take advantage of TVM's ObjectRef management.
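The lifetime issue discussed above can be reproduced without TVM. The following is a minimal sketch under stated assumptions: `Node` and `Ref` are hypothetical stand-ins for a node type such as ForNode and a reference type such as For, and `std::shared_ptr` is used here as an analogy for reference-counted ownership rather than as TVM's actual mechanism. A global counter makes the destruction observable without dereferencing the dangling pointer.

```cpp
#include <cassert>
#include <memory>

int g_live_nodes = 0;  // number of Node objects currently alive

// Hypothetical stand-in for a node type such as ForNode.
struct Node {
  int min;
  explicit Node(int m) : min(m) { ++g_live_nodes; }
  ~Node() { --g_live_nodes; }
};

// Hypothetical stand-in for a reference type such as For: it owns the node
// through a reference count, while get() hands out a non-owning raw pointer.
class Ref {
 public:
  explicit Ref(int min) : data_(std::make_shared<Node>(min)) {}
  const Node* get() const { return data_.get(); }

 private:
  std::shared_ptr<Node> data_;
};

// Safe pattern: the named handle keeps the node alive while the raw pointer
// is in use.
int ReadMinThroughNamedHandle() {
  Ref handle(0);
  const Node* node = handle.get();
  assert(node != nullptr && g_live_nodes == 1);
  return node->min;
}

// Unsafe pattern: the temporary Ref is destroyed at the end of the full
// expression, so the node is already gone on the next line. The pointer is
// deliberately never dereferenced; doing so would be undefined behavior.
int LiveNodesAfterTemporaryHandle() {
  const Node* dangling = Ref(0).get();
  (void)dangling;
  return g_live_nodes;
}
```

Holding the owning handle in a named variable, as in the first function, is the same choice as keeping a For rather than a bare ForNode*.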