[RELAX][PASS] Annotate Custom Scope layout pass for Adreno GPU
Refer to https://discuss.tvm.apache.org/t/rfc-annotate-custom-scope-layout-relax-pass-for-adreno-gpu/18052/6 for details about texture scope handling.
@tvm-bot rerun
@Hzfengsy do you mind taking a look, given it touches FuseOps/FuseTIR?
also cc @yongwww for memory scope related changes
@tvm-bot rerun
Thanks @srkreddy1238 for the updates. I took a closer look and now understand the motivation behind add_attributes: it is mainly to handle the conv2d operators where texture memory can be supported.
However, attaching op attributes to call_tir has an undesirable impact. The specification of call_tir originally did not have to deal with these attributes, and carrying them results in a "leak through" that increases the surface area for developers working with call_tir.
I also now understand that the goal is to let the finally fused call_tir function decide whether texture memory is feasible.
I think it is cleaner to try a different approach. Instead of relying on the legalize pass, let us introduce an Adreno-specific conv_dispatch that runs before legalization to offload these conv operators. We specifically attach the attribute tir.opencl_texture_2d_supported = true to the call node.
Now the remaining question is where the schedule can appear:
- The cleanest way is to have relax.adreno.conv_dispatch call the dlight schedule, construct such a call_tir, and mark it as already scheduled (as sketched below). The only issue is that the follow-up FuseOps/FuseTIR would then have to treat this call as opaque, and we do not yet have the capability to run more fusions on it. But we should be fine getting the right conv2d op scheduled.
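A minimal sketch of what the tagging could look like. Only `tir.is_scheduled` is an existing convention (dlight skips functions carrying it); `tir.opencl_texture_2d_supported` is the flag proposed in this thread, and while the proposal attaches it to the call node, the sketch tags the PrimFunc purely for illustration:

```python
import tvm
from tvm.script import tir as T


# Placeholder PrimFunc standing in for the dlight-scheduled conv2d.
@T.prim_func
def conv2d_stub(A: T.Buffer((16,), "float32"), B: T.Buffer((16,), "float32")):
    for i in range(16):
        with T.block("compute"):
            vi = T.axis.spatial(16, i)
            B[vi] = A[vi] + T.float32(1.0)


# "tir.is_scheduled" tells dlight to leave this function alone;
# "tir.opencl_texture_2d_supported" is the flag proposed above.
tagged = conv2d_stub.with_attr("tir.is_scheduled", True)
tagged = tagged.with_attr("tir.opencl_texture_2d_supported", True)
```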
To further enable fusion, one can try adopting the following customized legalize sequence (sketched as a pipeline after the list):
- S0: relax.adreno.conv_dispatch: run the conv dispatch and mark the call as opaque with tir.opencl_texture_2d_supported = true
- S1: run legalize and analysis
- S2: do a pattern match to manually fuse the ewise ops onto the conv2d, creating a sub-function that calls into conv2d and then ewise, which can then be consumed by FuseTIR
- S3: run FuseOps (this will try to fuse the other ops)
- S4: run FuseTIR
- S5: run dlight
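A rough sketch of that sequence as a pass pipeline. The S0 and S2 passes do not exist yet (hypothetical, shown commented out); the remaining passes are existing TVM transforms:

```python
import tvm
from tvm import dlight as dl
from tvm import relax

mod = tvm.IRModule()  # stand-in for the real input model

seq = tvm.ir.transform.Sequential(
    [
        # S0: hypothetical relax.adreno.conv_dispatch pass, marks conv2d
        #     calls opaque with tir.opencl_texture_2d_supported = true
        relax.transform.LegalizeOps(),  # S1: legalize the remaining ops
        # S2: hypothetical pattern-match pass fusing ewise onto conv2d
        relax.transform.FuseOps(),      # S3: fuse the other ops
        relax.transform.FuseTIR(),      # S4
    ]
)

mod = seq(mod)
with tvm.target.Target("opencl"):
    mod = dl.ApplyDefaultSchedule(dl.gpu.Fallback())(mod)  # S5: dlight
```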
Belatedly realized I could have drafted an RFC to describe the approach. Done now: https://discuss.tvm.apache.org/t/rfc-annotate-custom-scope-layout-relax-pass-for-adreno-gpu/18052
@tqchen thanks for the thoughts. A few concerns I have with this approach:
- tir.opencl_texture_2d_supported = true: I assume this flag will be used to realize the VDevice in struct_info after FuseTIR. A flag alone may not be sufficient; we might need scope information for each input, and that information has to stay consistent as we pass through the fusion passes (see the sketch below).
- Another moderate challenge is in S2, where we need to define and maintain a BYOC-like pattern table to ensure maximum fusion possibilities.
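For illustration only, the information that would need to survive fusion looks more like this than a single boolean; "global.texture"/"global" are the actual OpenCL memory scopes in question, but the attribute names here are made up:

```python
# A bare flag loses track of which argument sits in texture vs. buffer
# memory; all key names below are hypothetical, not an existing convention.
scope_hint = {
    "tir.opencl_texture_2d_supported": True,             # the proposed flag
    "arg_scopes": ["global.texture", "global.texture"],  # one entry per input
    "out_scope": "global.texture",
}
```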
Pls advise.
Updated the RFC @ https://discuss.tvm.apache.org/t/rfc-annotate-custom-scope-layout-relax-pass-for-adreno-gpu/18052/6
@tvm-bot rerun
@tqchen can you take a look at this?
@tvm-bot rerun
@tqchen please handle this before any conflicts arise. The corresponding PRs for TIR lowering and runtime are waiting on this.
@tqchen
One catch I came across in the current implementation is representing the return-type sinfo_args for ops like conv2d in TVMScript:
```python
lv2: R.Tensor((2, 8, 54, 54, 4), dtype="float32", vdevice="opencl:1:global") = R.nn.conv2d(lv, lv1, strides=[1, 1], padding=[0, 0, 0, 0], dilation=[1, 1], groups=1, data_layout="NCHW4c", kernel_layout="OIHW4o", out_layout="NCHW4c", out_dtype="float32", sinfo_args=(R.Tensor((2, 8, 54, 54, 4), dtype="float32", vdevice="opencl:1:global"),))
```
Any advice?
Sorry, I have been busy with some stuff; should be able to do a pass around May.
Sorry for the delay in getting back, due to the recent focus on the FFI refactoring. I think the main remaining item is the following change:
```c++
StructInfo infered_sinfo = infer_struct_info_map[op](GetRef<Call>(call), builder_);
legalized = UpdateVDeviceOutStructInfo(legalized, visited_call, infered_sinfo);
```
where the struct info inference function is called during legalization (I commented on this previously, which might have been overlooked). The primary principle we have is to automate struct info deduction during building, so in theory every pre-visit call already has its struct info populated, and most rewrites should not change the struct info post-visit. Doing explicit deduction within a rewrite is a pattern we want to avoid.
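As a minimal illustration of that principle (a sketch, not code from this PR): calls emitted through the BlockBuilder already get their struct info deduced at emit time, so a rewrite normally never needs to invoke the infer map by hand:

```python
import tvm
from tvm import relax

bb = relax.BlockBuilder()
x = relax.Var("x", relax.TensorStructInfo((2, 4), "float32"))
with bb.function("main", [x]):
    # struct info of `lv` is deduced automatically when emitted;
    # no explicit call into the infer_struct_info map is needed.
    lv = bb.emit(relax.op.add(x, x))
    bb.emit_func_output(lv)

print(lv.struct_info)  # R.Tensor((2, 4), dtype="float32")
```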
Understood. The challenge lies in the legalization util linked below, which sets the output_sinfo (i.e. the VDevice info) derived from any one of the call's arguments instead of inferring it from the original operator's infer API. I will improve this part ...
https://github.com/apache/tvm/blob/437d00afcd5496700703427cc4ce316a35d35baa/python/tvm/relax/utils.py#L324
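Roughly the pattern in question, paraphrased (not the actual utils.py code): the output VDevice is taken from whichever argument happens to carry one, rather than from the operator's own struct info inference:

```python
from tvm import relax


def vdevice_from_args(args):
    """Paraphrase of the problematic behaviour described above: the first
    argument carrying a VDevice wins, which can go wrong once the inputs
    live in mixed memory scopes."""
    for arg in args:
        sinfo = arg.struct_info
        if isinstance(sinfo, relax.TensorStructInfo) and sinfo.vdevice is not None:
            return sinfo.vdevice
    return None
```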
@tqchen we are good now with the recommended changes. Pls take a look.
@tqchen rebased and up for review.