[DietCode] Local Padding
This PR contains the code generation changes required for the dynamic MetaScheduler (see apache/tvm-rfcs#72 for the RFC and #11516 for the tracking issue describing the changes). Any feedback or comments are welcome.
FYI, @comaniac @junrushao1994
Also cc @Hzfengsy @vinx13 @spectrometerHBH @masahi
Per offline discussion with @junrushao1994 and @ArmageddonKnight, here are the current action items:
- The local padding pass will be moved into a TIR transformation, meaning that local padding becomes an implicit transformation similar to loop partitioning. A config will be exposed to control whether it is turned on or off (default: off) so that all current workloads remain unchanged (see the sketch after this list).
- In the local padding implementation, the logic that relies on var node name hints will be improved to use a more reliable identifier (e.g., the pointer reference).
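For reference, here is a minimal sketch (not part of the PR itself) of how such a config flag would be toggled through `PassContext`. The flag name `tir.enable_local_pad` is taken from the test discussion further below; the tiny placeholder workload and the assumption that this PR's pass is applied (so the flag is registered) are mine.

```python
import tvm
from tvm import te

# Placeholder workload; any schedule would do for demonstrating the flag.
n = 1024
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")
sched = te.create_schedule(B.op)

# Build the same schedule with local padding disabled (the proposed default)
# and enabled. Note: without this PR applied, the config key is unregistered
# and PassContext would reject it.
for enable in (False, True):
    with tvm.transform.PassContext(config={"tir.enable_local_pad": enable}):
        mod = tvm.build(sched, [A, B], target="llvm")
```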
@junrushao1994 @Hzfengsy I have finished the revision. Please have a second look when you have time.
Also cc @comaniac
It seems that the CI build has stopped for some reason (I am unable to query the current CI status). Would it be possible to re-trigger the CI?
@tvm-bot rerun
Hi @ArmageddonKnight, it seems that the TVM transform config "tir.enable_local_pad" does not work: the kernel source code built from the same schedule is identical whether the config is set to true or false. I am using the test example you uploaded before; the example code is shown below:
```python
def save_kernel_source(kernel, log_kernel_filename):
    kernel_src = kernel.imported_modules[0].get_source()
    if log_kernel_filename is not None:
        with open(log_kernel_filename, 'w') as fout:
            fout.write("{}".format(kernel_src))
    else:
        print("{}".format(kernel_src))


@tvm.testing.requires_gpu
@tvm.testing.requires_cuda
def test_dense_local_padding():
    """Test that local padding is delivering the correct compute outcome."""
    x_np = np.random.uniform(-0.1, 0.1, size=(960, 770)).astype(np.float32)
    w_np = np.random.uniform(-0.1, 0.1, size=(770, 2304)).astype(np.float32)
    y_np = np.matmul(x_np, w_np)
    y_empty = np.empty(shape=y_np.shape, dtype=y_np.dtype)
    tir_sched = Schedule(Dense_960x770x2304)
    sample_dense_sched(tir_sched)
    with tvm.transform.PassContext(config={"tir.enable_local_pad": False}):
        nopad_cuda_kernel = tvm.build(tir_sched.mod["main"], [], target="cuda")
        save_kernel_source(nopad_cuda_kernel, "nolocalpad_kernel.cu")
    with tvm.transform.PassContext(config={"tir.enable_local_pad": True}):
        cuda_kernel = tvm.build(tir_sched.mod["main"], [], target="cuda")
        save_kernel_source(cuda_kernel, "localpad_kernel.cu")

    cuda_ctx = tvm.cuda()
    module_data = [x_np, w_np, y_empty]
    module_data = [tvm.nd.array(d, device=cuda_ctx) for d in module_data]
    cuda_kernel(*module_data)
    np.testing.assert_allclose(module_data[-1].numpy(), y_np, atol=1e-3, rtol=1e-3)
```
The generated localpad_kernel.cu is the same as nolocalpad_kernel.cu.
@renfeier That is because we are refactoring the implementation, so the pass itself is temporarily commented out. Sorry, I have been quite busy with university business; I will finish the refactoring soon.
@ArmageddonKnight Thank you for the prompt reply. Looking forward to your update.
@junrushao1994 As was discussed, I have fixed the implementation. Please review it again.
Hmm ... it seems that the Cortex CI pipelines keep being interrupted for some reason, and this is happening on the main branch as well.
@junrushao1994 The refactored implementation has passed the CI tests. Please review it when you have time available. Thanks.
Hi @junrushao, it has been some time since this PR was submitted. May I know whether there are any updates on it, and whether further changes are required?
@ArmageddonKnight @junrushao What is the status of this PR or DietCode upstreaming in general? I'm interested in dynamic shape tuning, and I can help this effort.
This looks similar to https://github.com/apache/tvm/pull/12750, maybe we don't need this? cc @vinx13
@masahi PadEinsum can achieve something similar, since the padding is applied in shared memory.
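For comparison, here is a rough sketch of how the `pad_einsum` primitive from #12750 is typically applied to a matmul block. The shapes, block name, and padding factors are illustrative only (not taken from this PR), and the primitive's exact preconditions may differ across TVM versions.

```python
import tvm
from tvm import te, tir

# Illustrative matmul whose reduction extent (770) is not a multiple of a
# typical tile size, so padding is actually needed along k.
M, N, K = 960, 2304, 770
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
k = te.reduce_axis((0, K), name="k")
C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

sch = tir.Schedule(te.create_prim_func([A, B, C]))
# Pad the i/j/k iteration domain of the matmul block up to multiples of 32.
# The padded copies live in intermediate buffers rather than in the original
# function arguments, which is what makes the effect comparable to local padding.
sch.pad_einsum(sch.get_block("C"), [32, 32, 32])
print(sch.mod.script())
```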