[Unity][Dlight] Enabling Fast and Efficient Kernel Generation by leveraging Hardware information
This pull request enhances Dlight. It uses hardware information to recommend tile candidates, which narrows the search space and enables fast tuning.
Below is a brief summary of the major changes:
- Introduce the dl.ApplyFastTuning pass.
- Add a new skip_simplify flag for schedule::reindex. It helps avoid the over-optimization of eliminating unit loops, and functions similarly to preserve_unit_loop.
- Improve compute_inline.cc to enhance simplification in some complex inline cases (e.g. layout transform).
- Simple bug fixes:
  - #16406
  - #16437
Related discussion: https://discuss.tvm.apache.org/t/dlight-enabling-fast-and-efficient-kernel-generation-by-leveraging-hardware-information/16273
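To illustrate what "using hardware information to recommend tile candidates" can mean, here is a minimal sketch in plain Python. It is not the code in this pull request: `HardwareInfo`, `recommend_tiles`, the tile size lists, and the ranking heuristic are all hypothetical and assume a matmul workload whose A and B tiles are staged in shared memory.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class HardwareInfo:
    """Hypothetical hardware description (not the PR's actual data structure)."""
    shared_mem_bytes: int  # shared memory available to one thread block


def recommend_tiles(m, n, k, hw, dtype_bytes=2, top_k=5):
    """Enumerate power-of-two tiles for an (m, n, k) matmul and rank them.

    Tiles that do not fit in shared memory are dropped up front, so only a
    handful of candidates ever need to be compiled and measured.
    """
    spatial_sizes = [16, 32, 64, 128, 256]
    reduce_sizes = [16, 32, 64]
    candidates = []
    for tm, tn, tk in product(spatial_sizes, spatial_sizes, reduce_sizes):
        if tm > m or tn > n or tk > k:
            continue
        # Shared memory for one A tile (tm x tk) plus one B tile (tk x tn).
        smem = (tm * tk + tk * tn) * dtype_bytes
        if smem > hw.shared_mem_bytes:
            continue  # tile does not fit on this hardware
        # FLOPs per byte staged in shared memory; higher means more reuse.
        intensity = (2 * tm * tn * tk) / smem
        candidates.append((intensity, (tm, tn, tk)))
    # Highest intensity first; ties are broken by the smaller tile tuple.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [tile for _, tile in candidates[:top_k]]


if __name__ == "__main__":
    hw = HardwareInfo(shared_mem_bytes=48 * 1024)
    for tile in recommend_tiles(1024, 1024, 1024, hw):
        print(tile)
```

Only the returned candidates would then be handed to the tuner, instead of searching the full schedule space.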
TODO Items of this pull request:
- [ ] provide related testing.
- [ ] code style may need guidance (e.g. remove or replace the log prints; refactor some Python components to C++)
- [x] leverage structural equal cache to avoid duplicated tuning.
- [x] implement mma schedule template with swizzling.
- [x] support dynamic symbolic tuning
- [ ] bring it to mlc-llm (maybe should improve our design to support dynamic symbolic).
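
The "structural equal cache" item above can be pictured with the following sketch. It is an illustration under assumptions, not this pull request's implementation: `structural_key` and `TuningCache` are hypothetical names, and a workload is modeled as a plain dict. In TVM itself, a key of this kind could plausibly be derived with `tvm.ir.structural_hash` and confirmed with `tvm.ir.structural_equal`, but whether the PR does so is not stated here.

```python
def structural_key(workload):
    """Build a key from the fields that affect the schedule, ignoring the name.

    Hypothetical stand-in for a structural hash of a PrimFunc.
    """
    return (workload["op"], tuple(workload["shape"]), workload["dtype"])


class TuningCache:
    """Reuse tuning results across structurally identical workloads."""

    def __init__(self, tune_fn):
        self._tune_fn = tune_fn  # expensive tuning routine supplied by the caller
        self._results = {}
        self.misses = 0  # number of times tune_fn was actually invoked

    def get(self, workload):
        key = structural_key(workload)
        if key not in self._results:
            self.misses += 1
            self._results[key] = self._tune_fn(workload)
        return self._results[key]


if __name__ == "__main__":
    cache = TuningCache(lambda w: {"tile": (64, 64, 32)})
    a = {"name": "matmul_0", "op": "matmul", "shape": (1024, 1024, 1024), "dtype": "float16"}
    b = {"name": "matmul_7", "op": "matmul", "shape": (1024, 1024, 1024), "dtype": "float16"}
    cache.get(a)
    cache.get(b)  # same structure, different name: served from the cache
    print(cache.misses)
```

A model with many identical layers then pays the tuning cost once per distinct kernel structure rather than once per layer.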