[RFC][Tracking Issue] Meta Schedule (AutoTIR)
This is a global tracking issue for landing the meta schedule. The RFC can be found here.
Steps
The steps are numbered following TensorIR (#7527).
[M3a] Core infrastructure
- [x] Instruction & Trace #8615
- [x] TracedSchedule #8623
- [x] Sampler #8642 #8817
- [x] Design space generator #9079
- [x] Search strategy #9132
- [x] Task Scheduler #9154
- [x] Tune Context #9053
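To make the core abstractions above concrete, here is a minimal sketch of random sampling on a schedule using the public `tvm.tir.Schedule` API; the toy `vecadd` kernel is illustrative only.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def vecadd(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (1024,), "float32")
    B = T.match_buffer(b, (1024,), "float32")
    C = T.match_buffer(c, (1024,), "float32")
    for i in T.serial(1024):
        with T.block("C"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

sch = tvm.tir.Schedule(vecadd, seed=42)
(i,) = sch.get_loops(sch.get_block("C"))
# Sampler (#8642): draw a random perfect tiling of the loop extent.
# On a TracedSchedule (#8623), this call and the split below are recorded
# as Instructions (#8615) in a Trace, so the whole schedule can be
# replayed, mutated, or re-sampled by the search strategy.
factors = sch.sample_perfect_tile(i, n=2)
sch.split(i, factors=factors)
```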
[M3b] Enable measurement
- [x] Argument Info #9059
- [x] Builder; Builder input/result #9044
- [x] Runner; Runner input/result #9111
- [x] Tuning Record; Database #9061
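As a rough illustration of how these pieces fit together, the sketch below pushes one candidate through a local builder and runner. It assumes the `tvm.meta_schedule.{builder,runner,arg_info}` classes named above; exact constructor signatures may differ across versions.

```python
import tvm
from tvm.script import tir as T
from tvm.meta_schedule.arg_info import ArgInfo
from tvm.meta_schedule.builder import BuilderInput, LocalBuilder
from tvm.meta_schedule.runner import LocalRunner, RunnerInput

@T.prim_func
def vecadd(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (1024,), "float32")
    B = T.match_buffer(b, (1024,), "float32")
    C = T.match_buffer(c, (1024,), "float32")
    for i in T.serial(1024):
        with T.block("C"):
            vi = T.axis.remap("S", [i])
            C[vi] = A[vi] + B[vi]

target = tvm.target.Target("llvm")
mod = tvm.IRModule({"main": vecadd})

# Builder (#9044): compile the candidate into a shared library.
(build_res,) = LocalBuilder().build([BuilderInput(mod, target)])

# Argument Info (#9059): shapes/dtypes the runner needs to allocate inputs.
args_info = ArgInfo.from_prim_func(vecadd)

# Runner (#9111): time the artifact; results come back as futures.
(future,) = LocalRunner().run(
    [RunnerInput(build_res.artifact_path, "llvm", args_info)]
)
print(future.result().run_secs)
```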
[M3c] Enhance search
- [x] ScheduleRule, Mutator, PostProcessor #9761 #9789
- [x] Cost model #9859 #9789
- [x] Feature extraction #9760 #9860
- [x] Measure callback #9780
[M4a] Performance & Coverage
Schedule Rules
- [x] Add-RFactor #9975
- [x] Auto-Inline #9943
- [x] Cross-Thread-Reduction #9994
- [x] Multi-Level-Tiling #10043
- [x] Parallel-Vectorize-Unroll #10033
- [x] Random-Compute-Location #9940
PostProcessors
- [x] Disallow-Dynamic-Loop #9997
- [x] Rewrite-Cooperative-Fetch #10081
- [x] Rewrite-Parallel-Vectorize-Unroll #10071
- [x] Rewrite-Reduction-Block #10013
- [x] Rewrite-Unbound-Block #10027
- [x] Verify-GPU-Code #9945
Mutators
- [x] Mutate-Compute-Location #10028
- [x] Mutate-Parallel #10096
- [x] Mutate-Tile-Size #10092
- [x] Mutate-Unroll #10045
User interface
- [x] Tune-TE #10079
- [x] Tune-TIR #10079
- [x] Tune-Relay #10079
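For flavor, a minimal sketch of the Tune-TIR entry point, assuming the `ms.tune_tir`/`ms.TuneConfig` API of roughly the v0.9 era (it has since been reworked); Tune-TE and Tune-Relay follow the same pattern with a TE compute or a Relay module as input.

```python
import tvm
from tvm import meta_schedule as ms
from tvm.script import tir as T

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

# Tune-TIR (#10079): search for a good schedule of a single PrimFunc.
sch = ms.tune_tir(
    mod=matmul,
    target="llvm --num-cores=8",
    config=ms.TuneConfig(
        strategy="evolutionary",  # or "replay_trace" / "replay_func"
        num_trials_per_iter=64,
        max_trials_per_task=64,
        max_trials_global=64,
    ),
    work_dir="./tune_tmp",
)
if sch is not None:
    print(sch.mod.script())
```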
Misc
- [x] Local Runner #9153
- [x] Design-Space-Generator: Post-Order-Apply #9761
- [x] SearchStrategy: Replay-Func (random search) #9799
- [x] SearchStrategy: Evolutionary-Search #9836
[M4b] Relay integration
- [x] Task extraction #9382
- [x] Apply-History-Best #10049
- [x] Builder/Runner working with Relay and Relay BYOC #9757 #10055
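For context, a rough sketch of how these pieces chain together at compile time, again assuming the v0.9-era entry points (`ms.extract_task_from_relay`, `ms.ApplyHistoryBest`), which have since been reorganized; the database paths and workload are placeholders.

```python
import tvm
from tvm import meta_schedule as ms
from tvm import relay
from tvm.relay import testing

mod, params = testing.mlp.get_workload(batch_size=1)  # any Relay model
target = tvm.target.Target("llvm --num-cores=8")

# Task extraction (#9382): pull the tunable TIR tasks out of the Relay graph.
tasks = ms.extract_task_from_relay(mod, target=target, params=params)

# ... tuning happens here, filling a database with tuning records ...

# Apply-History-Best (#10049): compile, picking the best record per task.
database = ms.database.JSONDatabase(
    path_workload="./workload.json",
    path_tuning_record="./records.json",
)
with ms.ApplyHistoryBest(database):
    with tvm.transform.PassContext(
        opt_level=3,
        config={"relay.backend.use_meta_schedule": True},
    ):
        lib = relay.build(mod, target=target, params=params)
```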
M5. Operator coverage with all backends for auto tensorization
The goal is to be able to tensorize on all the backends.
- [x] TIR primitive: Re-Index #11515
- [x] TIR primitive: Transform-Block-Layout #11485
- [x] MetaSchedule auto tensorization helper: TileWithTensorIntrin #11050 #11075
- [x] MetaSchedule: enhance Multi-Level Tiling #12059 #12113
- [x] MetaSchedule: Rewrite-Tensorize #11088
- [x] Analysis: MappingProposer and AutoTensorizeComparator #11740
- [x] Intel VNNI / ARM dot variants #11088
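To sketch what auto tensorization targets, below is a toy tensor intrinsic registered via TensorIR's `TensorIntrin.register`: a hypothetical 4-wide vector add lowering to an extern call named `vadd4_f32`. The exact TVMScript spelling of `T.call_extern`/`access_ptr` varies across versions. In MetaSchedule, the enhanced Multi-Level-Tiling proposes where such an intrinsic matches (via `TileWithTensorIntrin` and the `AutoTensorizeComparator`), and Rewrite-Tensorize then rewrites the matched blocks.

```python
import tvm
from tvm.script import tir as T
from tvm.tir import TensorIntrin

# Description: the computation pattern the intrinsic matches.
@T.prim_func
def vadd_desc(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(A[0:4], B[0:4])
        T.writes(C[0:4])
        for i in T.serial(4):
            with T.block("update"):
                vi = T.axis.remap("S", [i])
                C[vi] = A[vi] + B[vi]

# Implementation: what the matched block is rewritten into (here, a call
# to a hypothetical extern function `vadd4_f32`).
@T.prim_func
def vadd_impl(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (4,), "float32", offset_factor=1)
    B = T.match_buffer(b, (4,), "float32", offset_factor=1)
    C = T.match_buffer(c, (4,), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(A[0:4], B[0:4])
        T.writes(C[0:4])
        T.evaluate(
            T.call_extern(
                "float32", "vadd4_f32",
                A.access_ptr("r"), B.access_ptr("r"), C.access_ptr("w"),
            )
        )

TensorIntrin.register("demo.vadd4", vadd_desc, vadd_impl)
# Manual use on a schedule: sch.tensorize(inner_loop, "demo.vadd4")
```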
M6. Memory optimization
These items matter for CUDA performance rather than CPU, and are not related to functionality.
- [ ] TIR primitive: Read/Write-at
- [ ] Support ewise fusion in MemHammer
- [ ] Cover non-fp16, non-wmma usecases
- [x] Shared memory auto padding #12759
- [ ] Global memory coalescing
- [ ] Shared ⇒ WMMA, WMMA ⇒ shared/global rewriting
- [x] Insert caching stage #12355
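Much of M6 is about automatically inserting and rewriting data-movement stages. The sketch below does the equivalent by hand with the stable `cache_read` schedule primitive (toy matmul, `shared` scope), which is roughly what "Insert caching stage" automates.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

sch = tvm.tir.Schedule(matmul)
block = sch.get_block("C")
# Stage the operands through shared memory: index 0 is A, index 1 is B in
# the block's read order. M6 rewrites such stages (padding, coalescing,
# WMMA transfers) automatically.
a_shared = sch.cache_read(block, 0, "shared")
b_shared = sch.cache_read(block, 1, "shared")
print(sch.mod.script())
```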
M7. Unblock end-to-end experiments
- [ ] Handle reshape fusion
- [ ] Develop scripts to run experiments
- [ ] Benchmark on the selected operator set (C1D, C2D, C3D, CAP, DIL, GMM, GRP, T2D)
- [ ] Performance alignment attempt
M8. Broader Set of Intrinsics and Optimization
- [x] async pipeline #11368
- [ ] Permuted layout
- [x] LDMatrix / MMA #11355
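As a rough sketch of the async pipeline item, the annotations below are the ones consumed by `tir.transform.InjectSoftwarePipeline`. A real GPU schedule would additionally need loop splitting, thread bindings, and reduction decomposition; the stage/order values here are illustrative only.

```python
import tvm
from tvm.script import tir as T

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

sch = tvm.tir.Schedule(matmul)
blk = sch.get_block("C")
i, j, k = sch.get_loops(blk)
# Stage A and B through shared memory under the k loop, so the pipeline
# has distinct copy/compute stages to overlap.
a_sh = sch.cache_read(blk, 0, "shared")
b_sh = sch.cache_read(blk, 1, "shared")
sch.compute_at(a_sh, k)
sch.compute_at(b_sh, k)
# Pipeline annotations: a stage and order per direct child block of the
# loop, plus which stages run asynchronously (e.g. cp.async on sm80),
# per the async pipeline work tracked in #11368.
sch.annotate(k, ann_key="software_pipeline_stage", ann_val=[0, 0, 1])
sch.annotate(k, ann_key="software_pipeline_order", ann_val=[0, 1, 2])
sch.annotate(k, ann_key="software_pipeline_async_stages", ann_val=[0])
```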
@junrushao1994 ,
While looking into TVM's auto-tensorization ability (to explore the search for accelerator designs & custom ISAs), permit me to ask:
- Was Auto Tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan with it?
- Also, regarding the design plan, will/does it have something in common with the principles of https://arxiv.org/abs/2101.08458 ?

Thank you!
Hey @cbalint13, thanks for asking! Absolutely!

> Was Auto Tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan with it?

The only reason is that I'm reorganizing the roadmap: auto tensorization is a huge item, and we want to have a separate tracking issue for it. As you can see, we have been upstreaming auto-tensorization-related PRs, including #9871 and #10066. My branch also contains working auto-tensorization examples if you want to try them out now :-)

> Also, regarding the design plan, will/does it have something in common with the principles of https://arxiv.org/abs/2101.08458 ?

This work is done by my fellow colleagues, and of course we are aware of it; we have a lot in common :-) Their codebase is public here. The difference is that we are now using TensorIR, a more powerful and systematic IR/scheduling system, to support tensorization.
> Hey @cbalint13, thanks for asking! Absolutely!
@junrushao1994
First, thanks a lot for your time!
- I am very happy just to witness what has been going on in TVM recently (at a mind-blowing pace).

> Was Auto Tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan with it?

> The only reason is that I'm reorganizing the roadmap: auto tensorization is a huge item, and we want to have a separate tracking issue for it. As you can see, we have been upstreaming auto-tensorization-related PRs, including #9871 and #10066. My branch also contains working auto-tensorization examples if you want to try them out now :-)

- I see now, thanks for the clarification; I noticed the recent "blockize / tensorize" PR (quite a large piece, I am still diving into it).

> Also, regarding the design plan, will/does it have something in common with the principles of https://arxiv.org/abs/2101.08458 ?

> This work is done by my fellow colleagues, and of course we are aware of it; we have a lot in common :-) Their codebase is public here. The difference is that we are now using TensorIR, a more powerful and systematic IR/scheduling system, to support tensorization.

- I was already familiar with that codebase (UNIT); it is good to know that such a feature will make it into the new TIR.
- I am thinking about a framework (early public sketch) that emits HDL (Verilog) blocks, reusable on their own and/or as CPU-ISA extensions, in many possible forms sampled from a combinatorial search space; auto-tensorization would be a key process for evaluation and metrics there.
- It may end up sampling some very weird-looking hardware (including systolic blocks), so the auto-tensorizer might need enhancement on some of the more challenging ends (as I noticed when looking at UNIT).

Can't wait to try it; I will look into the mentioned WIP branch.
Many thanks again!
Thank you @cbalint13 for your kind response! We are super excited to hear about your work and more than happy to assist/collaborate on TensorIR/MetaSchedule!
It would be good to get a status update, @junrushao1994. I would suggest we move the follow-up non-infra parts into separate tracking issues to keep things traceable.