
[RFC][Tracking Issue] Meta Schedule (AutoTIR)

Open junrushao opened this issue 4 years ago • 5 comments

This is a global tracking issue for landing the meta schedule. The RFC can be found here.

Steps

The steps are numbered following TensorIR (#7527).

[M3a] Core infrastructure

  • [x] Instruction & Trace #8615
  • [x] TracedSchedule #8623
  • [x] Sampler #8642 #8817
  • [x] Design space generator #9079
  • [x] Search strategy #9132
  • [x] Task Scheduler #9154
  • [x] Tune Context #9053
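For readers unfamiliar with the trace-based design behind the "Instruction & Trace" and "TracedSchedule" items, here is a minimal conceptual sketch in plain Python. The class names and fields are hypothetical and are not TVM's actual implementation; the point is only to show the idea that schedule primitive calls are recorded as instructions, and that the resulting trace can be replayed with different sampled decisions:

```python
class Instruction:
    """One recorded schedule primitive call (hypothetical, not TVM's real class)."""
    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}

class Trace:
    """A replayable sequence of instructions plus the sampling decisions taken."""
    def __init__(self):
        self.insts = []       # instructions in the order they were applied
        self.decisions = {}   # sampling instruction -> chosen random value

    def record(self, inst, decision=None):
        self.insts.append(inst)
        if decision is not None:
            self.decisions[inst] = decision

    def replay(self, override=None):
        """Re-run the trace, optionally overriding sampled decisions."""
        override = override or {}
        applied = []
        for inst in self.insts:
            decision = override.get(inst, self.decisions.get(inst))
            applied.append((inst.name, decision))
        return applied

# Record a toy trace: sample tile sizes, then split a loop using them.
trace = Trace()
sample = Instruction("SamplePerfectTile", {"n": 2})
trace.record(sample, decision=[4, 8])
trace.record(Instruction("Split", {"loop": "i"}))

# Replaying with a different decision is how mutators explore the space.
print(trace.replay({sample: [8, 4]}))
```

Decoupling the recorded trace from the concrete schedule is what lets the search machinery mutate and re-evaluate candidate schedules without re-deriving them from scratch.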

[M3b] Enable measurement

  • [x] Argument Info #9059
  • [x] Builder; Builder input/result #9044
  • [x] Runner; Runner input/result #9111
  • [x] Tuning Record; Database #9061

[M3c] Enhance search

  • [x] ScheduleRule, Mutator, PostProcessor #9761 #9789
  • [x] Cost model #9859 #9789
  • [x] Feature extraction #9760 #9860
  • [x] Measure callback #9780

[M4a] Performance & Coverage

Schedule Rules

  • [x] Add-RFactor #9975
  • [x] Auto-Inline #9943
  • [x] Cross-Thread-Reduction #9994
  • [x] Multi-Level-Tiling #10043
  • [x] Parallel-Vectorize-Unroll #10033
  • [x] Random-Compute-Location #9940

PostProcessors

  • [x] Disallow-Dynamic-Loop #9997
  • [x] Rewrite-Cooperative-Fetch #10081
  • [x] Rewrite-Parallel-Vectorize-Unroll #10071
  • [x] Rewrite-Reduction-Block #10013
  • [x] Rewrite-Unbound-Block #10027
  • [x] Verify-GPU-Code #9945

Mutators

  • [x] Mutate-Compute-Location #10028
  • [x] Mutate-Parallel #10096
  • [x] Mutate-Tile-Size #10092
  • [x] Mutate-Unroll #10045

User interface

  • [x] Tune-TE #10079
  • [x] Tune-TIR #10079
  • [x] Tune-Relay #10079

Misc

  • [x] Local Runner #9153
  • [x] Design-Space-Generator: Post-Order-Apply #9761
  • [x] SearchStrategy: Replay-Func (random search) #9799
  • [x] SearchStrategy: Evolutionary-Search #9836
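The two search strategies above can be contrasted with a toy sketch. This is a generic illustration in plain Python with made-up helper names, not TVM's implementation: replay-style random search draws independent candidates every trial, while evolutionary search keeps a population of good candidates and mutates them:

```python
import random

def random_search(sample_candidate, score, num_trials, seed=0):
    """Replay-style search: every trial draws an independent candidate."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(num_trials):
        cand = sample_candidate(rng)
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def evolutionary_search(sample_candidate, mutate, score, num_trials,
                        population=8, seed=0):
    """Keep a population of promising candidates and mutate them."""
    rng = random.Random(seed)
    pop = [sample_candidate(rng) for _ in range(population)]
    for _ in range(num_trials):
        parent = max(pop, key=score)      # greedily pick the best parent
        child = mutate(parent, rng)       # e.g. perturb one tile size
        pop.sort(key=score)
        if score(child) > score(pop[0]):  # replace the current worst
            pop[0] = child
    return max(pop, key=score)

# Toy "design space": pick a tile size; the score favors 16.
sample = lambda rng: rng.choice([1, 2, 4, 8, 16, 32])
mutate = lambda x, rng: max(1, (x * rng.choice([1, 2])) // rng.choice([1, 2]))
score = lambda x: -abs(x - 16)

best, best_score = random_search(sample, score, num_trials=50)
winner = evolutionary_search(sample, mutate, score, num_trials=50)
```

In the real system the candidates are schedule traces and the score comes from a learned cost model plus on-device measurement, but the control flow of the two strategies is analogous to this sketch.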

[M4b] Relay integration

  • [x] Task extraction #9382
  • [x] Apply-History-Best #10049
  • [x] Builder/Runner working with Relay and Relay BYOC #9757 #10055

M5. Operator coverage with all backends for auto tensorization

The goal is to be able to tensorize on all supported backends.

  • [x] TIR primitive: Re-Index #11515
  • [x] TIR primitive: Transform-Block-Layout #11485
  • [x] MetaSchedule auto tensorization helper: TileWithTensorIntrin #11050 #11075
  • [x] MetaSchedule: enhance Multi-Level Tiling #12059 #12113
  • [x] MetaSchedule: Rewrite-Tensorize #11088
  • [x] Analysis: MappingProposer and AutoTensorizeComparator #11740
  • [x] Intel VNNI / ARM dot variants #11088

M6. Memory optimization

Important for CUDA performance (much less so for CPU); not required for functional correctness.

  • [ ] TIR primitive: Read/Write-at
  • [ ] Support ewise fusion in MemHammer
  • [ ] Cover non-fp16, non-wmma usecases
  • [x] Shared memory auto padding #12759
  • [ ] Global memory coalescing
  • [ ] Shared ⇒ WMMA, WMMA ⇒ shared/global rewriting
  • [x] Insert caching stage #12355

M7. Unblock end-to-end experiments

  • [ ] Handle reshape fusion
  • [ ] Develop scripts to run experiment
  • [ ] Benchmark on the selected operator set (C1D, C2D, C3D, CAP, DIL, GMM, GRP, T2D)
  • [ ] Performance alignment attempt

M8. Broader Set of Intrinsics and Optimization

  • [x] async pipeline #11368
  • [ ] Permuted layout
  • [x] LDMatrix / MMA #11355

junrushao avatar Jul 14 '21 18:07 junrushao

@junrushao1994 ,

While looking into TVM's auto-tensorization capability (to explore the search for accelerator designs & custom ISAs), permit me to ask:

  • Was auto tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan for it?
  • Also, regarding the design plan, will it have something in common with the principles of https://arxiv.org/abs/2101.08458?

Thank you!

cbalint13 avatar Jan 26 '22 15:01 cbalint13

Hey @cbalint13 thanks for asking! Absolutely!

Was auto tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan for it?

The only reason is that I'm trying to organize the roadmap: auto tensorization is a huge item, and we want a separate tracking issue for it. As you may have already seen, we have been upstreaming auto-tensorization-related PRs, including #9871 and #10066. My branch also contains working auto-tensorization examples if you want to try them out now :-)

Also, regarding the design plan, will it have something in common with the principles of https://arxiv.org/abs/2101.08458?

This work was done by fellow colleagues of mine; of course we are aware of it, and we have a lot in common :-) Their codebase is public here. The difference is that we are now using TensorIR, a more powerful and systematic IR/scheduling system, to support tensorization.

junrushao avatar Jan 26 '22 18:01 junrushao

@junrushao1994

First, thanks a lot for your time!

  • I am very happy just to witness what is going on recently in TVM (at a mind-blowing pace).

Was auto tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan for it?

The only reason is that I'm trying to organize the roadmap: auto tensorization is a huge item, and we want a separate tracking issue for it. As you may have already seen, we have been upstreaming auto-tensorization-related PRs, including #9871 and #10066. My branch also contains working auto-tensorization examples if you want to try them out now :-)

  • I see now, thanks for the clarification. I noticed the recent "blockize/tensorize" PR (quite a large piece; I am still diving into it).

Also, regarding the design plan, will it have something in common with the principles of https://arxiv.org/abs/2101.08458?

This work was done by fellow colleagues of mine; of course we are aware of it, and we have a lot in common :-) Their codebase is public here. The difference is that we are now using TensorIR, a more powerful and systematic IR/scheduling system, to support tensorization.

  • I was familiar with that codebase for UNIT; it is good to know that such a feature will make it into the new TIR.
  • I am thinking of a framework (early public sketch) that emits HDL (Verilog) blocks, reusable on their own and/or as CPU ISA extensions, in many possible forms sampled from a combinatorial search space; auto-tensorization would be a key process for evaluation and metrics here.
  • It may end up sampling some very weird-looking hardware (including systolic blocks), so the auto-tensorizer might need enhancement on some of the more challenging ends (as I already saw when looking at UNIT).

I can't wait to try it; I will look into the mentioned early WIP branch.

Many thanks again!

cbalint13 avatar Jan 26 '22 19:01 cbalint13

Thank you @cbalint13 for your kind response! We are super excited to hear about your work and more than happy to assist/collaborate on TensorIR/MetaSchedule!

junrushao avatar Jan 27 '22 02:01 junrushao

It would be good to get a status update, @junrushao1994. I would suggest we move the follow-up non-infra parts into separate tracking issues to keep things traceable.

tqchen avatar Jul 26 '22 19:07 tqchen