
[RFC][Tracking Issue] Meta Schedule (AutoTIR)

Open junrushao opened this issue 4 years ago • 5 comments

This is a global tracking issue for landing the meta schedule. The RFC can be found here.

Steps

The steps are numbered following TensorIR (#7527).

[M3a] Core infrastructure

  • [x] Instruction & Trace #8615
  • [x] TracedSchedule #8623
  • [x] Sampler #8642 #8817
  • [x] Design space generator #9079
  • [x] Search strategy #9132
  • [x] Task Scheduler #9154
  • [x] Tune Context #9053
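For readers unfamiliar with the trace-based design behind the "Instruction & Trace" and "TracedSchedule" items, here is a minimal conceptual sketch in plain Python. The class names and fields are hypothetical and are not TVM's actual implementation; the point is only to show the idea that schedule primitive calls are recorded as instructions, and that the resulting trace can be replayed with different sampled decisions:

```python
class Instruction:
    """One recorded schedule primitive call (hypothetical, not TVM's real class)."""
    def __init__(self, name, attrs=None):
        self.name = name
        self.attrs = attrs or {}

class Trace:
    """A replayable sequence of instructions plus the sampling decisions taken."""
    def __init__(self):
        self.insts = []       # instructions in the order they were applied
        self.decisions = {}   # sampling instruction -> chosen random value

    def record(self, inst, decision=None):
        self.insts.append(inst)
        if decision is not None:
            self.decisions[inst] = decision

    def replay(self, override=None):
        """Re-run the trace, optionally overriding sampled decisions."""
        override = override or {}
        applied = []
        for inst in self.insts:
            decision = override.get(inst, self.decisions.get(inst))
            applied.append((inst.name, decision))
        return applied

# Record a toy trace: sample tile sizes, then split a loop using them.
trace = Trace()
sample = Instruction("SamplePerfectTile", {"n": 2})
trace.record(sample, decision=[4, 8])
trace.record(Instruction("Split", {"loop": "i"}))

# Replaying with a different decision is how mutators explore the space.
print(trace.replay({sample: [8, 4]}))
```

Decoupling the recorded trace from the concrete schedule is what lets the search machinery mutate and re-evaluate candidate schedules without re-deriving them from scratch.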

[M3b] Enable measurement

  • [x] Argument Info #9059
  • [x] Builder; Builder input/result #9044
  • [x] Runner; Runner input/result #9111
  • [x] Tuning Record; Database #9061

[M3c] Enhance search

  • [x] ScheduleRule, Mutator, PostProcessor #9761 #9789
  • [x] Cost model #9859 #9789
  • [x] Feature extraction #9760 #9860
  • [x] Measure callback #9780

[M4a] Performance & Coverage

Schedule Rules

  • [x] Add-RFactor #9975
  • [x] Auto-Inline #9943
  • [x] Cross-Thread-Reduction #9994
  • [x] Multi-Level-Tiling #10043
  • [x] Parallel-Vectorize-Unroll #10033
  • [x] Random-Compute-Location #9940

PostProcessors

  • [x] Disallow-Dynamic-Loop #9997
  • [x] Rewrite-Cooperative-Fetch #10081
  • [x] Rewrite-Parallel-Vectorize-Unroll #10071
  • [x] Rewrite-Reduction-Block #10013
  • [x] Rewrite-Unbound-Block #10027
  • [x] Verify-GPU-Code #9945

Mutators

  • [x] Mutate-Compute-Location #10028
  • [x] Mutate-Parallel #10096
  • [x] Mutate-Tile-Size #10092
  • [x] Mutate-Unroll #10045

User interface

  • [x] Tune-TE #10079
  • [x] Tune-TIR #10079
  • [x] Tune-Relay #10079

Misc

  • [x] Local Runner #9153
  • [x] Design-Space-Generator: Post-Order-Apply #9761
  • [x] SearchStrategy: Replay-Func (random search) #9799
  • [x] SearchStrategy: Evolutionary-Search #9836
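The two search strategies above can be contrasted with a toy sketch. This is a generic illustration in plain Python with made-up helper names, not TVM's implementation: replay-style random search draws independent candidates every trial, while evolutionary search keeps a population of good candidates and mutates them:

```python
import random

def random_search(sample_candidate, score, num_trials, seed=0):
    """Replay-style search: every trial draws an independent candidate."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(num_trials):
        cand = sample_candidate(rng)
        s = score(cand)
        if s > best_score:
            best, best_score = cand, s
    return best, best_score

def evolutionary_search(sample_candidate, mutate, score, num_trials,
                        population=8, seed=0):
    """Keep a population of promising candidates and mutate them."""
    rng = random.Random(seed)
    pop = [sample_candidate(rng) for _ in range(population)]
    for _ in range(num_trials):
        parent = max(pop, key=score)      # greedily pick the best parent
        child = mutate(parent, rng)       # e.g. perturb one tile size
        pop.sort(key=score)
        if score(child) > score(pop[0]):  # replace the current worst
            pop[0] = child
    return max(pop, key=score)

# Toy "design space": pick a tile size; the score favors 16.
sample = lambda rng: rng.choice([1, 2, 4, 8, 16, 32])
mutate = lambda x, rng: max(1, (x * rng.choice([1, 2])) // rng.choice([1, 2]))
score = lambda x: -abs(x - 16)

best, best_score = random_search(sample, score, num_trials=50)
winner = evolutionary_search(sample, mutate, score, num_trials=50)
```

In the real system the candidates are schedule traces and the score comes from a learned cost model plus on-device measurement, but the control flow of the two strategies is analogous to this sketch.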

[M4b] Relay integration

  • [x] Task extraction #9382
  • [x] Apply-History-Best #10049
  • [x] Builder/Runner working with Relay and Relay BYOC #9757 #10055

M5. Operator coverage with all backends for auto tensorization

The goal is to be able to tensorize on all supported backends.

  • [x] TIR primitive: Re-Index #11515
  • [x] TIR primitive: Transform-Block-Layout #11485
  • [x] MetaSchedule auto tensorization helper: TileWithTensorIntrin #11050 #11075
  • [x] MetaSchedule: enhance Multi-Level Tiling #12059 #12113
  • [x] MetaSchedule: Rewrite-Tensorize #11088
  • [x] Analysis: MappingProposer and AutoTensorizeComparator #11740
  • [x] Intel VNNI / ARM dot variants #11088

M6. Memory optimization

Important for CUDA performance (much less so for CPU); not required for functional correctness.

  • [ ] TIR primitive: Read/Write-at
  • [ ] Support ewise fusion in MemHammer
  • [ ] Cover non-fp16, non-wmma usecases
  • [x] Shared memory auto padding #12759
  • [ ] Global memory coalescing
  • [ ] Shared ⇒ WMMA, WMMA ⇒ shared/global rewriting
  • [x] Insert caching stage #12355

M7. Unblock end-to-end experiments

  • [ ] Handle reshape fusion
  • [ ] Develop scripts to run experiment
  • [ ] Benchmark on the selected operator set (C1D, C2D, C3D, CAP, DIL, GMM, GRP, T2D)
  • [ ] Performance alignment attempt

M8. Broader Set of Intrinsics and Optimization

  • [x] async pipeline #11368
  • [ ] Permuted layout
  • [x] LDMatrix / MMA #11355

junrushao avatar Jul 14 '21 18:07 junrushao

@junrushao1994 ,

While looking into TVM's auto-tensorization capability (to explore the search for accelerator designs & custom ISAs), permit me to ask:

  • Was auto tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan for it?
  • Also, regarding the design plan, will it have something in common with the principles of https://arxiv.org/abs/2101.08458?

Thank you!

cbalint13 avatar Jan 26 '22 15:01 cbalint13

Hey @cbalint13 thanks for asking! Absolutely!

Was auto tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan for it?

The only reason is that I'm trying to organize the roadmap: auto tensorization is a huge item, and we want a separate tracking issue for it. As you may have already seen, we have been upstreaming auto-tensorization-related PRs, including #9871 and #10066. My branch also contains working auto-tensorization examples if you want to try them out now :-)

Also, regarding the design plan, will it have something in common with the principles of https://arxiv.org/abs/2101.08458?

This work was done by fellow colleagues of mine; of course we are aware of it, and we have a lot in common :-) Their codebase is public here. The difference is that we are now using TensorIR, a more powerful and systematic IR/scheduling system, to support tensorization.

junrushao avatar Jan 26 '22 18:01 junrushao

@junrushao1994

First, thanks a lot for your time!

  • I am very happy just to witness what is going on recently in TVM (at a mind-blowing pace).

Was auto tensorization removed from this list (it was at section [M4b], if I recall)? What was/is the plan for it?

The only reason is that I'm trying to organize the roadmap: auto tensorization is a huge item, and we want a separate tracking issue for it. As you may have already seen, we have been upstreaming auto-tensorization-related PRs, including #9871 and #10066. My branch also contains working auto-tensorization examples if you want to try them out now :-)

  • I see now, thanks for the clarification. I noticed the recent "blockize/tensorize" PR (quite a large piece; I am still diving into it).

Also, regarding the design plan, will it have something in common with the principles of https://arxiv.org/abs/2101.08458?

This work was done by fellow colleagues of mine; of course we are aware of it, and we have a lot in common :-) Their codebase is public here. The difference is that we are now using TensorIR, a more powerful and systematic IR/scheduling system, to support tensorization.

  • I was familiar with that codebase for UNIT; it is good to know that such a feature will make it into the new TIR.
  • I am thinking of a framework (early public sketch) that emits HDL (Verilog) blocks, reusable on their own and/or as CPU ISA extensions, in many possible forms sampled from a combinatorial search space; auto-tensorization would be a key process for evaluation and metrics here.
  • It may end up sampling some very weird-looking hardware (including systolic blocks), so the auto-tensorizer might need enhancement on some of the more challenging ends (as I already saw when looking at UNIT).

I can't wait to try it; I will look into the mentioned early WIP branch.

Many thanks again!

cbalint13 avatar Jan 26 '22 19:01 cbalint13

Thank you @cbalint13 for your kind response! We are super excited to hear about your work and more than happy to assist/collaborate on TensorIR/MetaSchedule!

junrushao avatar Jan 27 '22 02:01 junrushao

It would be good to get a status update, @junrushao1994. I would suggest we move the follow-up non-infra parts into separate tracking issues to keep things traceable.

tqchen avatar Jul 26 '22 19:07 tqchen