[Tracking Issue] Relax graph-level BYOC
I've been working on bringing up the BYOC infra in Relax, building on the work of @sunggg and the pattern-matcher work from @ganler. The ultimate goal is to make `relax.vm.build(mod, "cuda")` just work, without tuning and with reasonable out-of-the-box performance. It would also be the first step toward performant dynamic-shape support.
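For concreteness, here is a minimal sketch of what that flow would look like from the user's side. It assumes `mod` is a Relax IRModule with a `main` function taking one float32 tensor (e.g. imported by a frontend); the `VirtualMachine` usage follows current conventions and may differ in detail on the branch.

```python
# Minimal sketch of the intended user-facing flow; `mod` is an assumed
# Relax IRModule coming from a frontend importer.
import numpy as np
import tvm
from tvm import relax

ex = relax.vm.build(mod, "cuda")           # the call this work aims to make "just work"
vm = relax.VirtualMachine(ex, tvm.cuda())  # run the resulting executable on GPU
inp = tvm.nd.array(np.random.rand(1, 3, 224, 224).astype("float32"), tvm.cuda())
out = vm["main"](inp)
```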
My branch is at https://github.com/tlc-pack/relax/compare/relax...masahi:codegen-cutlass?expand=1; it currently has minimal test cases for offloading a simple subgraph to DNNL and CUTLASS. I'm going to start sending pieces of it as PRs starting today.
- [x] Refactor `RunCodegen` pass to send all BYOC functions to the backend at once (rather than individually)
- [x] Add pattern-based partitioning pass (similar to `MergeComposite` in Relay; see the sketch after this list)
- [x] ~~Add pass to wrap and annotate the partitioned function for offloading~~ (subsumed by https://github.com/tlc-pack/relax/pull/372)
- [x] Add DNNL backend
- [x] Add CUTLASS backend
- [x] Add pass to merge neighboring calls to functions compiled for the same external backend into one function (similar to `MergeCompilerRegion` in Relay, necessary for TRT)
- [x] Revisit TensorRT backend (originally added by https://github.com/tlc-pack/relax/pull/164)
Future possibilities (time permitting)
- [ ] Add cuDNN backend (supporting Graph API)
- [ ] Add oneDNN (aka dnnl) v3 graph API backend
- [ ] Advanced fusion, such as fused MHA
- [ ] Take advantage of graph-level passes (constant folding, scale-axis folding, layout transformation, etc.) when they become available
- [x] Add mechanism to handle constants (a recurring problem in Relay BYOC) (initial work in https://github.com/tlc-pack/relax/pull/400; not sure if it is complete)
- [ ] Improve each backend (more patterns, end-to-end evaluation, etc.)
cc @sunggg @YuchenJin @tqchen @junrushao