[USMP] Initial implementation of liveness analysis for Relax + TIR
This PR adds an initial implementation of liveness analysis of tensors/buffers for Relax and TIR programs.
@areusch @mbaret @YuchenJin @mikepapadim
Thanks @gigiblender for integrating USMP into Relax!
One idea about the liveness analysis pass: we could have a memory lifting pass that first lifts the memory allocations in TIR into Relax. This would allow the liveness analysis pass to analyze only the Relax functions, without needing to analyze the TIR PrimFuncs in the IRModule. Would love to hear your thoughts. 😄
And one suggestion for test case construction: we encourage developers to use the `block_builder` and `emit_te` APIs to construct the IRModule if the TVMScript is very long, for example: https://github.com/tlc-pack/relax/blob/relax/tests/python/relax/test_transform_fuse_ops.py#L51-L57. This will make the test case more concise.
Thanks @YuchenJin!
One challenge with lifting allocs is that if a TIR PrimFunc has two internal allocs whose live ranges don't overlap, we wouldn't be able to detect that solely by looking at `Call(relax.builtin.alloc_tensor`. However, I think we might want to iterate on this PR to derive liveness from first/last usage rather than just from alloc nodes, so maybe this is less of a concern.
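To make the first/last-usage idea concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use any TVM API: the `(reads, writes)` op encoding and the names `live_ranges` and `can_share` are hypothetical, chosen only for illustration. It shows how two internal buffers of one function can be seen as non-overlapping once liveness is derived from usage.

```python
def live_ranges(ops):
    """Map each buffer name to (first_step, last_step) over a linear op list.

    `ops` is a hypothetical encoding: one (reads, writes) pair per step,
    listed in execution order.
    """
    ranges = {}
    for step, (reads, writes) in enumerate(ops):
        for name in (*writes, *reads):
            first, _ = ranges.get(name, (step, step))
            ranges[name] = (first, step)
    return ranges


def can_share(ranges, a, b):
    """Two buffers may reuse the same memory if their live ranges are disjoint."""
    (a_first, a_last), (b_first, b_last) = ranges[a], ranges[b]
    return a_last < b_first or b_last < a_first


# A made-up function body with two internal temporaries, tmp_a and tmp_b.
ops = [
    (["x"], ["tmp_a"]),    # step 0: tmp_a = f(x)
    (["tmp_a"], ["y"]),    # step 1: y = g(tmp_a)
    (["y"], ["tmp_b"]),    # step 2: tmp_b = h(y)
    (["tmp_b"], ["out"]),  # step 3: out = k(tmp_b)
]
ranges = live_ranges(ops)
print(ranges["tmp_a"], ranges["tmp_b"], can_share(ranges, "tmp_a", "tmp_b"))
```

In this toy model `tmp_a` and `tmp_b` never live at the same time, which is exactly the information that is lost if only the alloc sites are visible.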
Thanks @areusch! If we run the MetaSchedule tuning pass or other transformations/schedules first (which is usually the case, since memory planning happens at a later stage of compilation), the temporary allocs inside a TIR PrimFunc will be removed, so there will usually not be multiple temporary allocs in a TIR PrimFunc. Would love to know the cases where there are several temporary allocs.
Hm, I was thinking that you would see this case when doing multi-anchor fusion. I haven't explored that enough yet to know, though. It does seem like there isn't anything in TIR preventing this case from happening, and if folks are writing custom TIR passes, it might not be sufficient to rely on MetaSchedule to reuse Buffers in TIR. With that said, this might not be as high a priority if MetaSchedule does do this.
I'm not sure resolving this question changes the approach of modifying the LivenessAnalysis to generate alloc/kill events based on usage. However, it's certainly a good thing to understand further.
Yes, I agree it does not change the general approach. My thought is that if there are usually not multiple temporary allocs in a TIR PrimFunc, the liveness analysis pass would just need to traverse the Relax function after memory lifting, which would simplify the assumptions and greatly reduce the complexity of the liveness analysis pass. :)
Ethos-U is a motivator for this functionality, as it doesn't use MetaSchedule but does have multiple allocates in a single PrimFunc. Doing buffer consolidation on a per-PrimFunc basis will also generally be less efficient than doing it with global knowledge, where the memory fragmentation pattern is known.
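A toy illustration of the per-PrimFunc versus global point, as a sketch under stated assumptions rather than USMP's actual algorithm. The planner below is a simple greedy first-fit by decreasing size. The buffer sizes, step numbers, and the names `Buf`, `plan`, and `f_workspace` are all made up. "Per-PrimFunc consolidation" is modelled, as an assumption, by packing a function's internal buffers into one workspace that stays live for the whole function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buf:
    name: str
    size: int
    start: int  # first step that uses the buffer
    end: int    # last step that uses the buffer (inclusive)


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def plan(bufs):
    """Greedy first-fit by decreasing size; returns ({name: offset}, peak)."""
    placed = []  # (offset, Buf) pairs already assigned
    offsets = {}
    for buf in sorted(bufs, key=lambda b: (-b.size, b.name)):
        conflicts = sorted(
            (p for p in placed if overlaps(buf, p[1])), key=lambda p: p[0]
        )
        offset = 0
        for other_off, other in conflicts:
            if offset + buf.size <= other_off:
                break  # buf fits in the gap below this conflicting buffer
            offset = max(offset, other_off + other.size)
        placed.append((offset, buf))
        offsets[buf.name] = offset
    peak = max((off + b.size for off, b in placed), default=0)
    return offsets, peak


# Hypothetical function f runs over steps 2..5 with two internal buffers.
f_internal = [Buf("t1", 100, 2, 2), Buf("t2", 10, 3, 5)]
# A graph-level tensor that becomes live while f is still running.
graph_level = [Buf("z", 100, 3, 6)]

# Per-function: plan f alone, then expose one workspace live for all of f.
_, f_peak = plan(f_internal)
workspace = Buf("f_workspace", f_peak, 2, 5)
_, per_func_peak = plan(graph_level + [workspace])

# Global: plan every buffer together with its own live range.
_, global_peak = plan(graph_level + f_internal)
print(per_func_peak, global_peak)
```

With these made-up numbers the per-function plan peaks at 200 bytes while the global plan peaks at 110, because the global planner can see that `t1` is dead before `z` becomes live.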