tvm
[TOPI] Add layer norm operator
This PR adds a tuple-sum-based implementation of layer norm. It performs a one-pass reduction that computes the mean and variance at the same time.
A reducer pattern is also added so that LowerCrossThreadReduction can handle this case.
On CUDA, it generates two kernels: one for the reduction and one for the elementwise operations. Because of a current limitation of compute_at, we are not yet able to fuse them into one kernel.
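
For readers unfamiliar with the tuple-sum idea, here is a minimal pure-Python sketch of it, not the TOPI code from this PR. It assumes the reduced pair is (sum of x, sum of x²) and that the variance is recovered as E[x²] − mean²; the names `combine` and `layer_norm_one_pass` and the `eps` default are hypothetical and chosen only for illustration.

```python
import math
from functools import reduce


def combine(lhs, rhs):
    """Tuple-sum reducer: add two partial (sum, sum_of_squares) pairs."""
    return (lhs[0] + rhs[0], lhs[1] + rhs[1])


def layer_norm_one_pass(row, gamma, beta, eps=1e-5):
    """Layer norm over one row, using a single reduction for mean and variance.

    Illustrative only; eps and the argument layout are assumptions.
    """
    n = len(row)
    # Reduction stage: one pass over the data, each element contributes (x, x*x).
    total, total_sq = reduce(combine, ((v, v * v) for v in row), (0.0, 0.0))
    mean = total / n
    var = total_sq / n - mean * mean  # population variance, E[x^2] - mean^2
    inv_std = 1.0 / math.sqrt(var + eps)
    # Elementwise stage: normalize, then scale and shift.
    return [(v - mean) * inv_std * g + b for v, g, b in zip(row, gamma, beta)]


out = layer_norm_one_pass([1.0, 2.0, 3.0, 4.0], [1.0] * 4, [0.0] * 4)
```

The two stages in the sketch correspond loosely to the two CUDA kernels described above. Note that computing the variance as E[x²] − mean² can lose precision when the mean is large relative to the spread of the data.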
cc @MasterJH5574 @junrushao @AndrewZhaoLuo