bookend `view` should be stripped from fusion
🚀 The feature, motivation and pitch
Currently the handling of `view` in the scheduler is sub-optimal.
For views inside the fusion group that connect the surrounding ops, fusing them makes sense, since this usually gives us a bigger graph and can be beneficial to fusion.
However, for view ops that surround a fusion (bookend views), there's really no good reason to fuse them. Since a view is only a metadata operation, leaving it outside the fusion could give us better codegen perf for now.
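As a rough illustration of the idea, here is a minimal sketch in plain PyTorch (not the actual pass; `gelu_math` is a made-up name): for the Gelu case below, the bookend views are metadata-only and could stay in eager mode, so only the pointwise math would need to be sent to nvFuser.

```python
import torch

def gelu_math(x):
    # the pointwise body that would remain in the fused kernel
    return (x * 0.5) * (1.0 + torch.erf(x * 0.707107))

t0 = torch.randn(8192, 3072, device='cuda')
t1 = t0.view(64, 128, 3072)   # bookend view: metadata-only, could stay in eager
t9 = gelu_math(t1)            # only this part benefits from codegen/fusion
t10 = t9.view(8192, 3072)     # bookend view: metadata-only, could stay in eager
```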
A few concrete issues that were raised by @kevinstephano:
- Here's another example of a fusion we might not want to send to nvFuser: it is a transpose after Matmul2 and before the output Linear of multi-head attention, and it results in 2 kernels. The 2 kernels take ~50 us, whereas the single transpose kernel in eager mode takes ~40 us.
```python
def nvfuser_fusion_id7(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1, -1], contiguous=[True, True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[768, 128, 64], new_shape=[64, 12, 128, 64])
    T2 = fd.ops.permute(T1, dims=[0, 2, 1, 3])
    T3 = fd.ops.set(T2)
    T4 = fd.ops.view(T3, original_shape=[64, 128, 12, 64], new_shape=[64, 128, 768])
    T5 = fd.ops.view(T4, original_shape=[64, 128, 768], new_shape=[8192, 768])
    fd.add_output(T5)
```
```
Arguments for fusion8:
Inputs:
  tensor dtype: float sizes: (768, 128, 64, ) stride: (8192, 64, 1, ) pointer: 0x7f53a6000000
Outputs:
Launch Parameters: BlockDim.x = 128, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = -1, GridDim.y = -1, GridDim.z = -1, Smem Size = 0
```
- In the feed-forward block, Gelu is getting decomposed and re-fused, and it also pulls in the bookend views. The result is a kernel that runs at 0.62x of the eager-mode Gelu kernel: 191 us vs 118 us.
```python
def nvfuser_fusion_id10(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[8192, 3072], new_shape=[64, 128, 3072])
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T1, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T1, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    T9 = fd.ops.mul(T3, T8)
    T10 = fd.ops.view(T9, original_shape=[64, 128, 3072], new_shape=[8192, 3072])
    fd.add_output(T1)
    fd.add_output(T10)
```
```
Arguments for fusion12:
Inputs:
  tensor dtype: float sizes: (8192, 3072, ) stride: (3072, 1, ) pointer: 0x7f5148800000
Outputs:
Launch Parameters: BlockDim.x = 128, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = -1, GridDim.y = -1, GridDim.z = -1, Smem Size = 0
```
The current plan is to add a quick pass that strips view-like ops on the boundary of a fusion. We'll see how that impacts perf.
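For illustration, here is a minimal sketch of what such a boundary-stripping pass could look like; this is hypothetical (the op names, the `(name, inputs, outputs)` representation, and `strip_bookend_meta_ops` are made up for this sketch, not the actual nvFuser pass):

```python
# Hypothetical sketch: peel metadata-only ops off the fusion boundary and
# leave them to eager mode. Which ops count as "view-like" is an open choice.
META_OPS = {"view", "reshape", "permute", "transpose", "squeeze", "unsqueeze"}

def strip_bookend_meta_ops(ops, fusion_inputs, fusion_outputs):
    """ops: topologically ordered (name, inputs, outputs) tuples for one fusion group."""
    fusion_inputs, fusion_outputs = set(fusion_inputs), set(fusion_outputs)
    changed = True
    while changed:
        changed = False
        for op in list(ops):
            name, ins, outs = op
            if name not in META_OPS:
                continue
            if all(i in fusion_inputs for i in ins):
                # leading bookend: run it eagerly; its outputs become new fusion inputs
                ops.remove(op)
                fusion_inputs.update(outs)
                changed = True
            elif all(o in fusion_outputs for o in outs):
                # trailing bookend: run it eagerly; its inputs become new fusion outputs
                ops.remove(op)
                fusion_outputs.update(ins)
                changed = True
    return ops, fusion_inputs, fusion_outputs
```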
Alternatives
No response
Additional context
No response
I have a PR enabling this, along with some cleanup of Ivan's refactoring of the getitem special handling: #2099
Repro of the permute-based fusion:
```python
import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1, -1], contiguous=[True, True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[768, 128, 64], new_shape=[64, 12, 128, 64])
    T2 = fd.ops.permute(T1, dims=[0, 2, 1, 3])
    T3 = fd.ops.set(T2)
    T4 = fd.ops.view(T3, original_shape=[64, 128, 12, 64], new_shape=[64, 128, 768])
    T5 = fd.ops.view(T4, original_shape=[64, 128, 768], new_shape=[8192, 768])
    fd.add_output(T5)

fs = Fusion()
with FusionDefinition(fs) as fd :
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn(768, 128, 64, device='cuda')
]

for _ in range(5):
    out = fs.execute(inputs)
```
This is a more complicated case: a transpose with bookend views. The resulting fusion is segmented into two kernels.
Here is a second example that should be easier, since the second view undoes the first view; I would start here. This is a Gelu operation from the feed-forward network of a Transformer.
```python
import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[8192, 3072], new_shape=[64, 128, 3072])
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T1, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T1, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    T9 = fd.ops.mul(T3, T8)
    T10 = fd.ops.view(T9, original_shape=[64, 128, 3072], new_shape=[8192, 3072])
    fd.add_output(T10)

fs = Fusion()
with FusionDefinition(fs) as fd :
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn(8192, 3072, device='cuda')
]

for _ in range(5):
    out = fs.execute(inputs)
```
Perf:
```
1534646325 196768 334 49152 1 1 128 1 1 28 0.000 0.000 NVIDIA A100 80GB PCIe (0) 1 7 CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)2>, CudaCodeGen::Tensor<float, (int)3>, CudaCo…
1534844117 186913 344 6 8192 1 128 1 1 16 0.000 0.000 NVIDIA A100 80GB PCIe (0) 1 7 CudaCodeGen::kernel2(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>, CudaCo…
```
A100 Kernel Time = 197us + 187us = 384us
Repro with no views:
```python
import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1, -1], contiguous=[True, True, True], dtype=DataType.Float)
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T0, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T0, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    T9 = fd.ops.mul(T3, T8)
    fd.add_output(T9)

fs = Fusion()
with FusionDefinition(fs) as fd :
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn(64, 128, 3072, device='cuda')
]

for _ in range(5):
    out = fs.execute(inputs)
```
Perf:
```
1326379136 127169 226 49152 1 1 128 1 1 29 0.000 0.000 NVIDIA A100 80GB PCIe (0) 1 7 CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)
```
A100 Kernel time = 127us.
Note: I think Christian made an effort to fix the splitting of view fusions into two kernels, but the perf did not match the non-view perf. I believe this was his code branch:
https://github.com/csarofeen/pytorch/commits/view_reduction
This PR shows the differences between that branch and the devel branch, so you can see what was changed:
https://github.com/csarofeen/pytorch/pull/2039
I looked at this example using PYTORCH_NVFUSER_DUMP=segmenter_logging as well as a little snooping with gdb. It seems that the segmenter hits the second view and has to segment. The reason is that the pointwise scheduler rejects it because it cannot find a reference tv to use in parallelizeAll. See here: https://github.com/csarofeen/pytorch/blob/devel/third_party/nvfuser/csrc/scheduler/pointwise.cpp#L42
That check fails because DomainMap::isValidReference basically checks that all of the input IterDomains map to it, but those IterDomains all get scrambled by the view operation.
I don't understand why the line after it doesn't get put in the first segment. See the "NOTE" in this copy of Kevin's fusion:
```python
def nvfuser_fusion_id0(fd: FusionDefinition, insert_views: bool = True) -> None:
    T0 = fd.define_tensor(
        symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float
    )
    T1 = T0
    if insert_views:
        T1 = fd.ops.view(T0, original_shape=[8192, 3072], new_shape=[64, 128, 3072])
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T1, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T1, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    # NOTE: this is where the segmenting is happening with views
    T9 = fd.ops.mul(T3, T8)
    T10 = T9
    if insert_views:
        T10 = fd.ops.view(T9, original_shape=[64, 128, 3072], new_shape=[8192, 3072])
    fd.add_output(T10)
```
If the line after it were placed in kernel1, then kernel2 would be a no-op, I think.
Studying how views affect segmentation with this sequential fusion:
```python
def simple_fusion(fd: FusionDefinition) -> None:
    T = fd.define_tensor(
        symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float
    )
    current_shape = [8192, 3072]
    other_shape = [64, 128, 3072]

    def insert_view(_T):
        """Insert a view that swaps between the two shapes above"""
        nonlocal current_shape, other_shape
        new_shape = other_shape
        out = fd.ops.view(_T, original_shape=current_shape, new_shape=new_shape)
        other_shape, current_shape = current_shape, other_shape
        return out

    T = insert_view(T)  # A
    T = fd.ops.add(T, T)
    T = insert_view(T)  # B
    T = fd.ops.mul(T, T)
    T = insert_view(T)  # C
    fd.add_output(T)
```
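This can be driven the same way as the earlier repros; the following is a sketch (assuming `simple_fusion` from above; `repro.py` is just a placeholder file name). Running it with PYTORCH_NVFUSER_DUMP=segmenter_logging prints the resulting segments.

```python
import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

# e.g. run as: PYTORCH_NVFUSER_DUMP=segmenter_logging python repro.py
fs = Fusion()
with FusionDefinition(fs) as fd:
    simple_fusion(fd)  # the definition above

inputs = [torch.randn(8192, 3072, device='cuda')]
for _ in range(5):
    out = fs.execute(inputs)
```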
Running this as-is gives three segments: view+add, view only, mul+view.
- When we comment out A we get 2 segments: add+view+mul, view only
- When we comment out B we get 2 segments: view+add+mul, view only
- When we comment out C we get 2 segments: view+add, view+mul
- When we comment out A+B we get 1 segment: add+mul+view
- When we comment out A+C we get 1 segment: add+view+mul
- When we comment out B+C we get 1 segment: view+add+mul
- When we comment them all out we get 1 segment: add+mul
It's interesting that when we remove A but leave B, we are able to fuse add+view+mul into a single segment; without A, C only adds a "view-only" segment, which could be removed pretty easily since it is a metadata operation. However, since the leading view A is allowed to fuse with subsequent expressions, adding it back in disrupts the add+view+mul fusion. If there were a registered scheduler that tried to detect metadata-only operations on inputs or outputs, then, depending on the implementation, it could give us a single kernel in all of these cases. But the way findSegments is implemented might make that tricky.
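For illustration only, here is a minimal sketch of the kind of metadata-only check such a scheduler (or a pre-segmentation cleanup) might use; the op-name set, the segment representation, and `is_metadata_only_segment` are all hypothetical, not nvFuser's actual API.

```python
# Hypothetical helper: treat a candidate segment as "metadata-only" if every
# expression in it is view-like, so it never needs its own kernel and could be
# folded back into eager mode (or absorbed by a neighboring segment).
META_ONLY_OPS = {"view", "reshape", "permute", "transpose", "squeeze", "unsqueeze"}

def is_metadata_only_segment(op_names):
    """op_names: iterable of the op names making up one candidate segment."""
    op_names = list(op_names)
    return len(op_names) > 0 and all(name in META_ONLY_OPS for name in op_names)

# e.g. the "view only" segment from the experiments above:
assert is_metadata_only_segment(["view"])
assert not is_metadata_only_segment(["view", "add", "mul"])
```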