
bookend `view` should be stripped from fusion

Open jjsjann123 opened this issue 3 years ago • 6 comments

🚀 The feature, motivation and pitch

Currently the handling of view in the scheduler is sub-optimal.

For view ops inside a fusion group that connect other fused ops, fusing them makes sense, since this usually gives us a bigger graph and can benefit fusion. However, for view ops that sit on the boundary of a fusion, there's really no good reason to fuse them. Since a view op is only a metadata operation, leaving it outside of the fusion could give us better codegen perf for now.
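The "metadata-only" nature of a view can be illustrated with a small numpy stand-in (numpy here is just a sketch; the fusions below use nvFuser's Python frontend): reshaping a contiguous buffer returns an alias with new shape/strides and moves no data.

```python
import numpy as np

# A view of a contiguous buffer only rewrites shape/stride metadata;
# the output aliases the input's storage, so no kernel launch is needed.
x = np.zeros((768, 128, 64), dtype=np.float32)
y = x.reshape(64, 12, 128, 64)   # bookend-style view
print(np.shares_memory(x, y))    # True: no data was copied
```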

A few concrete issues were raised by @kevinstephano:

  1. Here’s another example of a fusion we just might not want to send to nvFuser: this is a transpose after Matmul2 and before the Output Linear of multi-head attention, and it results in 2 kernels. The 2 kernels take ~50 us, whereas the single transpose kernel in eager mode takes ~40 us.
def nvfuser_fusion_id7(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1, -1], contiguous=[True, True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[768, 128, 64], new_shape=[64, 12, 128, 64])
    T2 = fd.ops.permute(T1, dims=[0, 2, 1, 3])
    T3 = fd.ops.set(T2)
    T4 = fd.ops.view(T3, original_shape=[64, 128, 12, 64], new_shape=[64, 128, 768])
    T5 = fd.ops.view(T4, original_shape=[64, 128, 768], new_shape=[8192, 768])
    fd.add_output(T5)

Arguments for fusion8:
Inputs:
tensor dtype: float sizes: (768, 128, 64, ) stride: (8192, 64, 1, ) pointer: 0x7f53a6000000
Outputs:
Launch Parameters: BlockDim.x = 128, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = -1, GridDim.y = -1, GridDim.z = -1, Smem Size = 0
  2. In the Feed-Forward block, Gelu is getting decomposed and re-fused, and it is also pulling in the bookend views. The result is a kernel that runs at 0.62x the speed of the eager-mode Gelu kernel: 191 us vs 118 us.
def nvfuser_fusion_id10(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[8192, 3072], new_shape=[64, 128, 3072])
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T1, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T1, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    T9 = fd.ops.mul(T3, T8)
    T10 = fd.ops.view(T9, original_shape=[64, 128, 3072], new_shape=[8192, 3072])
    fd.add_output(T1)
    fd.add_output(T10)

Arguments for fusion12:
Inputs:
tensor dtype: float sizes: (8192, 3072, ) stride: (3072, 1, ) pointer: 0x7f5148800000
Outputs:
Launch Parameters: BlockDim.x = 128, BlockDim.y = -1, BlockDim.z = -1, GridDim.x = -1, GridDim.y = -1, GridDim.z = -1, Smem Size = 0

The current plan is to add a quick pass that strips view-like ops on the boundary of a fusion. We'll see how that impacts perf.
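A minimal sketch of what such a bookend-stripping pass could look like, on a toy op list. The names (`Op`, `VIEW_LIKE`, `strip_bookend_views`) are illustrative only, not nvFuser's actual IR or pass API: peel view-like ops off the head and tail of the fusion so they can run as eager metadata ops.

```python
from dataclasses import dataclass

# Hypothetical set of "metadata-only" ops safe to leave outside a fusion.
VIEW_LIKE = {"view", "squeeze", "unsqueeze", "permute"}

@dataclass
class Op:
    name: str

def strip_bookend_views(ops):
    """Split ops into (leading_views, core, trailing_views)."""
    lo, hi = 0, len(ops)
    while lo < hi and ops[lo].name in VIEW_LIKE:
        lo += 1
    while hi > lo and ops[hi - 1].name in VIEW_LIKE:
        hi -= 1
    return ops[:lo], ops[lo:hi], ops[hi:]

# Shaped like the Gelu fusion above: view bookends around pointwise math.
ops = [Op("view"), Op("mul"), Op("erf"), Op("add"), Op("mul"), Op("view")]
head, core, tail = strip_bookend_views(ops)
print([o.name for o in core])  # ['mul', 'erf', 'add', 'mul']
```

Only `core` would be sent to nvFuser; `head` and `tail` stay as eager metadata ops.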

Alternatives

No response

Additional context

No response

jjsjann123 avatar Oct 17 '22 23:10 jjsjann123

I have a PR enabling this, along with some cleanup of Ivan's refactoring of the getitem special handling. #2099

jjsjann123 avatar Oct 18 '22 10:10 jjsjann123

Repro of permute based fusion:

import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1, -1], contiguous=[True, True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[768, 128, 64], new_shape=[64, 12, 128, 64])
    T2 = fd.ops.permute(T1, dims=[0, 2, 1, 3])
    T3 = fd.ops.set(T2)
    T4 = fd.ops.view(T3, original_shape=[64, 128, 12, 64], new_shape=[64, 128, 768])
    T5 = fd.ops.view(T4, original_shape=[64, 128, 768], new_shape=[8192, 768])
    fd.add_output(T5)

fs = Fusion()
with FusionDefinition(fs) as fd :
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn(768, 128, 64, device='cuda')
]

for _ in range(5):
    out = fs.execute(inputs)

This is a more complicated case: a transpose with bookend views. The resulting fusion is segmented into two kernels.
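For reference, an eager-mode equivalent of this fusion, sketched here with numpy in place of the torch tensor ops: only the permute/transpose step needs to move data; every view is metadata-only. This is why eager mode runs the whole sequence as a single transpose kernel.

```python
import numpy as np

# numpy stand-in for the eager path of the fusion above.
x = np.zeros((768, 128, 64), dtype=np.float32)
t1 = x.reshape(64, 12, 128, 64)                       # view: metadata only
t2 = np.ascontiguousarray(t1.transpose(0, 2, 1, 3))   # the one real data copy
t5 = t2.reshape(64, 128, 768).reshape(8192, 768)      # views: metadata only
print(t5.shape)  # (8192, 768)
```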

kevinstephano avatar Dec 13 '22 04:12 kevinstephano

Here is a second example that should be easier, as the second view undoes the first; I would start here. This is a Gelu operation from the feed-forward network of a Transformer.

import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float)
    T1 = fd.ops.view(T0, original_shape=[8192, 3072], new_shape=[64, 128, 3072])
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T1, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T1, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    T9 = fd.ops.mul(T3, T8)
    T10 = fd.ops.view(T9, original_shape=[64, 128, 3072], new_shape=[8192, 3072])
    fd.add_output(T10)

fs = Fusion()
with FusionDefinition(fs) as fd :
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn(8192, 3072, device='cuda')
]

for _ in range(5):
    out = fs.execute(inputs)

Perf:

 1534646325         196768     334  49152     1     1   128     1     1       28         0.000         0.000                                                     NVIDIA A100 80GB PCIe (0)    1     7  CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)2>, CudaCodeGen::Tensor<float, (int)3>, CudaCo…
 1534844117         186913     344      6  8192     1   128     1     1       16         0.000         0.000                                                     NVIDIA A100 80GB PCIe (0)    1     7  CudaCodeGen::kernel2(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>, CudaCo…

A100 Kernel Time = 197us + 187us = 384us

Repro no views:

import torch
from torch._C._nvfuser import Fusion, FusionDefinition, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1, -1], contiguous=[True, True, True], dtype=DataType.Float)
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T0, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T0, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    T9 = fd.ops.mul(T3, T8)
    fd.add_output(T9)


fs = Fusion()
with FusionDefinition(fs) as fd :
    nvfuser_fusion_id0(fd)

inputs = [
    torch.randn(64, 128, 3072, device='cuda')
]

for _ in range(5):
    out = fs.execute(inputs)

Perf:

1326379136         127169     226  49152     1     1   128     1     1       29         0.000         0.000                                                     NVIDIA A100 80GB PCIe (0)    1     7  CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>) 

A100 Kernel time = 127us.
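The two perf dumps above quantify the cost of fusing the bookend views:

```python
# Kernel times from the nsys dumps above (microseconds).
with_views_us = 197 + 187   # kernel1 + kernel2 of the segmented fusion
no_views_us = 127           # single kernel with views stripped
print(with_views_us, round(with_views_us / no_views_us, 2))  # 384 3.02
```

So the view bookends make this fusion roughly 3x slower end to end.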

Note: I think Christian made an effort to fix the splitting of views into two kernels, but the perf did not match the no-view perf. I believe this was his code branch:

https://github.com/csarofeen/pytorch/commits/view_reduction

This PR shows the differences between the branch and the devel branch so you can see what was changed. https://github.com/csarofeen/pytorch/pull/2039

kevinstephano avatar Dec 13 '22 04:12 kevinstephano

I looked at this example using PYTORCH_NVFUSER_DUMP=segmenter_logging as well as a little snooping with gdb. It seems that the segmenter hits the second view and has to segment. The reason is that the pointwise scheduler rejects it, due to not being able to find a reference tv to use in parallelizeAll. See here: https://github.com/csarofeen/pytorch/blob/devel/third_party/nvfuser/csrc/scheduler/pointwise.cpp#L42

That check fails because DomainMap::isValidReference essentially checks that all of the input IterDomains map to the candidate reference tensor, but those IterDomains all get scrambled by the view operation.

jacobhinkle avatar Jan 19 '23 21:01 jacobhinkle

I don't understand why the line after the NOTE doesn't get put in the first segment. See the "NOTE" in this copy of Kevin's fusion:

def nvfuser_fusion_id0(fd: FusionDefinition, insert_views: bool = True) -> None:
    T0 = fd.define_tensor(
        symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float
    )
    T1 = T0
    if insert_views:
        T1 = fd.ops.view(T0, original_shape=[8192, 3072], new_shape=[64, 128, 3072])
    S2 = fd.define_constant(0.500000)
    T3 = fd.ops.mul(T1, S2)
    S4 = fd.define_constant(0.707107)
    T5 = fd.ops.mul(T1, S4)
    T6 = fd.ops.erf(T5)
    S7 = fd.define_constant(1.00000)
    T8 = fd.ops.add(T6, S7)
    # NOTE: this is where the segmenting is happening with views
    T9 = fd.ops.mul(T3, T8)
    T10 = T9
    if insert_views:
        T10 = fd.ops.view(T9, original_shape=[64, 128, 3072], new_shape=[8192, 3072])
    fd.add_output(T10)

If the line after the NOTE were placed in kernel1, then kernel2 would be a no-op, I think.
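A quick numpy stand-in makes the "no-op" point concrete: the trailing view exactly undoes the leading one, so after the round trip the result aliases the input with its original shape, and a segment containing only that trailing view would do no real work.

```python
import numpy as np

x = np.zeros((8192, 3072), dtype=np.float32)
y = x.reshape(64, 128, 3072)   # leading bookend view
z = y.reshape(8192, 3072)      # trailing view restores the original shape
print(np.shares_memory(x, z), z.shape == x.shape)  # True True
```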

jacobhinkle avatar Jan 19 '23 21:01 jacobhinkle

Studying how views affect segmentation with this sequential fusion:

def simple_fusion(fd: FusionDefinition) -> None:
    T = fd.define_tensor(
        symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Float
    )
    current_shape = [8192, 3072]
    other_shape = [64, 128, 3072]

    def insert_view(_T):
        """Insert a view that swaps between the two shapes above"""
        nonlocal current_shape, other_shape
        new_shape = other_shape
        out = fd.ops.view(_T, original_shape=current_shape, new_shape=new_shape)
        other_shape, current_shape = current_shape, other_shape
        return out

    T = insert_view(T)     # A
    T = fd.ops.add(T, T)
    T = insert_view(T)     # B
    T = fd.ops.mul(T, T)
    T = insert_view(T)     # C

    fd.add_output(T)

Running this as is gives three segments: view+add, view only, mul+view.

When we comment out A we get 2 segments: add+view+mul, view only

When we comment out B we get 2 segments: view+add+mul, view only

When we comment out C we get 2 segments: view+add, view+mul

When we comment out A+B we get 1 segment: add+mul+view

When we comment out A+C we get 1 segment: add+view+mul

When we comment out B+C we get 1 segment: view+add+mul

When we comment them all out we get 1 segment: add+mul

It's interesting that when removing A but leaving B, we are able to fuse add+view+mul into a single segment; without A, C only adds a "view-only segment", which could be removed pretty easily since it's a metadata operation. However, since the leading view A is allowed to fuse with subsequent expressions, adding it back disrupts the add+view+mul fusion. If there were a registered scheduler that tried to detect metadata-only operations on inputs or outputs then, depending on implementation, it could give us a single kernel in all of these cases. But the way findSegments is implemented might make that tricky.

jacobhinkle avatar Jan 20 '23 16:01 jacobhinkle