Kevin Stephano

Results: 30 comments by Kevin Stephano

This looks like the problem where the mask may not be getting handled properly as an input: https://github.com/huggingface/transformers/blob/ebee0a27940adfbb30444d83387b9ea0f1173f40/src/transformers/models/bart/modeling_bart.py#L96-L99
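For context, a minimal sketch of what that mask-expansion step does (my paraphrase of the linked `_expand_mask`-style helper, not the exact code): if a tracer constant-folds `mask` instead of treating it as a graph input, the compiled kernel bakes in one particular batch's mask.

```python
import torch

def expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: int = None):
    # Broadcast a 2D attention mask [bsz, src_len] to [bsz, 1, tgt_len, src_len]
    # and invert it so masked positions become a large negative value.
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inverted = 1.0 - expanded
    return inverted.masked_fill(inverted.to(torch.bool), torch.finfo(dtype).min)
```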

The diff patch version of Ivan's fix:

```diff
diff --git a/torchdynamo/optimizations/training.py b/torchdynamo/optimizations/training.py
index dca9202..1f2178b 100644
--- a/torchdynamo/optimizations/training.py
+++ b/torchdynamo/optimizations/training.py
@@ -356,6 +356,17 @@ def prims_executor(gm, inputs, *, executor, num_fixed=0):
     from...
```

Need to add one more op:

```python
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target in [
        torch.ops.aten.arange.default,
        torch.ops.aten.arange.start_step,
        torch.ops.aten.full.default,
    ]:
        new_kwargs = dict(node.kwargs)
        if new_kwargs.get("device", False) and...
```
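Putting the patch and the extra op together, a hedged sketch of the whole pass (the helper name is mine, and since the kwargs handling is truncated above, the normalization to `torch.device` is an assumption):

```python
import torch

def normalize_factory_devices(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Factory ops that carry a `device` kwarg in the FX graph get that kwarg
    # normalized to a torch.device before the graph goes to the prims executor.
    factory_ops = [
        torch.ops.aten.arange.default,
        torch.ops.aten.arange.start_step,
        torch.ops.aten.full.default,
    ]
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in factory_ops:
            new_kwargs = dict(node.kwargs)
            device = new_kwargs.get("device", False)
            if device and not isinstance(device, torch.device):
                # Assumed fix: replace a string device spec with torch.device.
                new_kwargs["device"] = torch.device(device)
                node.kwargs = new_kwargs
    gm.recompile()
    return gm
```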

A couple of things to try:

1. Remove the top `view` from the Fusion.
2. Remove `fd.add_output(T11)`; it's not clear to me that saving this tensor is necessary.

Another example to think about:

```python
import torch
from torch._C._nvfuser import FusionDefinition, Fusion, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Half)
    T1 =...
```
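Since that definition is cut off, here is a self-contained example in the same style (the op sequence is illustrative, not the original repro, and assumes the `torch._C._nvfuser` frontend of this era):

```python
import torch
from torch._C._nvfuser import FusionDefinition, Fusion, DataType

def nvfuser_fusion_example(fd: FusionDefinition) -> None:
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Half)
    T1 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Half)
    T2 = fd.ops.cast(T0, dtype=DataType.Float)
    T3 = fd.ops.cast(T1, dtype=DataType.Float)
    T4 = fd.ops.add(T2, T3)
    T5 = fd.ops.sub(T4, T3)  # a trailing subtract like the one discussed below
    T6 = fd.ops.cast(T5, dtype=DataType.Half)
    fd.add_output(T6)

fusion = Fusion()
with FusionDefinition(fusion) as fd:
    nvfuser_fusion_example(fd)

inputs = [torch.randn(1024, 1024, device="cuda", dtype=torch.half) for _ in range(2)]
outputs = fusion.execute(inputs)
```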

Omitting the final subtract seems to reduce the registers needed from 87 to 50, which is a large enough drop to change the kernel time from 3.5 ms to...

I meant to show that this version has much better performance without the `sub` at the end of the fusion:

```python
import torch
from torch._C._nvfuser import FusionDefinition, Fusion, DataType

def...
```
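For numbers like the 3.5 ms above, a simple CUDA-event harness (my own sketch, not from this thread) can compare the two variants; `fusion.execute` is the same entry point used in the examples above.

```python
import torch

def time_fusion(fusion, inputs, iters=100):
    # Average wall time per execution in milliseconds, measured with CUDA events.
    for _ in range(10):                      # warm-up / compilation
        fusion.execute(inputs)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fusion.execute(inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```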

We don't have the ability to fuse the loss function without supporting `gather`.
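To make the dependence concrete: per-sample NLL is an index-select of each row's target log-probability, which lowers to `gather` along the class dimension.

```python
import torch

log_probs = torch.randn(4, 10).log_softmax(dim=1)
target = torch.randint(0, 10, (4,))
# Pick out log_probs[i, target[i]] for every i -- this is the gather.
loss = -log_probs.gather(1, target.unsqueeze(1)).squeeze(1)
```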

```python
@register_decomposition(aten.nll_loss_forward)
def nll_loss_forward(
    self: Tensor,
    target: Tensor,
    weight: Optional[Tensor],
    reduction: int,
    ignore_index: int,
) -> Tuple[Tensor, Tensor]:
    assert self.dim() > 0 and self.dim() <= 2
    ...
    result = torch.where(target !=...
```
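Since the decomposition above is truncated, here is a simplified, self-contained sketch of the same idea, restricted to a 2D input of log-probabilities and a 1D target (my reconstruction, not the exact upstream code):

```python
import torch
from torch import Tensor
from typing import Optional, Tuple

def nll_loss_forward_sketch(
    self: Tensor,
    target: Tensor,
    weight: Optional[Tensor],
    reduction: int,      # 0 = none, 1 = mean, 2 = sum
    ignore_index: int,
) -> Tuple[Tensor, Tensor]:
    if weight is not None:
        self = self * weight.unsqueeze(0)
    # Clamp ignored targets to a valid index so gather stays in bounds,
    # then zero out those entries afterwards.
    safe_target = torch.where(target != ignore_index, target, torch.zeros_like(target))
    result = -torch.gather(self, 1, safe_target.unsqueeze(1)).squeeze(1)
    result = torch.where(target != ignore_index, result, torch.zeros_like(result))
    if weight is not None:
        w = weight[safe_target]
        w = torch.where(target != ignore_index, w, torch.zeros_like(w))
        total_weight = w.sum()
    else:
        total_weight = (target != ignore_index).sum().to(self.dtype)
    if reduction == 0:                       # none
        return result, total_weight.new_zeros(())
    if reduction == 1:                       # mean
        return result.sum() / total_weight, total_weight
    return result.sum(), total_weight        # sum
```

For the mean reduction with no weight, `nll_loss_forward_sketch(x.log_softmax(1), t, None, 1, -100)` should match `torch.nn.functional.nll_loss(x.log_softmax(1), t)`.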