Kevin Stephano

Results: 30 comments by Kevin Stephano

This looks like the problem where the mask may not be getting handled properly as an input: https://github.com/huggingface/transformers/blob/ebee0a27940adfbb30444d83387b9ea0f1173f40/src/transformers/models/bart/modeling_bart.py#L96-L99
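For context, a minimal sketch of what that mask-expansion step does (my paraphrase of the linked `_expand_mask`-style helper, not the exact code): if a tracer constant-folds `mask` instead of treating it as a graph input, the compiled kernel bakes in one particular batch's mask.

```python
import torch

def expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: int = None):
    # Broadcast a 2D attention mask [bsz, src_len] to [bsz, 1, tgt_len, src_len]
    # and invert it so masked positions become a large negative value.
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len
    expanded = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
    inverted = 1.0 - expanded
    return inverted.masked_fill(inverted.to(torch.bool), torch.finfo(dtype).min)
```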

The diff patch version of Ivan's fix:

```diff
diff --git a/torchdynamo/optimizations/training.py b/torchdynamo/optimizations/training.py
index dca9202..1f2178b 100644
--- a/torchdynamo/optimizations/training.py
+++ b/torchdynamo/optimizations/training.py
@@ -356,6 +356,17 @@ def prims_executor(gm, inputs, *, executor, num_fixed=0):
     from...
```

Need to add one more op:

```python
for node in gm.graph.nodes:
    if node.op == "call_function" and node.target in [
        torch.ops.aten.arange.default,
        torch.ops.aten.arange.start_step,
        torch.ops.aten.full.default,
    ]:
        new_kwargs = dict(node.kwargs)
        if new_kwargs.get("device", False) and...
```
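Putting the patch and the extra op together, a hedged sketch of the whole pass (the helper name is mine, and since the kwargs handling is truncated above, the normalization to `torch.device` is an assumption):

```python
import torch

def normalize_factory_devices(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Factory ops that carry a `device` kwarg in the FX graph get that kwarg
    # normalized to a torch.device before the graph goes to the prims executor.
    factory_ops = [
        torch.ops.aten.arange.default,
        torch.ops.aten.arange.start_step,
        torch.ops.aten.full.default,
    ]
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in factory_ops:
            new_kwargs = dict(node.kwargs)
            device = new_kwargs.get("device", False)
            if device and not isinstance(device, torch.device):
                # Assumed fix: replace a string device spec with torch.device.
                new_kwargs["device"] = torch.device(device)
                node.kwargs = new_kwargs
    gm.recompile()
    return gm
```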

A couple of things to try:

1. Remove the top `view` from the Fusion.
2. Remove `fd.add_output(T11)`; it's not clear to me that saving this tensor is necessary.

Another example to think about:

```python
import torch
from torch._C._nvfuser import FusionDefinition, Fusion, DataType

def nvfuser_fusion_id0(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Half)
    T1 =...
```
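Since that definition is cut off, here is a self-contained example in the same style (the op sequence is illustrative, not the original repro, and assumes the `torch._C._nvfuser` frontend of this era):

```python
import torch
from torch._C._nvfuser import FusionDefinition, Fusion, DataType

def nvfuser_fusion_example(fd: FusionDefinition) -> None:
    T0 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Half)
    T1 = fd.define_tensor(symbolic_sizes=[-1, -1], contiguous=[True, True], dtype=DataType.Half)
    T2 = fd.ops.cast(T0, dtype=DataType.Float)
    T3 = fd.ops.cast(T1, dtype=DataType.Float)
    T4 = fd.ops.add(T2, T3)
    T5 = fd.ops.sub(T4, T3)  # a trailing subtract like the one discussed below
    T6 = fd.ops.cast(T5, dtype=DataType.Half)
    fd.add_output(T6)

fusion = Fusion()
with FusionDefinition(fusion) as fd:
    nvfuser_fusion_example(fd)

inputs = [torch.randn(1024, 1024, device="cuda", dtype=torch.half) for _ in range(2)]
outputs = fusion.execute(inputs)
```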

Omitting the final subtract seems to reduce the registers needed from 87 to 50, which is a large enough drop to change the kernel time from 3.5 ms to...

I meant to show that this version has much better performance without the `sub` at the end of the fusion:

```python
import torch
from torch._C._nvfuser import FusionDefinition, Fusion, DataType

def...
```
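For numbers like the 3.5 ms above, a simple CUDA-event harness (my own sketch, not from this thread) can compare the two variants; `fusion.execute` is the same entry point used in the examples above.

```python
import torch

def time_fusion(fusion, inputs, iters=100):
    # Average wall time per execution in milliseconds, measured with CUDA events.
    for _ in range(10):                      # warm-up / compilation
        fusion.execute(inputs)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fusion.execute(inputs)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```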

We don't have the ability to fuse the loss function without supporting `gather`.
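To make the dependence concrete: per-sample NLL is an index-select of each row's target log-probability, which lowers to `gather` along the class dimension.

```python
import torch

log_probs = torch.randn(4, 10).log_softmax(dim=1)
target = torch.randint(0, 10, (4,))
# Pick out log_probs[i, target[i]] for every i -- this is the gather.
loss = -log_probs.gather(1, target.unsqueeze(1)).squeeze(1)
```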

```python
@register_decomposition(aten.nll_loss_forward)
def nll_loss_forward(
    self: Tensor,
    target: Tensor,
    weight: Optional[Tensor],
    reduction: int,
    ignore_index: int,
) -> Tuple[Tensor, Tensor]:
    assert self.dim() > 0 and self.dim() <= 2
    ...
    result = torch.where(target !=...
```
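Since the decomposition above is truncated, here is a simplified, self-contained sketch of the same idea, restricted to a 2D input of log-probabilities and a 1D target (my reconstruction, not the exact upstream code):

```python
import torch
from torch import Tensor
from typing import Optional, Tuple

def nll_loss_forward_sketch(
    self: Tensor,
    target: Tensor,
    weight: Optional[Tensor],
    reduction: int,      # 0 = none, 1 = mean, 2 = sum
    ignore_index: int,
) -> Tuple[Tensor, Tensor]:
    if weight is not None:
        self = self * weight.unsqueeze(0)
    # Clamp ignored targets to a valid index so gather stays in bounds,
    # then zero out those entries afterwards.
    safe_target = torch.where(target != ignore_index, target, torch.zeros_like(target))
    result = -torch.gather(self, 1, safe_target.unsqueeze(1)).squeeze(1)
    result = torch.where(target != ignore_index, result, torch.zeros_like(result))
    if weight is not None:
        w = weight[safe_target]
        w = torch.where(target != ignore_index, w, torch.zeros_like(w))
        total_weight = w.sum()
    else:
        total_weight = (target != ignore_index).sum().to(self.dtype)
    if reduction == 0:                       # none
        return result, total_weight.new_zeros(())
    if reduction == 1:                       # mean
        return result.sum() / total_weight, total_weight
    return result.sum(), total_weight        # sum
```

For the mean reduction with no weight, `nll_loss_forward_sketch(x.log_softmax(1), t, None, 1, -100)` should match `torch.nn.functional.nll_loss(x.log_softmax(1), t)`.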