pytorch/pytorch
Worse performance than ATen: aten._log_softmax_backward_data
🐛 Describe the bug
The nvFuser path for `aten._log_softmax_backward_data.default` performs worse than the ATen implementation. Here are the results compared to ATen:
| benchmark | geomean | 20th percentile | 50th percentile | 80th percentile |
|---|---|---|---|---|
| HuggingFace | 0.94 | 0.87 | 0.97 | 0.99 |
| Torchbench | 0.98 | 0.98 | 0.98 | 0.98 |
| TIMM | 0.99 | 0.98 | 0.99 | 0.99 |
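For reference, summary statistics like the ones in the table can be computed from per-sample speedup ratios. The ratios below are made up for illustration; the helper names are mine, not part of the benchmark script:

```python
import math

# Hypothetical per-sample speedup ratios (nvFuser vs. ATen); illustrative only.
speedups = [0.85, 0.90, 0.95, 0.97, 0.99, 1.00]

def geomean(xs):
    # Geometric mean: exp of the mean of the logs.
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def percentile(xs, p):
    # Simple nearest-rank percentile on sorted data.
    xs = sorted(xs)
    k = max(0, math.ceil(p / 100 * len(xs)) - 1)
    return xs[k]

print(round(geomean(speedups), 2))
print(percentile(speeds := speedups, 50))
```

A geomean below 1.0 summarizes an overall slowdown relative to ATen across the sampled workloads.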
Both the ATen and nvFuser paths use CUDA Graphs.
To reproduce, first apply this patch (it removes the ops from nvFuser's skip list):
```diff
diff --git a/torch/_prims/context.py b/torch/_prims/context.py
index 203d73fd94..1789775e05 100644
--- a/torch/_prims/context.py
+++ b/torch/_prims/context.py
@@ -254,9 +254,9 @@ def _is_func_unsupported_nvfuser(
 class TorchRefsNvfuserCapabilityMode(TorchRefsMode):
     def __init__(self, *, skip_ops=()):
         aten_ops_to_skip = (
-            "aten._log_softmax.default",
-            "aten._log_softmax_backward_data.default",
-            "aten.expand.default",
+            #"aten._log_softmax.default",
+            #"aten._log_softmax_backward_data.default",
+            #"aten.expand.default",
         )
         self.skip_ops = tuple(skip_ops) + aten_ops_to_skip
         super().__init__(
```
Then run the benchmark script:

```shell
git clone https://gitlab-master.nvidia.com/iyashchuk/aten_ops_perf.git
cd aten_ops_perf
python aten_ops_perf.py --suite huggingface --dtype float32 --max-samples 100 --op aten._log_softmax_backward_data.default
```
Check out this gist for the logs: https://gist.github.com/IvanYashchuk/8f433d9512ab1f02a7f960072ba10bb0#file-issue_log_softmax_backward-md
The decomposition of `_log_softmax_backward_data` is implemented here: https://github.com/pytorch/pytorch/blob/35be73df094f02dd26562cf665a6158e80bc4045/torch/_decomp/decompositions.py#L702-L710
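For context, the decomposition linked above boils down to `grad_input = grad_output - exp(output) * sum(grad_output, dim)`, where `output` is the forward log-softmax result. A minimal 1-D sketch of that formula (my own paraphrase in plain Python, not the actual PyTorch decomposition code):

```python
import math

def log_softmax(x):
    # Numerically stable log-softmax over a 1-D list.
    m = max(x)
    lse = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - lse for v in x]

def log_softmax_backward(grad_output, output):
    # Backward formula used by the decomposition (1-D case):
    #   grad_input = grad_output - exp(output) * sum(grad_output)
    # exp(output) recovers the softmax probabilities from the saved
    # log-softmax output, so no extra normalization pass is needed.
    s = sum(grad_output)
    return [g - math.exp(o) * s for g, o in zip(grad_output, output)]

x = [1.0, 2.0, 3.0]
y = log_softmax(x)
grad = log_softmax_backward([1.0, 0.0, 0.0], y)
```

In ATen this backward is a hand-written fused kernel, while the nvFuser path compiles the decomposed ops, which is one plausible source of the gap reported above.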
Versions
Checked on upstream master.