# Worse performance than ATen: aten._log_softmax

### 🐛 Describe the bug
The nvFuser path for `aten._log_softmax.default` is slower than ATen. Here are the results compared to ATen (values below 1.0 mean the nvFuser path is slower):
| benchmark | geomean | 20th percentile | 50th percentile | 80th percentile |
|---|---|---|---|---|
| HuggingFace | 0.91 | 0.63 | 0.99 | 1.21 |
| Torchbench | 0.99 | 0.99 | 0.99 | 0.99 |
| TIMM | 0.99 | 0.98 | 0.99 | 1.0 |
Both the ATen and nvFuser paths use CUDA Graphs.
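For reference, here is a minimal sketch of how a single shape can be timed under CUDA Graphs. This is not the harness that produced the table above, just an illustration of graph-captured timing:

```python
import torch

x = torch.randn(512, 50265, device="cuda")  # one of the badly performing shapes

# Warm up on a side stream before capture, as the CUDA Graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = torch.nn.functional.log_softmax(x, dim=1)
torch.cuda.current_stream().wait_stream(s)

# Capture a single log_softmax call into a graph, then time replays.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = torch.nn.functional.log_softmax(x, dim=1)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    g.replay()
end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 100:.3f} ms per replay")
```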
To reproduce, apply this patch first:
```diff
diff --git a/torch/_prims/context.py b/torch/_prims/context.py
index 203d73fd94..1789775e05 100644
--- a/torch/_prims/context.py
+++ b/torch/_prims/context.py
@@ -254,9 +254,9 @@ def _is_func_unsupported_nvfuser(
 class TorchRefsNvfuserCapabilityMode(TorchRefsMode):
     def __init__(self, *, skip_ops=()):
         aten_ops_to_skip = (
-            "aten._log_softmax.default",
-            "aten._log_softmax_backward_data.default",
-            "aten.expand.default",
+            #"aten._log_softmax.default",
+            #"aten._log_softmax_backward_data.default",
+            #"aten.expand.default",
         )
         self.skip_ops = tuple(skip_ops) + aten_ops_to_skip
         super().__init__(
```
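With the skip entries commented out, tracing under `TorchRefsNvfuserCapabilityMode` lowers `aten._log_softmax.default` through the reference decomposition instead of keeping the ATen call. A minimal sketch of that tracing step, assuming the usual `make_fx` flow:

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.context import TorchRefsNvfuserCapabilityMode

def fn(x):
    return torch.nn.functional.log_softmax(x, dim=1)

x = torch.randn(512, 50265, device="cuda")

# With the patch applied, aten._log_softmax.default is no longer in skip_ops,
# so the capability mode intercepts the call and records the reference
# decomposition instead of the opaque ATen op.
with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(fn)(x)

gm.graph.print_tabular()  # inspect the prims the op decomposed into
```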
Then run the benchmark script:

```bash
git clone https://gitlab-master.nvidia.com/iyashchuk/aten_ops_perf.git
cd aten_ops_perf
python aten_ops_perf.py --suite huggingface --dtype float32 --max-samples 100 --op aten._log_softmax.default
```
Check out this gist for the logs: https://gist.github.com/IvanYashchuk/8f433d9512ab1f02a7f960072ba10bb0
Badly performing samples (a standalone timing sketch for a few of these shapes follows the list):
- (512, 50265) dim=1
- (8192, 50265) dim=1
- (1024, 50265) dim=1
- (4096, 50265) dim=1
- (2048, 50265) dim=1
- (511, 30522) dim=1
- (2048, 50005) dim=1
- (256, 256008) dim=1
- (157, 50257) dim=1
- (1024, 50005) dim=1
- (256, 128112) dim=1
- (64, 128) dim=1
- (1024, 50358) dim=1
- (508, 50272) dim=1
- (511, 50257) dim=1
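For a quick standalone check outside the benchmark harness, something like the following can time a few of the worst shapes. This is a hypothetical sketch using `torch.utils.benchmark`, and it only measures the eager ATen path:

```python
import torch
import torch.utils.benchmark as benchmark

# Time eager (ATen) log_softmax on a few of the badly performing shapes.
# Most of these reduce over a very large softmax dimension.
for shape in [(512, 50265), (2048, 50265), (256, 256008)]:
    x = torch.randn(*shape, device="cuda")
    t = benchmark.Timer(
        stmt="torch.nn.functional.log_softmax(x, dim=1)",
        globals={"torch": torch, "x": x},
    )
    print(shape, t.timeit(100))
```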
`_log_softmax` is implemented here: https://github.com/pytorch/pytorch/blob/35be73df094f02dd26562cf665a6158e80bc4045/torch/_decomp/decompositions.py#L988-L1006
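For reference, that decomposition is the numerically stable form of log-softmax: subtract the row max, then subtract the logsumexp. A simplified sketch (the linked code additionally handles dtype promotion, empty inputs, and contiguity; `log_softmax_decomp` is an illustrative name):

```python
import torch
from torch import Tensor

def log_softmax_decomp(x: Tensor, dim: int) -> Tensor:
    # Subtract the max along `dim` for numerical stability.
    x_max = torch.amax(x, dim, keepdim=True)
    shifted = x - x_max
    # log(sum(exp(shifted))) along the same dim, broadcast back.
    shifted_logsumexp = torch.log(torch.sum(torch.exp(shifted), dim, keepdim=True))
    return shifted - shifted_logsumexp

# Should match the eager kernel within default tolerances.
x = torch.randn(512, 50265)
torch.testing.assert_close(log_softmax_decomp(x, 1),
                           torch.nn.functional.log_softmax(x, dim=1))
```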
### Versions
Checked on upstream master.