Driss Guessous
Just to confirm, what else is needed? 1. Land https://github.com/pytorch/builder/pull/2014 in parallel, which updates the Windows builds? 2. Then make sure the wheels are uploaded to org/whl/nvidia-cudnn-cu12/? @atalman I...
@Skylion007 the 2.5 release is being kicked off right now. I am sure that @atalman is pretty busy this week, but let me see if there is anything I can...
Haven't forgotten about this. Talked w/ @htyu and we think the increased SHMEM usage is unexpected and is likely a bug in Triton's allocation analysis; hope to pick back up...
Here is a trace of the exact striding behavior: ## Forward ``` Shell $1: f16[512, 1536] strides=[1, 512] = aten.t.default($0) $3: f16[2048, 512] strides=[512, 1] = aten.view.default($2, [2048, 512]) $4:...
Relevant Backward Striding ``` Shell $28: f16[1, 2048, 512] strides=[1048576, 512, 1] = aten.ones_like.default($27, pin_memory=False) $29: f16[2048, 512] strides=[512, 1] = aten.view.default($28, [2048, 512]) $30: f16[512, 2048] strides=[1, 512] =...
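To make the stride notation in the traces above easier to read, here is a minimal pure-Python sketch of how a transpose view changes strides without moving data. The helper names (`contiguous_strides`, `transpose_2d`, `is_contiguous`) are hypothetical and written only for illustration; they are not PyTorch APIs. The `[1536, 512]` input shape for `$0` is an assumption inferred from the transposed output shown in the forward trace.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for a given shape."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return list(reversed(strides))


def transpose_2d(shape, strides):
    """Swap the two dims of a 2-D view without copying (what a transpose view does)."""
    return [shape[1], shape[0]], [strides[1], strides[0]]


def is_contiguous(shape, strides):
    """Simplified check: strides must equal the row-major strides for this shape.

    Real frameworks also ignore strides of size-1 dims; this sketch does not.
    """
    return list(strides) == contiguous_strides(shape)


# Forward: assumed contiguous input $0 of shape [1536, 512].
w_shape = [1536, 512]
w_strides = contiguous_strides(w_shape)          # [512, 1]
t_shape, t_strides = transpose_2d(w_shape, w_strides)
print(t_shape, t_strides)                        # [512, 1536] [1, 512]
print(is_contiguous(t_shape, t_strides))         # False

# Backward: a contiguous [1, 2048, 512] tensor, as produced by ones_like.
print(contiguous_strides([1, 2048, 512]))        # [1048576, 512, 1]
```

The transposed view keeps the same storage but is no longer row-major, which is the "permuted case" a kernel has to handle either natively or by first making a contiguous copy.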
@ngimel suggests that we disable cuDNN until both the forward and backward ops can handle the permuted case.
@malfet we don't have any H100 runners in CI/CD. This would only show up in E2E, torchbench-like testing, while most of the performance testing was done at the per-op...
I have been testing this on a modified version of nanogpt: https://github.com/drisspg/nanoGPT/pull/1 Trace on Nightly: [CUDNN_ATTENTION_nightly.json](https://github.com/user-attachments/files/17442360/CUDNN_ATTENTION_nightly.json) Trace on https://github.com/pytorch/pytorch/pull/138354 Built against CuDNN 9.4 w/ cuda 12.4 [CUDNN_ATTENTION_dev.json](https://github.com/user-attachments/files/17442362/CUDNN_ATTENTION_dev.json) ### Findings...
@ngimel The contiguous call at 2048 seq-len takes around 150 microseconds in forward and backward, which is around 8-10% of forward SDPA time: https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/drisspg_d785a2ea-6361-4781-a139-ff6a388da9c4_CUDNN_ATTENTION_major.json
Update: I think that something was wrong with my local CuDNN. New settings: CuDNN: 12-9.1.1.17, Cuda-toolkit: 12-4, Sequence Length: 2048. CuDNN On [PR](138354): ` iter 50: loss 2.5125, time 83.60ms,...