cudnn-frontend Extremely slow fp8 conv2d wgrad operation

Open jimgao1 opened this issue 5 months ago • 4 comments

Describe the bug

fp8 e4m3 wgrad seems to be extremely slow compared to both fp32 and fp16, often 50x to 100x slower.

I have attached the profiling results in this Google spreadsheet.

I have tested a variety of problem sizes. For each size I measured fp16 wgrad and fp8 wgrad in several variants that differ in the IO, intermediate, and compute data types.
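To compare timings across problem sizes and data types, it can help to normalize each measured time by the arithmetic the wgrad actually performs. The following is a minimal sketch under stated assumptions: the names `ConvProblem`, `out_dim`, `wgrad_flops`, and `tflops` are hypothetical helpers and not part of cudnn-frontend or the benchmarking script. It assumes an ungrouped 2D convolution with symmetric padding, and uses the standard count of 2 * N * K * C * R * S * P * Q floating-point operations (one multiply and one add per filter tap and output element).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one conv2d problem (not a cudnn-frontend type).
struct ConvProblem {
    int64_t n, c, h, w;             // input: batch, channels, height, width
    int64_t k, r, s;                // filter: output channels, kernel height, kernel width
    int64_t pad, stride, dilation;  // assumed identical in both spatial dimensions
};

// Output extent along one spatial dimension for a standard convolution.
int64_t out_dim(int64_t in, int64_t pad, int64_t filt, int64_t stride, int64_t dilation) {
    assert(stride > 0 && dilation > 0);
    return (in + 2 * pad - dilation * (filt - 1) - 1) / stride + 1;
}

// Multiply-add count of the weight gradient, counted as 2 FLOPs each.
double wgrad_flops(const ConvProblem& p) {
    const int64_t out_h = out_dim(p.h, p.pad, p.r, p.stride, p.dilation);
    const int64_t out_w = out_dim(p.w, p.pad, p.s, p.stride, p.dilation);
    return 2.0 * static_cast<double>(p.n) * p.k * p.c * p.r * p.s * out_h * out_w;
}

// Effective throughput in TFLOP/s for a kernel time given in milliseconds.
double tflops(double flops, double milliseconds) {
    return flops / (milliseconds * 1e-3) / 1e12;
}
```

Passing `wgrad_flops(p)` and the measured time of each variant to `tflops` gives one throughput figure per data type, so a slow fp8 kernel shows up as a low absolute number rather than only as a ratio against fp16.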

Expected behavior

We expect fp8 wgrad operators to be at least as fast as (if not faster than) their fp16 and fp32 counterparts.

System Environment (please complete the following information):

cudnn_frontend version: v1.6.1 (commit 2533f5e5c1877fd76266133c1479ef1643ce3a8b)
cudnn_backend version: v9.3.0
GPU arch: H100
cuda runtime version: 12.2
cuda driver version: 535.161.08
host compiler: g++
OS: Ubuntu 22.04.4 LTS

API logs

Both frontend and backend logs are attached in this gist.

To Reproduce

Compile and run the benchmarking script. The command I used to compile is:

/usr/local/cuda/bin/nvcc -I/home/ybgao/third_party/cudnn-frontend/include -std=c++20 -gencode=arch=compute_90,code=sm_90 -lcudnn -o main main.cu

Additional context

This issue references this post on the NVIDIA forums.

Aug 27 '24 18:08 jimgao1
