cudnn-frontend
cudnn_frontend provides a C++ wrapper for the cuDNN backend API, along with samples showing how to use it.
The wheels shipped on PyPI.org are missing the include files, which drastically reduces the usefulness of the package. CC: @ksivaman @ptrendx
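One way to check a report like this is to list the header files inside a downloaded wheel, since a wheel is an ordinary zip archive. The following is a minimal sketch, not part of cudnn-frontend: the helper name `list_headers_in_wheel` and the set of header suffixes are assumptions made for illustration.

```python
import zipfile

# Assumed set of header extensions; adjust for the package being inspected.
HEADER_SUFFIXES = (".h", ".hpp", ".cuh")


def list_headers_in_wheel(wheel_path):
    """Return the sorted archive names of header files found in a wheel.

    Hypothetical helper: a wheel is a zip archive, so this only reads the
    archive's file listing and does not install or import anything.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            if name.lower().endswith(HEADER_SUFFIXES)
        )
```

Calling the helper on a locally downloaded wheel file and getting an empty list back would be consistent with the include files being absent from that wheel.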
Hi all, I am building the graph shown in the image. The [documentation](https://docs.nvidia.com/deeplearning/cudnn/latest/developer/graph-api.html#single-operation) suggests this graph is supported, but I get a segfault at `get_uid` when calling `graph->execute`,...
Hi, I tried running an sdpa_fp8 graph where seqlen_q and seqlen_k are different. However, it seems that only seqlen_q is used, as performance is the same when I...
**Describe the bug** fp8 e4m3 wgrad seems to be extremely slow compared to both FP32 and FP16, often 50x to 100x slower. I have attached the profiling results in [this...
**Describe the bug** Consider a graph with more than MAX_OPGRAPH_OPS nodes, for example in this code:
```cpp
#include "cudnn-frontend/include/cudnn_frontend.h"

namespace fe = cudnn_frontend;

int main() {
    cudnnHandle_t handle;
    assert(cudnnCreate(&handle) ==...
```
**Describe the bug** I run the following code:
```cpp
#include "cudnn-frontend/include/cudnn_frontend.h"

namespace fe = cudnn_frontend;

int main() {
    cudnnHandle_t handle;
    assert(cudnnCreate(&handle) == CUDNN_STATUS_SUCCESS);
    std::vector A = {1.0, 2.0, 3.0, 4.0};...
```
I keep having issues when compiling apps that require CUDA and C++ tools on Windows. I would like to learn the best version for CUDA 11.8 and CUDA 12.4. There are...
Hello all, I'm currently working with convolutional layers in `cudnn - python`, and I have a couple of questions regarding the convolution algorithm selection and the setting of group numbers....
I've noticed that when using PyTorch's custom autograd functions, the stride of `dO` can sometimes be `(0, 0, 0, 0)`. Here's a very simple example: https://discuss.pytorch.org/t/getting-unusual-strides-when-using-pytorchs-autograd/208093. In my custom wrapper...
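A stride of 0 along a dimension with more than one element means every index along that dimension reads the same memory, as with a broadcast view. The sketch below shows one way a wrapper could detect that case and compute the packed row-major strides it might fall back to. It uses plain tuples rather than tensors, and both helper names are hypothetical rather than part of PyTorch or cudnn-frontend.

```python
def contiguous_strides(shape):
    """Return row-major (packed) strides, in elements, for a shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        # Treat size-0 dimensions as size 1 so strides never collapse to 0.
        running *= max(dim, 1)
    return tuple(reversed(strides))


def has_broadcast_stride(shape, strides):
    """True if any dimension with more than one element has stride 0."""
    return any(dim > 1 and stride == 0 for dim, stride in zip(shape, strides))


# Illustrative shape only; not taken from the issue.
shape = (2, 3, 4, 5)
if has_broadcast_stride(shape, (0, 0, 0, 0)):
    packed = contiguous_strides(shape)  # strides a dense copy would have
```

In a PyTorch wrapper, the analogous step would be to call `.contiguous()` on the gradient tensor before passing it on, at the cost of materializing a dense copy.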
_Downstream PyTorch issue:_ https://github.com/pytorch/pytorch/issues/133780 **Describe the bug** cuDNN frontend rejects batch_size=0 input with `CUDNN_STATUS_BAD_PARAM`. **Expected behavior** cuDNN should return a tensor of shape [0, num_head, sequence_length, dims_per_head], or something like that,...