Matthew Nicely
@leofang Not dumb at all :smile: it's just personal preference. I like how it catches illegal narrowing at compile time.
@MoFHeka @arogozhnikov can you both try again with the latest nightlies? The following should work:

```
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```

with

```
import torch...
```
Thanks @katherineding, I'll get this fixed soon.
cuBLAS relies on [heuristics](https://developer.nvidia.com/blog/introducing-grouped-gemm-apis-in-cublas-and-more-performance-updates/#runtime_heuristics) to find the best kernel based on the input parameters. Heuristics return the best kernels 90+% of the time. You can autotune on top of this...
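The "autotune on top of this" idea can be sketched generically: take the heuristic-ranked candidate kernels, time each on your actual problem shapes, and keep the winner. A minimal Python sketch; the candidate functions and timing harness below are illustrative stand-ins, not cuBLAS/cuBLASLt APIs:

```python
import time

def autotune(candidates, args, warmup=2, iters=5):
    """Time each candidate on the real inputs and return the fastest.

    `candidates` is an ordered list of callables, best-first according to
    the heuristic; autotuning refines that ranking with measurements.
    """
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        for _ in range(warmup):  # warm up before timing
            fn(*args)
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn

# Illustrative stand-ins for two heuristic-ranked GEMM kernels:
def naive_matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def zip_matmul(a, b):
    bt = list(zip(*b))  # transpose B once up front
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

a = [[1.0] * 32 for _ in range(32)]
b = [[2.0] * 32 for _ in range(32)]
winner = autotune([naive_matmul, zip_matmul], (a, b))
```

With cuBLASLt specifically, the analogous flow is to request several results from the heuristic query and benchmark each returned algo on your shapes.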
Hi @sleepwalker2017, sorry I dropped the ball on this. What you're seeing in the manual is that cuBLAS shifted its GEMM development effort to cuBLASLt a few years ago. This...
@dfyz, cuBLAS will resolve this issue with Grouped GEMM in an upcoming release. I agree it would be good to fix in CUTLASS, but we'll need to revisit when we...
Hi @dfyz, it's more us spending time to review the code and any ripple effects, internal verification and testing, and then productization. Combine this with high-priority tasks and bugs;...
> possible clang version mangles something?

It's possible we don't support cuda-clang. You may want to reach out to the XLA team.
@ssiu

> In other words, does stream-K outperform cuBLAS for GEMM with a large number of blocks?

cuBLAS uses Stream-K.
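For context, the core of Stream-K is its work partitioning: instead of assigning whole output tiles to CTAs (which rounds work up to full "waves" when the tile count doesn't divide the SM count), it splits the flattened MAC-loop iterations evenly across SMs. A small sketch of that partitioning arithmetic, assuming the simplified model of one CTA per SM and a hypothetical `stream_k_partition` helper:

```python
def stream_k_partition(tiles_m, tiles_n, iters_k, num_sms):
    """Split a GEMM's total MAC-loop iterations evenly across SMs.

    Tile-per-CTA scheduling of, say, 9 output tiles on 8 SMs costs two
    full waves; Stream-K instead gives each SM a contiguous slice of the
    flattened (tile, k-iteration) work space, so per-SM work differs by
    at most one iteration. Partial tiles are later fixed up by reducing
    across the SMs that contributed to the same output tile.
    """
    total = tiles_m * tiles_n * iters_k      # total MAC iterations
    base, extra = divmod(total, num_sms)
    ranges, start = [], 0
    for sm in range(num_sms):
        n = base + (1 if sm < extra else 0)  # spread the remainder
        ranges.append((start, start + n))
        start += n
    return ranges

# 3x3 grid of output tiles, 4 K iterations per tile, 8 SMs:
parts = stream_k_partition(3, 3, 4, 8)
```

Here 36 total iterations land as slices of 5 or 4 iterations per SM, rather than 9 tiles costing a ragged second wave.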
cuDNN SDPA doesn't support Turing GPUs.