Matthew Nicely
@leofang Not dumb at all :smile: it's just personal preference. I like how it catches illegal narrowing at compile time.
@MoFHeka @arogozhnikov can you both try again with the latest nightlies? The following should work:

```
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```

with

```
import torch...
```
Thanks @katherineding, I'll get this fixed soon.
cuBLAS relies on [heuristics](https://developer.nvidia.com/blog/introducing-grouped-gemm-apis-in-cublas-and-more-performance-updates/#runtime_heuristics) to find the best kernel based on the input parameters. Heuristics return the best kernels 90+% of the time. You can autotune on top of this...
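The "autotune on top of this" idea can be sketched generically: take the heuristic-ranked candidate kernels, time each on your actual problem shapes, and keep the winner. A minimal Python sketch; the candidate functions and timing harness below are illustrative stand-ins, not cuBLAS/cuBLASLt APIs:

```python
import time

def autotune(candidates, args, warmup=2, iters=5):
    """Time each candidate on the real inputs and return the fastest.

    `candidates` is an ordered list of callables, best-first according to
    the heuristic; autotuning refines that ranking with measurements.
    """
    best_fn, best_time = None, float("inf")
    for fn in candidates:
        for _ in range(warmup):  # warm up before timing
            fn(*args)
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        elapsed = (time.perf_counter() - start) / iters
        if elapsed < best_time:
            best_fn, best_time = fn, elapsed
    return best_fn

# Illustrative stand-ins for two heuristic-ranked GEMM kernels:
def naive_matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def zip_matmul(a, b):
    bt = list(zip(*b))  # transpose B once up front
    return [[sum(x * y for x, y in zip(row, col)) for col in bt]
            for row in a]

a = [[1.0] * 32 for _ in range(32)]
b = [[2.0] * 32 for _ in range(32)]
winner = autotune([naive_matmul, zip_matmul], (a, b))
```

With cuBLASLt specifically, the analogous flow is to request several results from the heuristic query and benchmark each returned algo on your shapes.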
Hi @sleepwalker2017, sorry I dropped the ball on this. What you're seeing in the manual is that cuBLAS shifted its GEMM development effort to cuBLASLt a few years ago. This...
@dfyz, cuBLAS will resolve this issue with Grouped GEMM in an upcoming release. I agree it would be good to fix in CUTLASS, but we'll need to revisit when we...
Hi @dfyz, it's more us spending time to review the code and any ripple effects, internal verification and testing, and then productization. Combine this with high-priority tasks and bugs;...
> possible clang version mangles something?

It's possible we don't support cuda-clang. You may want to reach out to the XLA team.
@ssiu

> In other words, does stream-K outperform cuBLAS for GEMM with a large number of blocks?

cuBLAS uses Stream-K.
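For context, the core of Stream-K is its work partitioning: instead of assigning whole output tiles to CTAs (which rounds work up to full "waves" when the tile count doesn't divide the SM count), it splits the flattened MAC-loop iterations evenly across SMs. A small sketch of that partitioning arithmetic, assuming the simplified model of one CTA per SM and a hypothetical `stream_k_partition` helper:

```python
def stream_k_partition(tiles_m, tiles_n, iters_k, num_sms):
    """Split a GEMM's total MAC-loop iterations evenly across SMs.

    Tile-per-CTA scheduling of, say, 9 output tiles on 8 SMs costs two
    full waves; Stream-K instead gives each SM a contiguous slice of the
    flattened (tile, k-iteration) work space, so per-SM work differs by
    at most one iteration. Partial tiles are later fixed up by reducing
    across the SMs that contributed to the same output tile.
    """
    total = tiles_m * tiles_n * iters_k      # total MAC iterations
    base, extra = divmod(total, num_sms)
    ranges, start = [], 0
    for sm in range(num_sms):
        n = base + (1 if sm < extra else 0)  # spread the remainder
        ranges.append((start, start + n))
        start += n
    return ranges

# 3x3 grid of output tiles, 4 K iterations per tile, 8 SMs:
parts = stream_k_partition(3, 3, 4, 8)
```

Here 36 total iterations land as slices of 5 or 4 iterations per SM, rather than 9 tiles costing a ragged second wave.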
cuDNN SDPA doesn't support Turing GPUs.