Matthew Nicely

Results 113 comments of Matthew Nicely

@leofang Not dumb at all :smile: it's just personal preference. I like how it catches illegal narrowing at compile time.
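For anyone skimming: the snippet doesn't say which construct is meant, so the following is only a sketch under the assumption that the preference is C++ brace (list) initialization. The function names are hypothetical and not from the thread. It shows how braces reject a narrowing conversion at compile time, while `=` initialization truncates silently.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: the "it" in the comment is brace (list) initialization.
// All names below are illustrative only.

// Brace initialization from a constant that fits the target type is fine.
constexpr std::int8_t fits_in_int8() {
    std::int8_t v{100};  // OK: 100 is representable in int8_t
    return v;
}

// These lines are ill-formed (narrowing conversions), so they stay commented out:
// std::int8_t too_big{300};  // 300 does not fit in int8_t
// int from_double{3.7};      // double -> int narrowing

// Copy initialization accepts the same conversion and drops the fraction.
inline int truncating_init(double d) {
    int n = d;  // compiles; fractional part is discarded
    return n;
}

// When truncation is intended, an explicit cast keeps brace initialization legal.
inline int explicit_init(double d) {
    int n{static_cast<int>(d)};
    return n;
}

// Small self-check of the behaviour described in the comments above.
inline void check_examples() {
    static_assert(fits_in_int8() == 100, "value fits, so braces accept it");
    assert(truncating_init(3.7) == 3);
    assert(explicit_init(3.7) == 3);
}
```

Uncommenting either ill-formed line should produce a narrowing diagnostic from a conforming compiler, which is the compile-time catch being described.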

@MoFHeka @arogozhnikov can you both try again with the latest nightlies? The following should work

```
pip3 install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```

with

```
import torch...
```

Thanks @katherineding, I'll get this fixed and updated soon.

cuBLAS relies on [heuristics](https://developer.nvidia.com/blog/introducing-grouped-gemm-apis-in-cublas-and-more-performance-updates/#runtime_heuristics) to find the best kernel based on the input parameters. Heuristics return the best kernels 90+% of the time. You can autotune on top of this...

Hi @sleepwalker2017, sorry I dropped the ball on this. What you're seeing in the manual reflects cuBLAS shifting its efforts to cuBLASLt for power-usage of GEMMs a few years ago. This...

@dfyz, cuBLAS will resolve this issue with Grouped GEMM in an upcoming release. I agree it would be good to fix in CUTLASS, but we'll need to revisit when we...

Hi @dfyz, it's more a matter of us spending time to review the code and any ripple effects, internal verification and testing, and then productization. Combine this with high priority tasks and bugs;...

> possible clang version mangles something ?

It's possible we don't support cuda-clang. You may want to reach out to the XLA team.

@ssiu

> In other words does Stream-K outperform cuBLAS for GEMM with a large number of blocks?

cuBLAS uses Stream-K