Aleksandar Samardžić
This PR updates the CUTLASS-based sparse semi-structured GEMM implementation: it replaces use of the `SparseGemmRowBroadcast` GEMM variation with the recently added EVT epilogue support for sparse GEMM - the former was pretty much...
> @alexsamardzic - We'll want to update to the next version of CUTLASS before we can pull this in. Do you know when the planned release is? Is the required...
Merged into main, along with the CUTLASS update to 3.4.1, through [PR 120434](https://github.com/pytorch/pytorch/pull/120434).
My initial findings, on a Paperspace machine with an A100 and CUDA SDK 11.7.1: 1. The code will report an error in the `run()` method, in `CusparseLtKernels.cu`. When I did some...
Sorry, by tests passing I meant replacing `[1, 1, 0, 0]` and `[17476]` in the benchmarking script above with values corresponding to other sparsity patterns, and re-running the script -...
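For reference, here is how I read the metadata encoding (my own sketch, not code from this PR): each group of four elements keeps two nonzeros, and the metadata stores their positions as two 2-bit indices per group, so the pattern `[1, 1, 0, 0]` (nonzeros at positions 0 and 1) encodes to the nibble `0b0100 = 4`, and packing that nibble for four groups into a 16-bit meta element gives `0x4444 = 17476`:

```cpp
#include <cstdint>
#include <cstdio>

// Encode a 2:4 sparsity pattern (mask over 4 elements, exactly two set)
// into a 4-bit metadata nibble: two 2-bit indices of the kept elements,
// lower index in the low bits.  This is a sketch of my understanding of
// the encoding; the actual layout is defined by CUTLASS/cuSPARSELt.
static uint16_t encode_group(const int mask[4]) {
  int idx[2], n = 0;
  for (int i = 0; i < 4; ++i)
    if (mask[i]) idx[n++] = i;
  return static_cast<uint16_t>(idx[0] | (idx[1] << 2));
}

int main() {
  const int pattern[4] = {1, 1, 0, 0};   // keep elements 0 and 1
  uint16_t nibble = encode_group(pattern);
  // A 16-bit meta element covers four groups; replicate the nibble.
  uint16_t meta = 0;
  for (int g = 0; g < 4; ++g)
    meta = static_cast<uint16_t>(meta | (nibble << (4 * g)));
  // Prints: nibble = 0x4, meta = 17476 (0x4444)
  printf("nibble = 0x%x, meta = %u (0x%04x)\n",
         (unsigned)nibble, (unsigned)meta, (unsigned)meta);
  return 0;
}
```

Other patterns just change the two indices per group, which is what the replacement values in the benchmarking script correspond to.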
The reason that the `two_four_sparse` test doesn't work, and in general that the code won't work in most cases, is that our version of `reorder_meta()` assumes that the reordered meta tensor has...
The problem now is how to expose the CUTLASS tensor to Python; I'll look into this and push when I find a satisfactory solution.
CUTLASS tensors have separate layout objects, so some kind of serialization would have to be implemented for these in order to be able to pass them to Python. Thus...
I was suspicious of these numbers too, and have already experimented with some other combinations, including the ones from the `15_ampere_sparse_tensorop_gemm` CUTLASS example, as well as the ones used by `cutlass_profiler` for the m=n=512, k=1024 case...
These tile sizes (found using `cutlass_profiler`) provide at least 10% better performance than dense multiplication. Some hints for tile size selection (see the sketch below for where these shapes plug in): - check the tuning guide for the given datatype and...
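To make the tile-size knobs concrete, here is a minimal sketch of where the threadblock, warp and instruction shapes plug into a CUTLASS sparse GEMM instantiation; the types roughly follow the `15_ampere_sparse_tensorop_gemm` example, and the concrete shapes below are just placeholders rather than the tuned values from this PR:

```cpp
#include "cutlass/gemm/device/gemm_sparse.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"

// FP16 inputs, FP32 accumulation on Ampere sparse tensor cores.
using ElementInputA = cutlass::half_t;
using ElementInputB = cutlass::half_t;
using ElementOutput = cutlass::half_t;
using ElementAccumulator = float;

using LayoutInputA = cutlass::layout::RowMajor;
using LayoutInputB = cutlass::layout::ColumnMajor;
using LayoutOutput = cutlass::layout::RowMajor;

// The three shapes below are the tile sizes being tuned:
// threadblock tile, warp tile and tensor-core instruction shape.
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 64>;
using WarpShape        = cutlass::gemm::GemmShape<64, 64, 64>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 32>;

using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,
    128 / cutlass::sizeof_bits<ElementOutput>::value,
    ElementAccumulator,
    ElementAccumulator>;

using Gemm = cutlass::gemm::device::SparseGemm<
    ElementInputA, LayoutInputA,
    ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput,
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    ThreadblockShape, WarpShape, InstructionShape,
    EpilogueOp,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    /*Stages=*/3>;
```

The instruction shape is fixed by the datatype and architecture (m16n8k32 for FP16 sparse tensor ops on SM80), so the tuning is mostly about the threadblock and warp tiles and the number of stages.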