Aleksandar Samardžić
This PR updates the CUTLASS-based sparse semi-structured GEMM implementation: it replaces use of the `SparseGemmRowBroadcast` GEMM variation with the recently added EVT epilogue support for sparse GEMM - the former was pretty much...
> @alexsamardzic - We'll want to update to the next version of CUTLASS before we can pull this in. Do you know when the planned release is? Is the required...
Merged into main, along with the CUTLASS update to 3.4.1, through [PR 120434](https://github.com/pytorch/pytorch/pull/120434).
My initial findings, on a Paperspace machine with an A100 and CUDA SDK 11.7.1: 1. The code will report an error in the `run()` method, in `CusparseLtKernels.cu`. When I did some...
Sorry, by tests passing I meant replacing `[1, 1, 0, 0]` and `[17476]` in the benchmarking script above with values corresponding to other sparsity patterns, and re-running the script -...
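For reference, here is how I read the metadata encoding (my own sketch, not code from this PR): each group of four elements keeps two nonzeros, and the metadata stores their positions as two 2-bit indices per group, so the pattern `[1, 1, 0, 0]` (nonzeros at positions 0 and 1) encodes to the nibble `0b0100 = 4`, and packing that nibble for four groups into a 16-bit meta element gives `0x4444 = 17476`:

```cpp
#include <cstdint>
#include <cstdio>

// Encode a 2:4 sparsity pattern (mask over 4 elements, exactly two set)
// into a 4-bit metadata nibble: two 2-bit indices of the kept elements,
// lower index in the low bits.  This is a sketch of my understanding of
// the encoding; the actual layout is defined by CUTLASS/cuSPARSELt.
static uint16_t encode_group(const int mask[4]) {
  int idx[2], n = 0;
  for (int i = 0; i < 4; ++i)
    if (mask[i]) idx[n++] = i;
  return static_cast<uint16_t>(idx[0] | (idx[1] << 2));
}

int main() {
  const int pattern[4] = {1, 1, 0, 0};   // keep elements 0 and 1
  uint16_t nibble = encode_group(pattern);
  // A 16-bit meta element covers four groups; replicate the nibble.
  uint16_t meta = 0;
  for (int g = 0; g < 4; ++g)
    meta = static_cast<uint16_t>(meta | (nibble << (4 * g)));
  // Prints: nibble = 0x4, meta = 17476 (0x4444)
  printf("nibble = 0x%x, meta = %u (0x%04x)\n",
         (unsigned)nibble, (unsigned)meta, (unsigned)meta);
  return 0;
}
```

Other patterns just change the two indices per group, which is what the replacement values in the benchmarking script correspond to.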
The reason that the `two_four_sparse` test doesn't work, and in general that the code won't work in most cases, is that our version of `reorder_meta()` assumes that the reordered meta tensor has...
The problem now is how to expose the CUTLASS tensor to Python; I'll look into this and push when I find a satisfactory solution.
CUTLASS tensors have separate layout objects, so some kind of serialization would have to be implemented for these in order to be able to pass them to Python. Thus...
I was suspicious of these numbers too, and have already experimented with some other combinations, including the ones from the `15_ampere_sparse_tensorop_gemm` CUTLASS example, as well as the ones used by `cutlass_profiler` for the m=n=512, k=1024 case...
These tile sizes (found using `cutlass_profiler`) provide at least 10% better performance than dense multiplication. Some hints for tile size selection (see the sketch below for where these shapes plug in): - check the tuning guide for the given datatype and...
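To make the tile-size knobs concrete, here is a minimal sketch of where the threadblock, warp and instruction shapes plug into a CUTLASS sparse GEMM instantiation; the types roughly follow the `15_ampere_sparse_tensorop_gemm` example, and the concrete shapes below are just placeholders rather than the tuned values from this PR:

```cpp
#include "cutlass/gemm/device/gemm_sparse.h"
#include "cutlass/epilogue/thread/linear_combination.h"
#include "cutlass/gemm/threadblock/threadblock_swizzle.h"

// FP16 inputs, FP32 accumulation on Ampere sparse tensor cores.
using ElementInputA = cutlass::half_t;
using ElementInputB = cutlass::half_t;
using ElementOutput = cutlass::half_t;
using ElementAccumulator = float;

using LayoutInputA = cutlass::layout::RowMajor;
using LayoutInputB = cutlass::layout::ColumnMajor;
using LayoutOutput = cutlass::layout::RowMajor;

// The three shapes below are the tile sizes being tuned:
// threadblock tile, warp tile and tensor-core instruction shape.
using ThreadblockShape = cutlass::gemm::GemmShape<128, 128, 64>;
using WarpShape        = cutlass::gemm::GemmShape<64, 64, 64>;
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 32>;

using EpilogueOp = cutlass::epilogue::thread::LinearCombination<
    ElementOutput,
    128 / cutlass::sizeof_bits<ElementOutput>::value,
    ElementAccumulator,
    ElementAccumulator>;

using Gemm = cutlass::gemm::device::SparseGemm<
    ElementInputA, LayoutInputA,
    ElementInputB, LayoutInputB,
    ElementOutput, LayoutOutput,
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm80,
    ThreadblockShape, WarpShape, InstructionShape,
    EpilogueOp,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
    /*Stages=*/3>;
```

The instruction shape is fixed by the datatype and architecture (m16n8k32 for FP16 sparse tensor ops on SM80), so the tuning is mostly about the threadblock and warp tiles and the number of stages.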