Jack Kosaian

62 comments by Jack Kosaian

Also, your version that uses `GemmUniversal` will need to perform a second reduction kernel after calling the GEMM in order to reduce the partial outputs. You can see an example...
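
To make the two-step structure concrete, here is a minimal CPU sketch of a split-K GEMM followed by a separate reduction. It is not CUTLASS code: the helper names `splitk_partials` and `reduce_partials`, the row-major layouts, and the workspace layout (one full M x N partial per split, stored back to back) are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Step 1 (stands in for the GEMM kernel; hypothetical helper): each split
// multiplies only its own slice of the K dimension and writes a full M x N
// partial result into the workspace.
std::vector<float> splitk_partials(const std::vector<float>& A,  // M x K, row-major
                                   const std::vector<float>& B,  // K x N, row-major
                                   std::size_t M, std::size_t N, std::size_t K,
                                   std::size_t splits) {
  assert(splits > 0);
  std::vector<float> workspace(splits * M * N, 0.0f);
  const std::size_t k_per_split = (K + splits - 1) / splits;  // ceiling division
  for (std::size_t s = 0; s < splits; ++s) {
    const std::size_t k_begin = std::min(K, s * k_per_split);
    const std::size_t k_end = std::min(K, k_begin + k_per_split);
    for (std::size_t m = 0; m < M; ++m) {
      for (std::size_t n = 0; n < N; ++n) {
        float acc = 0.0f;
        for (std::size_t k = k_begin; k < k_end; ++k) {
          acc += A[m * K + k] * B[k * N + n];
        }
        workspace[(s * M + m) * N + n] = acc;
      }
    }
  }
  return workspace;
}

// Step 2 (stands in for the second reduction kernel; hypothetical helper):
// sums the per-split partial outputs element by element.
std::vector<float> reduce_partials(const std::vector<float>& workspace,
                                   std::size_t M, std::size_t N,
                                   std::size_t splits) {
  assert(workspace.size() == splits * M * N);
  std::vector<float> D(M * N, 0.0f);
  for (std::size_t s = 0; s < splits; ++s) {
    for (std::size_t i = 0; i < M * N; ++i) {
      D[i] += workspace[s * M * N + i];
    }
  }
  return D;
}
```

Until step 2 runs, no single buffer holds the complete product; each split only holds its share of the sum over K, which is why the extra reduction pass is needed in this model.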

Yes, it is possible. For example, if you configure CUTLASS with `cmake .. -DCUTLASS_NVCC_ARCHS="80"`, you can see an INT8 Conv2d kernel emitted under `tools/library/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu`

As Vijay mentioned, that is the right scheduler to use. Here's a diff that I just used to adapt example 67 (groupwise) to use the stream-K scheduler: `index d6de7f89..556e74c7...`

If you'd like to use a split-K decomposition, you can set the `splits` argument as done in the Blackwell stream-K example [here](https://github.com/NVIDIA/cutlass/blob/affd1b693dfc121c51118cbc8583dfd308227ca6/examples/74_blackwell_gemm_streamk/blackwell_gemm_streamk.cu#L435). You can also consider using non-deterministic reduction, which...

Yes, the memset is necessary. Stream-K uses counters in global memory for determining the order in which CTAs can accumulate their partial results. These counters need to be initialized to...
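
The role of the counters can be illustrated with a small single-threaded model. This is a sketch, not the CUTLASS implementation: the `Tile` struct and `accumulate_in_order` are hypothetical, and the model assumes the counter must start at zero, which the truncated comment above does not confirm. In the model, a CTA with rank `r` may add its partial result only once the counter equals `r`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one output tile. On a GPU the counter would live in
// global memory and would have to be set before the kernel launches.
struct Tile {
  std::size_t counter;  // number of CTAs that have accumulated so far
  float value;          // running sum of partial results
};

// Visits CTAs in the given arrival order, repeatedly, until every CTA has
// accumulated or no CTA can make progress. A CTA proceeds only when the
// counter equals its rank, then hands the turn to the next rank.
// Returns how many CTAs managed to accumulate.
std::size_t accumulate_in_order(Tile& tile,
                                const std::vector<std::size_t>& arrival_order,
                                const std::vector<float>& partials) {
  std::vector<bool> done(partials.size(), false);
  std::size_t finished = 0;
  bool progress = true;
  while (progress && finished < partials.size()) {
    progress = false;
    for (std::size_t rank : arrival_order) {
      if (!done[rank] && tile.counter == rank) {
        tile.value += partials[rank];
        tile.counter = rank + 1;
        done[rank] = true;
        ++finished;
        progress = true;
      }
    }
  }
  return finished;
}
```

With the counter set to zero (the assumed initial value), all CTAs accumulate in rank order regardless of arrival order. With a stale counter, standing in for uninitialized memory, no CTA ever sees its turn and nothing is accumulated, which is the failure the memset guards against in this model.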

Thanks for reporting. This is due to a bug in the CUTLASS 3.x implementation of "separate reduction." For the time being, you can circumvent this with the following change, which...

There is no timeline for when the separate reduction implementation will be fixed. We plan to roll out the patch I described soon, though. There is no performance implication because,...

Thanks. Would you be willing to contribute a PR to fix this?

I don't have an example of how to prepare int4 data correctly in Python. There has been some discussion about this in the past here: https://github.com/NVIDIA/cutlass/issues/756