Jack Kosaian

jackkosaian.github.io

Results: 62 comments of Jack Kosaian

[QST] GemmUniversal is slower than GemmSplitKParallel when M and N are small and K is large

Also, your version that uses `GemmUniversal` will need to launch a second reduction kernel after calling the GEMM in order to reduce the partial outputs. You can see an example...
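To illustrate why a second reduction pass is needed, here is a minimal pure-Python sketch of a split-K decomposition. It is a conceptual model only, not CUTLASS code: the function names `matmul`, `split_k_partials`, and `reduce_partials` are hypothetical, and each "slice" stands in for what a separate group of thread blocks would compute.

```python
def matmul(a, b):
    """Plain matrix product of two row-major lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def split_k_partials(a, b, splits):
    """Return one partial M x N product per slice of the K dimension."""
    k = len(b)
    step = (k + splits - 1) // splits  # ceil-divide K across the slices
    partials = []
    for start in range(0, k, step):
        a_slice = [row[start:start + step] for row in a]
        b_slice = b[start:start + step]
        partials.append(matmul(a_slice, b_slice))
    return partials


def reduce_partials(partials):
    """Second pass: sum the partial outputs elementwise."""
    m, n = len(partials[0]), len(partials[0][0])
    return [[sum(p[i][j] for p in partials) for j in range(n)]
            for i in range(m)]
```

Each partial holds only part of the dot product for every output element, so the result is not complete until `reduce_partials` has run over all slices.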

[QST] Integer Data Types are available for Conv2d fprop?

Yes, it is possible. For example, if you configure CUTLASS with `cmake .. -DCUTLASS_NVCC_ARCHS="80"`, you can see an INT8 Conv2d kernel emitted under `tools/library/generated/conv2d/80/i16832fprop_optimized_s8/cutlass_tensorop_i16832fprop_optimized_s8_256x128_64x3_nhwc_align16.cu`

Feature Request: Enable default authentication method with Amazon EC2 Instance Profile for Amazon Bedrock LLM provider

As Vijay mentioned, that is the right scheduler to use. Here's a diff that I just used to adapt example 67 (groupwise) to use the stream-K scheduler: ```diff index d6de7f89..556e74c7...

Feature Request: Enable default authentication method with Amazon EC2 Instance Profile for Amazon Bedrock LLM provider

If you'd like to use a split-K decomposition, you can set the `splits` argument as done in the Blackwell stream-K example [here](https://github.com/NVIDIA/cutlass/blob/affd1b693dfc121c51118cbc8583dfd308227ca6/examples/74_blackwell_gemm_streamk/blackwell_gemm_streamk.cu#L435). You can also consider using non-deterministic reduction, which...

Feature Request: Enable default authentication method with Amazon EC2 Instance Profile for Amazon Bedrock LLM provider

Yes, the memset is necessary. Stream-K uses counters in global memory for determining the order in which CTAs can accumulate their partial results. These counters need to be initialized to...

[BUG] TMA Cooperative GeMM with Stream-K scheduler hangs for specific gemm shapes

Thanks for reporting. This is due to a bug in the CUTLASS 3.x implementation of "separate reduction." For the time being, you can circumvent this with the following change, which...

[BUG] TMA Cooperative GeMM with Stream-K scheduler hangs for specific gemm shapes

There is no timeline for when the separate reduction implementation will be fixed. We plan to roll out the patch I described soon, though. There is no performance implication because,...

[FEA][torchinductor-EVT] Allow function source code to be passed directly to EVT tracer

@apuaachen, can you help take a look at this EVT request?

[BUG] Cutlass python epilogue doesn't work with BF16

Thanks. Would you be willing to contribute a PR to fix this?

[QST] How to pack int4 tensor correctly in PyTorch

I don't have an example of how to prepare int4 data correctly in Python. There has been some discussion about this in the past here: https://github.com/NVIDIA/cutlass/issues/756
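For orientation only, the sketch below shows one common way to pack signed 4-bit values two per byte in plain Python. The nibble order (even-indexed element in the low four bits) is an assumption, and the helper names `pack_int4` and `unpack_int4` are hypothetical. Whether this matches the layout a particular CUTLASS kernel expects should be checked against the discussion linked above.

```python
def pack_int4(values):
    """Pack signed 4-bit ints (-8..7) two per byte.

    Assumed layout: the even-indexed element goes in the low nibble.
    """
    if len(values) % 2:
        raise ValueError("need an even number of values")
    if any(v < -8 or v > 7 for v in values):
        raise ValueError("values must fit in signed 4 bits")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        # Mask to 4 bits so negatives keep their two's-complement form.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed):
    """Inverse of pack_int4: recover the signed 4-bit values."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values
```

A round trip through `unpack_int4(pack_int4(values))` is a quick way to check that the two helpers agree on the assumed layout.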
