Haicheng Wu

323 comments by Haicheng Wu

It is doable; we just haven't done it yet. Do you want to add an m×n matrix or a per-channel bias vector?
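The difference between the two options is only in how the epilogue indexes the bias term. A minimal sketch in plain C++ (not the CUTLASS API; names are illustrative), with D = alpha * accumulator + bias:

```cpp
#include <cstddef>
#include <vector>

// Per-channel bias: one value per output column (channel), broadcast down
// every row of the m-by-n output tile.
void epilogue_per_channel_bias(std::vector<float>& d, const std::vector<float>& acc,
                               const std::vector<float>& bias, // size n
                               std::size_t m, std::size_t n, float alpha) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            d[i * n + j] = alpha * acc[i * n + j] + bias[j]; // broadcast over rows
}

// Full m-by-n bias: an independent value for every output element.
void epilogue_matrix_bias(std::vector<float>& d, const std::vector<float>& acc,
                          const std::vector<float>& bias, // size m * n
                          std::size_t m, std::size_t n, float alpha) {
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            d[i * n + j] = alpha * acc[i * n + j] + bias[i * n + j];
}
```

The per-channel vector is the common case for conv/GEMM fusion, since it only needs n extra values loaded per tile instead of m×n.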

Currently, we assume the second GEMM's problem size k is a multiple of the threadblock tile size k. We can fix this pretty quickly. Until then, you can first use the...
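Until the restriction is lifted, one workaround is to check the constraint up front, or zero-pad the operands so k becomes a multiple of the tile size (padding with zeros leaves the GEMM result unchanged). A sketch, assuming a threadblock tile k of 32 (the actual value depends on the kernel configuration):

```cpp
#include <cstddef>

// Assumed threadblock tile size in the k dimension; check your kernel's
// ThreadblockShape::kK for the real value.
constexpr std::size_t kThreadblockTileK = 32;

// The current restriction: the second GEMM's k must divide evenly.
constexpr bool k_is_supported(std::size_t k) {
    return k % kThreadblockTileK == 0;
}

// Round k up to the next multiple of the tile size; zero-padding the
// operands up to this size keeps the math identical.
constexpr std::size_t round_up_k(std::size_t k) {
    return (k + kThreadblockTileK - 1) / kThreadblockTileK * kThreadblockTileK;
}
```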

The code you posted belongs to CUTLASS 0.1; the current CUTLASS looks very different. Here is how the top level looks if you use Tensor Cores: https://github.com/NVIDIA/cutlass/blob/master/examples/14_ampere_tf32_tensorop_gemm/ampere_tf32_tensorop_gemm.cu#L212-L226 thread number...

Different problem sizes need different tile sizes. You can use the CUTLASS profiler to find them. Here is the doc: https://github.com/NVIDIA/cutlass/blob/master/media/docs/profiler.md You can use `cmake .. -DCUTLASS_NVCC_ARCHS="75" -DCUTLASS_LIBRARY_KERNELS=sgemm` to only generate...
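A typical end-to-end flow, following the linked profiler doc (the problem size below is an example; substitute your own):

```shell
# Generate only the SGEMM kernels for SM75 to keep the build small.
cmake .. -DCUTLASS_NVCC_ARCHS="75" -DCUTLASS_LIBRARY_KERNELS=sgemm
make cutlass_profiler -j

# Profile every generated sgemm kernel at your problem size; the report
# ranks tile configurations by achieved throughput.
./tools/profiler/cutlass_profiler --kernels=sgemm --m=1024 --n=1024 --k=1024
```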

`LinearCombinationRelu` has a default value for `beta`; `LinearCombinationSilu` does not. I can add one very quickly. To work around it, you can change this line (https://github.com/NVIDIA/cutlass/blob/master/examples/17_fprop_per_channel_bias/fprop_per_channel_bias.cu#L196) to `{alpha, ElementComputeEpilogue(0)}`
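The difference is just a defaulted constructor parameter. A stand-in sketch (these are not the real CUTLASS types) of why one epilogue accepts a single-argument initializer and the other needs `beta` spelled out:

```cpp
// Mimics an epilogue params struct whose beta is defaulted: {alpha} compiles.
struct ReluLikeParams {
    float alpha;
    float beta;
    ReluLikeParams(float alpha_, float beta_ = 0.0f) : alpha(alpha_), beta(beta_) {}
};

// Mimics one without the default: callers must write {alpha, 0.0f}
// explicitly, which is exactly the {alpha, ElementComputeEpilogue(0)}
// workaround above.
struct SiluLikeParams {
    float alpha;
    float beta;
    SiluLikeParams(float alpha_, float beta_) : alpha(alpha_), beta(beta_) {}
};
```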

`ThreadblockShape` and `WarpShape` are, as their names suggest, the tile sizes of the threadblock and the warp, respectively. `InstructionShape` is the size of the Tensor Core instruction. You can check the kernels generated by...
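The three shapes form nested tiling levels, and each level must evenly tile the one above it. A minimal stand-in for `cutlass::gemm::GemmShape`; the concrete sizes are example values for an f16 tensor-op kernel, not a recommendation:

```cpp
// Minimal mimic of cutlass::gemm::GemmShape<M, N, K>.
template <int M, int N, int K>
struct GemmShape {
    static constexpr int kM = M, kN = N, kK = K;
};

using ThreadblockShape = GemmShape<128, 128, 32>; // tile computed by one threadblock
using WarpShape        = GemmShape<64, 64, 32>;   // tile computed by one warp
using InstructionShape = GemmShape<16, 8, 16>;    // one Tensor Core mma instruction

// Warps tile the threadblock (2x2 warps here), and instructions tile the warp.
static_assert(ThreadblockShape::kM % WarpShape::kM == 0, "warps must tile the threadblock");
static_assert(WarpShape::kM % InstructionShape::kM == 0, "instructions must tile the warp");
```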

> That is to say, using Tensor Cores in CUTLASS with half requires the input and output channels to be 8-aligned in NHWC format

We support small alignment...
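Where the "8" comes from: the widest global-memory vector access is 128 bits, so the maximum alignment in elements is 128 divided by the element's bit width. A sketch of that arithmetic (the efficiency remark is an assumption of how smaller alignments behave, not a measured claim):

```cpp
// Maximum vectorized alignment, in elements, for a 128-bit memory access.
constexpr int max_alignment_elements(int bits_per_element) {
    return 128 / bits_per_element;
}

// half  (16-bit): 128 / 16 = 8 elements -> the "8-aligned NHWC" requirement
// float (32-bit): 128 / 32 = 4 elements
// Smaller alignments (4, 2, 1 for half) still work, just with narrower,
// less efficient memory instructions.
```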

I added a default for beta in https://github.com/NVIDIA/cutlass/commit/e49f690fd7969015343a2b5d72549848e760eb65

Your channel count is small, and your filter size is small too, so not much time is spent in the conv. SiLU is an expensive operation, so it can take the...
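For reference, SiLU (also called swish) is x * sigmoid(x). Unlike ReLU's single compare, each element needs an exp and a divide, which is why it can dominate the runtime when the conv itself is tiny:

```cpp
#include <cmath>

// SiLU / swish: x * sigmoid(x) = x / (1 + e^-x).
// One exp + one divide per element, versus one compare for ReLU.
inline float silu(float x) {
    return x / (1.0f + std::exp(-x));
}
```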