ziyu huang
Oh, thank you!! 1. Actually, I am not using Tensor Cores; if you have a version without them, that would be better~ I am actually using a GTX 1650 (Turing, sm75). 2. Also, I notice here...
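For anyone else in this situation: CUTLASS also exposes a pure CUDA-core path by selecting `cutlass::arch::OpClassSimt` instead of `OpClassTensorOp`. Below is only a minimal sketch under that assumption; the `float` element types and column-major layouts are illustrative defaults, not a configuration taken from this thread.

```cpp
#include <cutlass/gemm/device/gemm.h>

// Minimal sketch of a non-tensor-core GEMM: OpClassSimt selects the SIMT
// (plain CUDA core) math path, and the tile shapes fall back to CUTLASS
// defaults. Element types and layouts here are illustrative assumptions.
using SimtGemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor,   // C / D
    float,                                 // accumulator
    cutlass::arch::OpClassSimt,            // no tensor cores
    cutlass::arch::Sm75>;                  // Turing, e.g. GTX 1650
```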
> > we want to reduce between different warps?
>
> Correct.

So we cannot use a warp sync function like `__shfl_down_sync`? We will use atomicAdd to global...
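To illustrate the point about warps: `__shfl_down_sync` only exchanges values between lanes of one warp, so a partial sum is typically finished with shuffles inside each warp and then combined across warps (and blocks) with `atomicAdd`. The kernel below is just a hedged sketch of that pattern; the names and the grid-stride loop are my own, not code from this repository.

```cpp
#include <cuda_runtime.h>

// Sketch: per-warp reduction with __shfl_down_sync, then one atomicAdd per
// warp to combine results across warps and blocks. Assumes blockDim.x is a
// multiple of 32 and that *out has been zero-initialized before launch.
__global__ void reduce_sum(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates its own partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        v += in[i];
    }
    // Intra-warp reduction: lanes of a single warp exchange values.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    // Lane 0 of every warp now holds that warp's sum; warps cannot shuffle
    // with each other, so combine them through global memory instead.
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(out, v);
    }
}
```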
```cpp
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,   // A
    cutlass::half_t, cutlass::layout::RowMajor,      // B
    ElementOutput, cutlass::layout::RowMajor,        // C / D
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm75,
    cutlass::gemm::GemmShape<...>,                   // threadblock tile
    cutlass::gemm::GemmShape<...>,                   // warp tile
    cutlass::gemm::GemmShape<...>,                   // instruction shape
    cutlass::epilogue::thread::LinearCombination<
        ElementOutput,
        64 / cutlass::sizeof_bits<ElementOutput>::value,
        ElementAccumulator, ElementAccumulator>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
    2>;
```

I see this in...
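In case it helps to see that definition in context: a `device::Gemm` type like this is normally used by filling an `Arguments` struct and calling the operator. The following is only a sketch under assumptions I am making here, namely that the `Gemm` typedef above is in scope, that `ElementOutput`/`ElementAccumulator` are `float`, and that the pointers and leading dimensions are placeholders.

```cpp
// Hedged usage sketch for a Gemm typedef like the one above; sizes, pointers,
// and the float output/accumulator types are placeholder assumptions.
cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const* A,  // column-major, lda = M
                         cutlass::half_t const* B,  // row-major,    ldb = N
                         float* C,                  // row-major,    ldc = N
                         float alpha, float beta) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},       // problem size (GemmCoord m, n, k)
                       {A, M},          // TensorRef for A
                       {B, N},          // TensorRef for B
                       {C, N},          // TensorRef for C (epilogue source)
                       {C, N},          // TensorRef for D (output)
                       {alpha, beta});  // LinearCombination parameters
  return gemm_op(args);                 // initialize and launch the kernel
}
```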
Thank you!! One last question: how do we choose between split-K and slice-K? Do they cope with the **same** problem (large K) using different policies? Actually, I am implementing this: volta_sgemm_64x32_sliced1x4_nn(1, 96, 1)x(256,...
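My understanding, for what it's worth: slice-K (as in that `sliced1x4` kernel name) partitions K across the warps inside one threadblock, while split-K partitions K across threadblocks and then reduces the partial results, so they do target the same large-K situation at different levels of the hierarchy. In CUTLASS, serial split-K is just a launch-time knob; the sketch below reuses the placeholder names from the earlier usage sketch and assumes a `device::Gemm` type instantiated with split-K serial support, so it is an illustration rather than anything from this thread.

```cpp
// Hedged sketch: serial split-K is requested per launch through the last
// Arguments field. Assumes a Gemm type built with split-K serial enabled;
// M, N, K, A, B, C, lda, ldb, ldc, alpha, beta are placeholders as before.
int split_k_slices = 4;                   // partition K into 4 chunks

Gemm::Arguments args({M, N, K},
                     {A, lda},
                     {B, ldb},
                     {C, ldc},
                     {C, ldc},
                     {alpha, beta},
                     split_k_slices);     // > 1 turns on split-K

// Split-K may need scratch space for reducing the partial results.
size_t workspace_bytes = Gemm::get_workspace_size(args);
```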
For me, the solution was to buy an Ubuntu PC and install the NVIDIA packages. Then I could run the code!