ziyu huang
Oh, thank you!! 1. Actually, I am not using Tensor Cores; if you have a version without them, that would be better~ I am actually using a GTX 1650 (Turing, sm75). 2. Also, I notice here...
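For anyone else in this situation: CUTLASS also exposes a pure CUDA-core path by selecting `cutlass::arch::OpClassSimt` instead of `OpClassTensorOp`. Below is only a minimal sketch under that assumption; the `float` element types and column-major layouts are illustrative defaults, not a configuration taken from this thread.

```cpp
#include <cutlass/gemm/device/gemm.h>

// Minimal sketch of a non-tensor-core GEMM: OpClassSimt selects the SIMT
// (plain CUDA core) math path, and the tile shapes fall back to CUTLASS
// defaults. Element types and layouts here are illustrative assumptions.
using SimtGemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor,   // C / D
    float,                                 // accumulator
    cutlass::arch::OpClassSimt,            // no tensor cores
    cutlass::arch::Sm75>;                  // Turing, e.g. GTX 1650
```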
> > we want to reduce between different warps?
>
> Correct.

So we cannot use a warp sync function like `__shfl_down_sync`? We will use atomicAdd to global...
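To illustrate the point about warps: `__shfl_down_sync` only exchanges values between lanes of one warp, so a partial sum is typically finished with shuffles inside each warp and then combined across warps (and blocks) with `atomicAdd`. The kernel below is just a hedged sketch of that pattern; the names and the grid-stride loop are my own, not code from this repository.

```cpp
#include <cuda_runtime.h>

// Sketch: per-warp reduction with __shfl_down_sync, then one atomicAdd per
// warp to combine results across warps and blocks. Assumes blockDim.x is a
// multiple of 32 and that *out has been zero-initialized before launch.
__global__ void reduce_sum(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates its own partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        v += in[i];
    }
    // Intra-warp reduction: lanes of a single warp exchange values.
    for (int offset = 16; offset > 0; offset >>= 1) {
        v += __shfl_down_sync(0xffffffffu, v, offset);
    }
    // Lane 0 of every warp now holds that warp's sum; warps cannot shuffle
    // with each other, so combine them through global memory instead.
    if ((threadIdx.x & 31) == 0) {
        atomicAdd(out, v);
    }
}
```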
```cpp
using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::ColumnMajor,   // A
    cutlass::half_t, cutlass::layout::RowMajor,      // B
    ElementOutput, cutlass::layout::RowMajor,        // C / D
    ElementAccumulator,
    cutlass::arch::OpClassTensorOp,
    cutlass::arch::Sm75,
    cutlass::gemm::GemmShape<...>,                   // threadblock tile
    cutlass::gemm::GemmShape<...>,                   // warp tile
    cutlass::gemm::GemmShape<...>,                   // instruction shape
    cutlass::epilogue::thread::LinearCombination<
        ElementOutput,
        64 / cutlass::sizeof_bits<ElementOutput>::value,
        ElementAccumulator, ElementAccumulator>,
    cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle,
    2>;
```

I see this in...
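In case it helps to see that definition in context: a `device::Gemm` type like this is normally used by filling an `Arguments` struct and calling the operator. The following is only a sketch under assumptions I am making here, namely that the `Gemm` typedef above is in scope, that `ElementOutput`/`ElementAccumulator` are `float`, and that the pointers and leading dimensions are placeholders.

```cpp
// Hedged usage sketch for a Gemm typedef like the one above; sizes, pointers,
// and the float output/accumulator types are placeholder assumptions.
cutlass::Status run_gemm(int M, int N, int K,
                         cutlass::half_t const* A,  // column-major, lda = M
                         cutlass::half_t const* B,  // row-major,    ldb = N
                         float* C,                  // row-major,    ldc = N
                         float alpha, float beta) {
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},       // problem size (GemmCoord m, n, k)
                       {A, M},          // TensorRef for A
                       {B, N},          // TensorRef for B
                       {C, N},          // TensorRef for C (epilogue source)
                       {C, N},          // TensorRef for D (output)
                       {alpha, beta});  // LinearCombination parameters
  return gemm_op(args);                 // initialize and launch the kernel
}
```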
Thank you!! One last question: how do we choose between split-K and slice-K? Do they cope with the **same** problem (large K) using different policies? Actually, I am implementing this: volta_sgemm_64x32_sliced1x4_nn(1, 96, 1)x(256,...
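My understanding, for what it's worth: slice-K (as in that `sliced1x4` kernel name) partitions K across the warps inside one threadblock, while split-K partitions K across threadblocks and then reduces the partial results, so they do target the same large-K situation at different levels of the hierarchy. In CUTLASS, serial split-K is just a launch-time knob; the sketch below reuses the placeholder names from the earlier usage sketch and assumes a `device::Gemm` type instantiated with split-K serial support, so it is an illustration rather than anything from this thread.

```cpp
// Hedged sketch: serial split-K is requested per launch through the last
// Arguments field. Assumes a Gemm type built with split-K serial enabled;
// M, N, K, A, B, C, lda, ldb, ldc, alpha, beta are placeholders as before.
int split_k_slices = 4;                   // partition K into 4 chunks

Gemm::Arguments args({M, N, K},
                     {A, lda},
                     {B, ldb},
                     {C, ldc},
                     {C, ldc},
                     {alpha, beta},
                     split_k_slices);     // > 1 turns on split-K

// Split-K may need scratch space for reducing the partial results.
size_t workspace_bytes = Gemm::get_workspace_size(args);
```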
For me, the solution was to buy an Ubuntu PC and install the NVIDIA packages. Then I could run the code!