ghostplant
Is it possible to get the file `dump.kernel_input.ll` without following the full compilation procedure, i.e. without generating the host binary / hsaco binary? In other words, instead of `KMDUMPLLVM=1 hipcc main.cc -o useless.out`, I...
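A minimal sketch of one way to do this, assuming a clang-based `hipcc`: clang's HIP driver can emit only the device-side LLVM IR, skipping host codegen and hsaco linking entirely. The `--offload-arch` value below is just an example target.

```shell
# Emit device-side LLVM IR only; no host object or hsaco binary is produced.
# (Sketch: flags assume a clang-based hipcc; gfx908 is an example --offload-arch.)
hipcc --cuda-device-only -S -emit-llvm --offload-arch=gfx908 main.cc -o dump.kernel_input.ll
```

Because only the device pass runs, this is also noticeably faster than a full `KMDUMPLLVM=1` build.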
Source code: ```cpp #include __global__ void ConcatV22concatenate_kernel0( float* __restrict__ A75, float* __restrict__ A74, float* __restrict__ A73, float* __restrict__ A72, float* __restrict__ A71, float* __restrict__ A70, float* __restrict__ A69, float* __restrict__...
It usually takes 10-20 seconds to compile just one kernel (written in C device code). Is it possible to shorten this compilation time, and are there any side effects?
For AMD ROCm, is there a bank-conflict problem? Specifically: 1) Are there shared memory bank conflicts? What is the data stride between banks? 2) Are there local memory bank conflicts?...
I get just 15 TFLOPS on an A100 (sm80) and 6 TFLOPS on a 2080 Ti (sm75). With proper tuning, it should be possible to reach > 17 TFLOPS on the A100 and > 12 TFLOPS on the 2080 Ti,...
MIOpenPoolingBackwards requires an additional workspace while cuDNN doesn't, and this complicates porting from cuDNN to MIOpen, since it is sometimes not easy to pair the Forward and Backward invocations...
**Does CUTE for SM90 support BFloat16 MMA?** I am searching for an MMA instruction for `BF16BF16BF16` in https://github.com/NVIDIA/cutlass/blob/main/include/cute/atom/mma_traits_sm90_gmma.hpp#L2824, but none of the candidates offers a fully `BF16BF16BF16` choice.
### 🚀 The feature, motivation and pitch Currently, using the grouped_gemm operator requires either manually binding against the CUTLASS extension or installing non-CUTLASS extensions developed by third parties, none of which is...
For H200, what's the throughput of R1 671B at bs=1 without quantization?
Glad to see the great TRT-LLM update that largely improves H200x8 to 150 TPS for R1, but what I get locally is just 7 TPS. What's the correct...