ghostplant
Is it possible to get the file `dump.kernel_input.ll` without following the full compilation procedure, i.e. without generating the host binary / hsaco binary? In other words, instead of `KMDUMPLLVM=1 hipcc main.cc -o useless.out`, I...
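A minimal sketch of one way to do this, assuming a clang-based `hipcc`: clang's HIP driver can emit only the device-side LLVM IR, skipping host codegen and hsaco linking entirely. The `--offload-arch` value below is just an example target.

```shell
# Emit device-side LLVM IR only; no host object or hsaco binary is produced.
# (Sketch: flags assume a clang-based hipcc; gfx908 is an example --offload-arch.)
hipcc --cuda-device-only -S -emit-llvm --offload-arch=gfx908 main.cc -o dump.kernel_input.ll
```

Because only the device pass runs, this is also noticeably faster than a full `KMDUMPLLVM=1` build.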
Source code: ```cpp #include __global__ void ConcatV22concatenate_kernel0( float* __restrict__ A75, float* __restrict__ A74, float* __restrict__ A73, float* __restrict__ A72, float* __restrict__ A71, float* __restrict__ A70, float* __restrict__ A69, float* __restrict__...
It usually takes 10-20 seconds to compile just one kernel (written in C device code). Is it possible to shorten this compilation time, and are there any side effects?
For AMD ROCm, is there a bank-conflict problem? Specifically: 1) Are there shared memory bank conflicts? What is the data stride between banks? 2) Are there local memory bank conflicts?...
I get just 15 TFLOPS on an A100 (sm80) and 6 TFLOPS on a 2080 Ti (sm75). With proper tuning, it should be possible to reach > 17 TFLOPS on the A100 and > 12 TFLOPS on the 2080 Ti,...
MIOpenPoolingBackwards requires an additional workspace while cuDNN doesn't, and this complicates porting from cuDNN to MIOpen, since it is sometimes not easy to pair the Forward and Backward invocations...
**Does CUTE for SM90 support BFloat16 MMA?** I am searching for an MMA instruction for `BF16BF16BF16` in https://github.com/NVIDIA/cutlass/blob/main/include/cute/atom/mma_traits_sm90_gmma.hpp#L2824, but none of the candidates offers a fully `BF16BF16BF16` choice.
### 🚀 The feature, motivation and pitch Currently, using the grouped_gemm operator requires either manually binding against the CUTLASS extension or installing non-CUTLASS extensions developed by third parties, none of which is...
For H200, what's the throughput of R1 671B at bs=1 without quantization?
Glad to see the great TRT-LLM update that largely improves H200x8 to 150 TPS for R1, but what I get locally is just 7 TPS. What's the correct...