cloudhan
@zjing14 This is a fully correct pipeline that supports packed fp4 (2 `int4`s in a byte). It is used to demonstrate what might need to be changed to support subtype...
Sequence length 1 is extremely important for decoding (ASR, text generation, etc.). In onnxruntime, we found that rocBLAS GEMM + softmax kernel + rocBLAS GEMM is much faster for this case,...
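For reference, the decomposition being described is single-token attention computed as two GEMMs with a softmax in between. A minimal NumPy sketch of that structure (shapes, names, and scaling are illustrative assumptions, not onnxruntime's actual kernels):

```python
import numpy as np

def decode_attention(q, K, V):
    """Seq-len-1 attention as GEMM -> softmax -> GEMM.
    q: (1, d) query for the token being decoded,
    K, V: (n, d) cached keys/values. All names are hypothetical."""
    d = q.shape[-1]
    scores = q @ K.T / np.sqrt(d)               # GEMM 1: (1, d) x (d, n) -> (1, n)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V                            # GEMM 2: (1, n) x (n, d) -> (1, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
K = rng.standard_normal((128, 64))
V = rng.standard_normal((128, 64))
out = decode_attention(q, K, V)
print(out.shape)  # (1, 64)
```

With a 1-row query, both matrix products degenerate to GEMV-like shapes, which is why dispatching to plain BLAS calls plus a standalone softmax can beat a fused attention kernel tuned for longer sequences.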
The current client API `ck::tensor_operation::device::DeviceBatchedGemmSoftmaxGemmPermute` only has the 1-D0 (exactly one D0) version built into the client library. The non-bias version has no builtin instances. Please add instances to the...
Regarding the 2024-02-02 blog post, for example: I tried a simple repro, collecting ncu data for numseq 1 and seqlen 16384 on a 4090: ``` void vllm::paged_attention_v2_kernel(float *, float...
depends on

- [x] #20913
- [x] #21028
- [x] #21030
Allow colorizing only one thread in `print_latex` output, to make the MMA pattern obvious and reduce eye strain. For example, ```cpp #include "cute/tensor.hpp" using namespace cute; int main() { auto tiled_mma =...
**Describe the bug** ```cuda #include "cute/tensor.hpp" using namespace cute; __global__ void kernel(int *gmem) { int tid = threadIdx.x; gmem[tid * 4 + 0] = tid * 4 + 0; gmem[tid...
This set of kernels was previously integrated into our fork of vLLM; we are now porting them to onnxruntime as a native OpKernel for potential future development. Feature set: decomposed scheduler and...
1. Clip can be removed iff the codomain of QuantizeLinear remains unchanged. 2. To remain unchanged, y = QuantizeLinear(Clip(x)) must span the full range of values that can be represented by the...
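The condition in point 1 can be checked numerically: if the Clip bounds cover at least the full real-valued range that the quantized type can represent (given the scale and zero point), the Clip is a no-op after quantization; a tighter Clip changes the result. A hedged sketch with an ONNX-style saturating int8 QuantizeLinear (the helper and its bounds are illustrative, not the onnxruntime implementation):

```python
import numpy as np

def quantize_linear(x, scale, zero_point, qmin=-128, qmax=127):
    # ONNX-style QuantizeLinear: round, shift by zero point, saturate to int8.
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

scale, zp = 0.1, 0
x = np.linspace(-20.0, 20.0, 1001)

# Clip bounds covering the full representable range [-12.8, 12.7]:
# QuantizeLinear saturates there anyway, so removing Clip changes nothing.
wide = np.clip(x, -12.8, 12.7)
assert np.array_equal(quantize_linear(wide, scale, zp),
                      quantize_linear(x, scale, zp))

# A tighter Clip (e.g. a ReLU6-like [0, 6]) narrows the codomain of the
# composed op, so removing it WOULD change the quantized output.
tight = np.clip(x, 0.0, 6.0)
assert not np.array_equal(quantize_linear(tight, scale, zp),
                          quantize_linear(x, scale, zp))
```

In other words, the Clip is redundant exactly when its saturation is already performed by the quantizer's own saturation to [qmin, qmax].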