Zhuobin Huang


![geglu](https://user-images.githubusercontent.com/63904514/199904551-46dfdc0d-3eac-4dee-b91a-d566e2020f51.png)

**Naive Implementation Latency Breakdown**

- **Testcase**: "m": 256, "k": 1280, "n": 5120

  | CUDA API | Duration |
  |-|-|
  | `cudaMemcpyAsync` | 14.628 ms |
  | `cudaMemcpyAsync` | 45.305 μs |
  | `Kernel` | 75.333 μs |
  | `FusedGegluForwardGpu` | 13.703 μs |

- **Testcase**: "m": 1024, "k": 640,...
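For reference, a minimal PyTorch sketch of what a naive (unfused) GeGLU forward could look like for the first testcase. The function and variable names are my own illustration, not the actual `FusedGegluForwardGpu` interface, but they show why an unfused version pays for several separate kernel launches plus the host-to-device copies measured above:

```python
import torch
import torch.nn.functional as F

# Hypothetical naive GeGLU forward (not the real FusedGegluForwardGpu API).
# Each line below maps to at least one separate CUDA kernel launch, and the
# .to("cuda") calls correspond to the cudaMemcpyAsync host-to-device copies.
def naive_geglu(x_cpu, w_cpu, b_cpu):
    x = x_cpu.to("cuda")           # H2D copy of the activation
    w = w_cpu.to("cuda")           # H2D copy of the weight
    b = b_cpu.to("cuda")           # H2D copy of the bias
    h = x @ w + b                  # GEMM + bias add, output [m, 2*n]
    a, g = h.chunk(2, dim=-1)      # split into value and gate halves, each [m, n]
    return a * F.gelu(g)           # GELU + elementwise multiply

# Shapes from the first testcase: m=256, k=1280, n=5120
m, k, n = 256, 1280, 5120
x = torch.randn(m, k)
w = torch.randn(k, 2 * n)
b = torch.randn(2 * n)
out = naive_geglu(x, w, b)         # [256, 5120]
```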

**Non-split Fused GELU Op** ![fused_gelu (non-split)](https://user-images.githubusercontent.com/63904514/200749636-bdd93824-0b26-444f-b0e0-3bc627867246.png)

**Split Fused GELU Op** ![fused_gelu (split)](https://user-images.githubusercontent.com/63904514/200749703-f9fb4057-4975-4402-9cb4-fb63347884e5.png)
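My reading of the two variants, as a hedged sketch: "non-split" presumably keeps a single weight of shape [k, 2n] and chunks the projection, while "split" keeps two separate [k, n] weights and runs two GEMMs. The function names below are illustrative, not the op's actual signature:

```python
import torch
import torch.nn.functional as F

def fused_gelu_nonsplit(x, w, b):
    # Non-split layout: one weight [k, 2*n]; a single GEMM produces both the
    # value half and the gate half, which are chunked apart afterwards.
    h = torch.addmm(b, x, w)          # [m, 2*n]
    a, g = h.chunk(2, dim=-1)
    return a * F.gelu(g)              # [m, n]

def fused_gelu_split(x, w_v, b_v, w_g, b_g):
    # Split layout: two weights [k, n]; value and gate come from two GEMMs.
    a = torch.addmm(b_v, x, w_v)      # [m, n]
    g = torch.addmm(b_g, x, w_g)      # [m, n]
    return a * F.gelu(g)              # [m, n]
```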

![image](https://user-images.githubusercontent.com/63904514/203075211-24cb4b5a-0e2d-41ec-9a2f-a9d5e8ed26cd.png)

Got the same issue here, with the official CUDA 12.8 container environment.

I ran into this issue while building flashinfer with PyTorch 2.6 (CUDA 12.6). Solved it by downgrading to PyTorch 2.6 (CUDA 12.4), i.e., you need to align the CUDA version of...
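As a quick sanity check, a small sketch (assuming all you need is to compare the CUDA version PyTorch was built with against the toolkit version reported by `nvcc`):

```python
import subprocess
import torch

# CUDA version this PyTorch wheel was built against, e.g. "12.4"
print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)

# CUDA toolkit version that nvcc (used when compiling flashinfer) reports
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```

If the two versions disagree, aligning them (as described above) is what resolved the build for me.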