aciddelgado issues

Results: 6 issues of aciddelgado

Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator

I’ve discovered a performance gap between the Neural Speed Matmul operator and the Llama.cpp operator in the Neural-Speed repository. This issue was identified while running a benchmark with the ONNXRuntime-GenAI...

enhancement

Is it possible to access the intermediate calculation of q * k multiplication with Flash Attention?

As the title says, is this scenario possible right now or is it on the roadmap?

Allow Memory Efficient Attention Kernel to run when local window size is set

### Description This PR introduces a slight change to the handling of the Local Window Size parameter in the context of Memory Efficient Attention. Previously, setting the Local Window Size...

Fix num splits bug

### Description Found a bug with num splits where the heuristic was not applied correctly because the sequence length was passed incorrectly to the heuristic function. ### Motivation and Context We...

Add Interactive Decoding support in GQA

### Description This PR adds support for Interactive Decoding via the use of a 2-D seqlens_k tensor, which holds the past and total sequence lengths of each sequence in a...

softcap gqa

### Description Implement softcap for GQA. ### Motivation and Context Fixes certain models, such as Gemma-2, which need softcap to work so that they don't output NaNs.
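
For readers unfamiliar with the term, soft-capping of attention scores is commonly defined as `softcap * tanh(score / softcap)`, which squashes each score smoothly into the open interval (-softcap, softcap) before the softmax. The following is a minimal plain-Python sketch of that formula, not the kernel from the PR; the function names, the convention that a non-positive `softcap` disables capping, and the value 50.0 used in the usage note are assumptions made for illustration.

```python
import math


def softcap_scores(scores, softcap):
    """Apply tanh soft-capping to a list of attention scores.

    Assumption for this sketch: a non-positive softcap means "disabled".
    """
    if softcap <= 0.0:
        return list(scores)
    # tanh is bounded by 1 in magnitude, so results stay within +/- softcap.
    return [softcap * math.tanh(s / softcap) for s in scores]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def capped_attention_weights(scores, softcap):
    """Soft-cap the raw q*k scores, then normalize them with softmax."""
    return softmax(softcap_scores(scores, softcap))
```

Calling `capped_attention_weights([0.5, 1000.0, -1000.0], 50.0)` caps the two extreme scores at magnitude 50 before normalization, while the small score 0.5 is left almost unchanged because tanh is close to the identity near zero.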