cache enc kv proj for cross-attention

Open tingqli opened this issue 1 month ago • 1 comment

The K/V projection in cross-attention is recomputed at every decoding step, which is redundant because encoder_outputs does not change during the whole decoding phase. This PR adds a simple caching mechanism to cross-attention to avoid the recomputation. In my test case (batch-size=32, beam-size=3, audio-length=20s), the end-to-end latency dropped from 20 seconds to 6.1 seconds on H20.
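For readers who want to see the idea in isolation, here is a minimal sketch of the caching pattern described above. It is not the code from this PR: the class name `CachedCrossAttention`, the `cache` dict, and the `kv_projection_calls` counter are hypothetical names chosen for illustration, and the example uses single-head attention on plain Python lists rather than batched tensors.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


class CachedCrossAttention:
    """Single-head cross-attention that can reuse K/V across decoding steps."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.kv_projection_calls = 0  # instrumentation for this sketch only

    def _project_kv(self, encoder_outputs):
        self.kv_projection_calls += 1
        return matmul(encoder_outputs, self.w_k), matmul(encoder_outputs, self.w_v)

    def forward(self, query, encoder_outputs, cache=None):
        # query: [1 x d] decoder state for the current step.
        # encoder_outputs: [T x d], unchanged for the whole decoding phase.
        if cache is not None and "k" in cache:
            k, v = cache["k"], cache["v"]  # reuse the earlier projection
        else:
            k, v = self._project_kv(encoder_outputs)
            if cache is not None:
                cache["k"], cache["v"] = k, v  # store for later steps
        q = matmul(query, self.w_q)
        scale = 1.0 / math.sqrt(len(q[0]))
        scores = [sum(a * b for a, b in zip(q[0], k_row)) * scale for k_row in k]
        weights = softmax(scores)
        return matmul([weights], v)  # [1 x d] context vector


# Three decoding steps share one cache, so K/V are projected only once.
identity = [[1.0, 0.0], [0.0, 1.0]]
attn = CachedCrossAttention(identity, identity, identity)
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = {}
for step_query in ([[1.0, 0.0]], [[0.0, 1.0]], [[0.5, 0.5]]):
    out = attn.forward(step_query, encoder_outputs, cache)
print(attn.kv_projection_calls)  # prints 1
```

The cache must be created fresh for each new utterance (or batch), since it is only valid while encoder_outputs stays the same. Calling `forward` without a cache falls back to recomputing the projection on every step.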

Nov 15 '25 09:11 tingqli

Thanks for your PR, we will review.

Nov 24 '25 05:11 kaituoxu