FireRedASR
cache enc kv proj for cross-attention
The K/V projection in cross-attention is recomputed at every decoding step, which is redundant because encoder_outputs doesn't change during the whole decoding phase. This PR adds a simple caching mechanism to cross-attention to avoid the recomputation. In my test case (batch-size=32, beam-size=3, audio-length=20s), the e2e latency dropped from 20 seconds to 6.1 seconds on an H20.
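For reviewers, here is a minimal sketch of the idea in plain Python. It is not the code in this PR: the class `CachedCrossAttention`, its `reset_cache` method, the `kv_projections` counter, and `decode_demo` are hypothetical names made up for illustration, and the sketch assumes a single head, a single query vector per step, and no batching or beams.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CachedCrossAttention:
    """Single-head cross-attention that projects encoder outputs only once.

    Hypothetical illustration; not the FireRedASR module.
    """

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.cached_kv = None      # (K, V) for the utterance being decoded
        self.kv_projections = 0    # how many times K/V were actually computed

    def reset_cache(self):
        """Call before decoding a new set of encoder outputs."""
        self.cached_kv = None

    def forward(self, query, encoder_outputs):
        # query: [d_model] decoder state for the current step
        # encoder_outputs: [T, d_model], constant across decoding steps
        if self.cached_kv is None:
            k = matmul(encoder_outputs, self.w_k)
            v = matmul(encoder_outputs, self.w_v)
            self.cached_kv = (k, v)
            self.kv_projections += 1
        k, v = self.cached_kv
        q = matmul([query], self.w_q)[0]
        scale = math.sqrt(len(q))
        weights = softmax([sum(a * b for a, b in zip(q, row)) / scale for row in k])
        # Weighted sum of the value rows.
        return [sum(w * row[j] for w, row in zip(weights, v)) for j in range(len(v[0]))]


def decode_demo(num_steps=4):
    """Run several decoding steps and report how often K/V were projected."""
    identity = [[1.0, 0.0], [0.0, 1.0]]
    attn = CachedCrossAttention(identity, identity, identity)
    encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    outputs = [attn.forward([0.5, -0.5], encoder_outputs) for _ in range(num_steps)]
    return attn.kv_projections, outputs
```

The cache has to be cleared whenever a new batch of encoder outputs arrives, otherwise stale keys and values would be reused. With beam search, the cached tensors would presumably also need to match the expanded batch-times-beam layout, which this sketch does not cover.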
Thanks for your PR, we will review.