
Support Review of ConcatKvCache (#3143) and Plan for Future Adoption

Open DrJesseGlass opened this issue 3 weeks ago • 8 comments

Summary

PR #3143 introduces ConcatKvCache, which provides a 2-5x GPU speedup for autoregressive generation with no breaking changes to the API. However, it has not yet been reviewed.

This issue aims to:

  1. Raise visibility of #3143 for maintainer review
  2. Document the systemic need for this improvement across transformers
  3. Propose a path forward once #3143 is merged

Why This Matters Now

Multiple transformer implementations (including the new SmolLM3 in #3180) currently use workarounds that ConcatKvCache would eliminate:

// Current workaround in Llama, Mistral, Phi, Qwen, SmolLM3, etc:
let (k, v) = self.kv_cache.append(&k.contiguous()?, &v.contiguous()?)?;

// With ConcatKvCache - clean and faster:
let (k, v) = self.kv_cache.append(&k, &v)?;

This pattern appears across the codebase, indicating a systemic issue that ConcatKvCache would solve.
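For concreteness, here is a minimal sketch of how the cache sits inside an attention layer, assuming the ConcatKvCache API described in #3143 (new(dim) and append(&k, &v)); the module path and struct layout below are hypothetical:

use candle_core::{Result, Tensor};
use candle_nn::kv_cache::ConcatKvCache; // module path assumed from PR #3143

struct Attention {
    kv_cache: ConcatKvCache, // concatenates along dim 2, the sequence axis
    // ... projection layers elided ...
}

impl Attention {
    fn append_kv(&mut self, k: &Tensor, v: &Tensor) -> Result<(Tensor, Tensor)> {
        // k and v are (batch, heads, seq, head_dim) views straight from
        // RoPE/transpose; no .contiguous() copy is needed before caching.
        self.kv_cache.append(k, v)
    }
}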

Background: PR #3143

What It Provides

  • 2-5x GPU speedup for autoregressive generation
  • No breaking changes - API-compatible with existing KvCache
  • Equal or better performance on all backends (GPU, CPU, WASM)
  • Cleaner code - eliminates .contiguous() workarounds

Current Status

  • Awaiting review - not yet merged
  • Implementation complete
  • Benchmarks show clear benefits
  • Needs maintainer attention

The Broader Problem

Current State

Every transformer implementation manually works around the same KV-cache limitation: the legacy KvCache requires contiguous inputs, so call sites insert explicit .contiguous() copies.

Current pattern:

let kv_cache = KvCache::new(2, 512);
// ...
let (k, v) = self.kv_cache.append(&k.contiguous()?, &v.contiguous()?)?;

With ConcatKvCache:

- let kv_cache = KvCache::new(2, 512);
+ let kv_cache = ConcatKvCache::new(2);  // Simpler API - no max_seq_len
  // ...
- let (k, v) = self.kv_cache.append(&k.contiguous()?, &v.contiguous()?)?;
+ let (k, v) = self.kv_cache.append(&k, &v)?;  // No contiguous() needed

Additional benefits:

  • Simpler constructor (no max sequence length parameter)
  • Built-in reset() method, useful for generation applications (see the sketch after this list)
  • 2-5x faster on GPU
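A minimal sketch of that reset() pattern (the method name comes from the PR description; the surrounding glue is hypothetical):

use candle_core::Result;
use candle_nn::kv_cache::ConcatKvCache; // module path assumed from PR #3143

// Reuse one cache across independent generations instead of rebuilding it.
fn generate_batch(cache: &mut ConcatKvCache, prompts: &[String]) -> Result<()> {
    for _prompt in prompts {
        cache.reset(); // drop cached K/V so the next sequence starts clean
        // ... prefill, then decode token by token, each step calling
        // cache.append(&k, &v) with the new key/value slices ...
    }
    Ok(())
}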

Models already adopting it in pending work:

  • Qwen family (qwen3.rs)
  • SmolLM3 (pending in #3180)

Models that would benefit:

  • Llama family (llama.rs)
  • Mistral family (mistral.rs)
  • Phi family (phi.rs, phi3.rs)
  • Gemma family (gemma.rs)
  • All other transformers using KV-cache

The recurrence of this pattern across the codebase suggests the problem lies in the underlying abstraction, not the individual call sites.

Performance Impact

Users running transformers on GPU are leaving a 2-5x speedup on the table, simply because the models still use the legacy KvCache implementation.

Proposal: Two-Phase Approach

Phase 1: Get #3143 Reviewed and Merged (Immediate)

  1. Request maintainer review of #3143
  2. Validate benchmarks across different models (a micro-benchmark sketch follows this list)
  3. Test integration with existing transformers
  4. Merge once validated
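For step 2, a sketch of the kind of micro-benchmark that could confirm the numbers on a given backend (the legacy-cache calls below use the existing candle_nn API; swap in ConcatKvCache, minus the .contiguous() calls, for the comparison run):

use std::time::Instant;
use candle_core::{DType, Device, Result, Tensor};
use candle_nn::kv_cache::KvCache;

// Time 1024 single-token appends through the legacy cache; repeat with
// ConcatKvCache (same loop, no .contiguous()) and compare the two timings.
fn bench_legacy_appends(device: &Device) -> Result<()> {
    let mut cache = KvCache::new(2, 4096); // dim 2 = sequence axis
    let k = Tensor::zeros((1, 32, 1, 128), DType::F32, device)?;
    let v = k.clone();
    let start = Instant::now();
    for _ in 0..1024 {
        let _ = cache.append(&k.contiguous()?, &v.contiguous()?)?;
    }
    device.synchronize()?; // flush queued GPU work before reading the clock
    println!("legacy KvCache: {:?}", start.elapsed());
    Ok(())
}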

Phase 2: Discuss Adoption Strategy (After #3143 Merges)

Once #3143 is merged and proven stable, we should discuss how to adopt it across the codebase:

Questions for Discussion:

  1. Migration strategy - Should we batch-update all models or migrate incrementally?
  2. Deprecation path - How do we handle the API change (constructor signature)?
  3. Timeline - What's a reasonable timeline for migration?
  4. Compatibility - Do we need any compatibility layer, or is a clean break acceptable?

Note: ConcatKvCache has a different API:

// Old API: the first argument is the concat dimension (the sequence axis)
let kv_cache = KvCache::new(dim, max_seq_len);

// New API (simpler - no max_seq_len parameter)
let kv_cache = ConcatKvCache::new(dim);

This isn't a drop-in replacement, so migration will require updating each transformer's initialization code. However, the benefits seem worth it:

  • 2-5x GPU speedup
  • Simpler API (no max_seq_len)
  • Built-in reset() method (useful for generation)
  • No .contiguous() workarounds needed
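On question 4, if a compatibility layer turns out to be worth having, one hypothetical shape for it is a thin wrapper that keeps the legacy two-argument constructor (the names below are illustrative, not from #3143):

use candle_core::{Result, Tensor};
use candle_nn::kv_cache::ConcatKvCache; // module path assumed from PR #3143

// Hypothetical shim: legacy constructor signature, ConcatKvCache underneath.
pub struct CompatKvCache(ConcatKvCache);

impl CompatKvCache {
    pub fn new(dim: usize, _max_seq_len: usize) -> Self {
        // max_seq_len is ignored: ConcatKvCache grows by concatenation.
        Self(ConcatKvCache::new(dim))
    }

    pub fn append(&mut self, k: &Tensor, v: &Tensor) -> Result<(Tensor, Tensor)> {
        self.0.append(k, v)
    }

    pub fn reset(&mut self) {
        self.0.reset()
    }
}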

I'm happy to help test integration approaches and contribute to the migration effort once we have consensus on the strategy.

Why This Needs Attention

User Impact

  • Users are getting suboptimal performance (2-5x slower on GPU)
  • No indication that performance could be better
  • Workaround is invisible to users but affects everyone

Developer Impact

  • Code duplication across models
  • The same workaround is copy-pasted into every model, violating the DRY principle

Request for Action

Short Term

Can a maintainer review PR #3143?

  • The implementation looks solid
  • Would benefit multiple existing and new PRs

Medium Term (After Merge)

Make ConcatKvCache the default

  • Via a type alias for compatibility (sketched after this list)
  • Update documentation
  • Plan migration for existing models
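As a concrete (hypothetical) sketch of the alias idea, with the caveat that the differing constructor signature means an alias alone is not a drop-in fix for existing call sites:

// Hypothetical; not from #3143. New code picks up ConcatKvCache under the
// familiar name, while existing KvCache::new(dim, max_seq_len) call sites
// still need a one-line migration or the shim sketched earlier.
pub type KvCache = ConcatKvCache;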

Clean up workarounds

  • Update transformers systematically
  • Remove .contiguous() calls
  • Improve code quality

Long Term

  • ConcatKvCache becomes standard approach
  • New transformers get optimal performance by default
  • Cleaner, more maintainable codebase

The main work is systematic cleanup across transformer implementations, which can be done incrementally.

DrJesseGlass · Nov 12 '25