mlx-swift-examples
Dynamic maybeQuantizeKVCache causes cache COW in TokenIterator.
While working on the prompt cache together with the quantized KV cache, I noticed that maybeQuantizeKVCache converts the SimpleKVCache into a QuantizedKVCache. However, the cache reference the caller passed to TokenIterator still remains a SimpleKVCache. It seems this causes a COW situation where TokenIterator starts maintaining its own cache instead of sharing the passed-in one. As a result, external code can't access the updated KV cache from TokenIterator.
Are you testing this with https://github.com/ml-explore/mlx-swift-examples/pull/338?
I am testing the Qwen3 model, which has KV cache quantization support. At the moment, I have to set quantizedKVStart = Int.max to disable maybeQuantizeKVCache and do KV quantization outside of the generation step. Otherwise, the caller keeps seeing the SimpleKVCache instead of the QuantizedKVCache.
https://github.com/mzbac/swift-mlx-server/blob/main/Sources/chatCompletionsRoute.swift#L235-L258
Before submitting my PR for review, I tested it and noticed a performance difference with the quantized KV cache, so I believe it was working for me. I don't have time to debug this, but you can do so by adding print statements throughout to see where it's going wrong.
Yeah, I did debug through the code. It does look like a COW: the cache array passed to the TokenIterator is not getting updated, unless I missed something.
And this wouldn't be an issue during the TokenIterator's own lifecycle, since its copy of the cache does get updated and uses the quantized KV cache.
I can't make sense of what you wrote. Perhaps you can link to a fork with debug statements and comments pointing to exactly what you're seeing.
I think I see the issue. If the caller is passing in a KVCache to the TokenIterator (e.g. through generate() or ChatSession), the values get passed in as [KVCache].
TokenIterator then may apply quantization:
mutating func step(previous: LMInput.Text) -> MLXArray {
    let result = model(
        previous[text: .newAxis], cache: cache.isEmpty ? nil : cache, state: state)
    self.state = result.state

    // Apply dynamic cache quantization after each step
    maybeQuantizeKVCache(
        cache: &cache,
        kvBits: kvBits,
        kvGroupSize: kvGroupSize,
        quantizedKVStart: quantizedKVStart
    )
    // ... (rest of step() elided)
and this mutates the [KVCache], but because this is a value type (Array) holding reference types (KVCache), replacing elements isn't visible to the caller.
This means that chat sessions won't maintain their history for models that use this.
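To make the aliasing concrete, here is a minimal, self-contained Swift sketch of those semantics (the types are stand-ins, not the actual MLXLMCommon KVCache / SimpleKVCache / QuantizedKVCache / TokenIterator):

protocol Cache { var label: String { get } }
final class SimpleCache: Cache { let label = "simple" }
final class QuantizedCache: Cache { let label = "quantized" }

struct Iterator {
    // The iterator holds its own copy of the array (value semantics).
    var caches: [Cache]

    mutating func step() {
        // Replacing elements mutates only the iterator's copy (copy-on-write);
        // the caller's array keeps pointing at the original SimpleCache objects.
        caches = caches.map { _ in QuantizedCache() }
    }
}

let callerCaches: [Cache] = [SimpleCache()]
var iterator = Iterator(caches: callerCaches)
iterator.step()

print(iterator.caches[0].label)  // "quantized"
print(callerCaches[0].label)     // still "simple" -- the replacement is not visible here

Note that mutating the existing cache objects through the shared references (e.g. appending keys/values to a SimpleKVCache) is visible to the caller; it is only the replacement of an array element with a new object that is not.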
We need one of these solutions:
- don't pass [KVCache], rather pass KVCacheContainer, which is a reference type and has-a [KVCache]
- a copy-in / copy-out scheme for the cache parameter
I think I prefer the former, as it is clearer what is going on. The latter looks like value semantics, but only just barely: it is really the container holding the KVCache that is a value type.
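For illustration, a rough sketch of what such a container could look like (the name KVCacheContainer comes from the bullet above; the exact shape is an assumption, not an actual implementation):

import MLXLMCommon

// Hypothetical sketch of the proposed reference-type container.
// Because the container is a class, swapping a SimpleKVCache for a
// QuantizedKVCache inside it is visible to everyone holding the container,
// including the caller and the TokenIterator.
final class KVCacheContainer {
    var caches: [KVCache]

    init(caches: [KVCache] = []) {
        self.caches = caches
    }

    subscript(index: Int) -> KVCache {
        get { caches[index] }
        set { caches[index] = newValue }
    }
}

TokenIterator would then hold the container and maybeQuantizeKVCache would replace entries inside it, so the quantized caches remain visible to the caller.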
Thanks @davidkoski. That's the core issue. Sorry I wasn't able to explain it more clearly; I was busy looking at Flux1 context stuff yesterday. I also prefer using the container instead of the value type; I'm not sure why it was made a value type in the first place.