mlx-swift-examples icon indicating copy to clipboard operation
mlx-swift-examples copied to clipboard

Dynamic maybeQuantizeKVCache causes cache COW in TokenIterator.

Open mzbac opened this issue 5 months ago • 7 comments

When I work on the prompt cache with the KV quant cache, I've noticed maybeQuantizeKVCache converts the SimpleKVCache to the quantized KV cache. However, the cache reference passed to TokenIterator still remains as SimpleKVCache. It seems this causes a COW situation where TokenIterator starts maintaining its own cache instead of using the reference from the passed-in cache. This cause an issue where external code can't access the updated KV cache from TokenIterator.

mzbac avatar Jun 27 '25 11:06 mzbac

Are you testing this with https://github.com/ml-explore/mlx-swift-examples/pull/338?

DePasqualeOrg avatar Jun 27 '25 12:06 DePasqualeOrg

I am testing the Qwen3 model, which has KV cache quantization support. At the moment, I have to set quantizedKVStart = Int.max to disable maybeQuantizeKVCache and do KV quantization outside of the generation step. Otherwise, it keeps getting the SimpleKVCache instead of the QuantizedKVCache.

https://github.com/mzbac/swift-mlx-server/blob/main/Sources/chatCompletionsRoute.swift#L235-L258

mzbac avatar Jun 27 '25 12:06 mzbac

Before submitting my PR for review, I tested it and noticed a performance difference with the quantized KV cache, so I believe it was working for me. I don't have time to debug this, but you can do so by adding print statements throughout to see where it's going wrong.

DePasqualeOrg avatar Jun 27 '25 12:06 DePasqualeOrg

Yeah, I did debug through the code, it does look like causing COW the cache array pass to token iterator is not getting updated, unless I missed something.

And this wouldn't be issue during the token iterator lifecycle. Since its copy of cache get updated and using quantized kv cache

mzbac avatar Jun 27 '25 12:06 mzbac

I can't make sense of what you wrote. Perhaps you can link to a fork with debug statements and comments pointing to exactly what you're seeing.

DePasqualeOrg avatar Jun 27 '25 12:06 DePasqualeOrg

I think I see the issue. If the caller is passing in a KVCache to the TokenIterator (e.g. through generate() or ChatSession), the values get passed in as [KVCache].

TokenIterator then may apply quantization:

    mutating func step(previous: LMInput.Text) -> MLXArray {
        let result = model(
            previous[text: .newAxis], cache: cache.isEmpty ? nil : cache, state: state)
        self.state = result.state

        // Apply dynamic cache quantization after each step
        maybeQuantizeKVCache(
            cache: &cache,
            kvBits: kvBits,
            kvGroupSize: kvGroupSize,
            quantizedKVStart: quantizedKVStart
        )

and this mutates th [KVCache], but because this is a value type (Array) holding reference types (KVCache) it isn't visible to the caller.

This means that chat sessions won't maintain their history for models that use this.

We need one of these solutions:

  • don't pass [KVCache], rather pass KVCacheContainer which is a reference type and has-a [KVCache]
  • a copy-in / copy-out scheme for the cache parameter

I think I prefer the former as it is more clear what is going on. The latter looks like it is value types but it is only just barely. It is really the container holding the KVCache that is a value type.

davidkoski avatar Jun 27 '25 15:06 davidkoski

Thanks @davidkoski. That's the core issue. Sorry wasn't able to explain it more clearly, was busy looking at Flux1 context stuff yesterday. I also prefer using the container instead of the value type, not sure why it was made as a value type in the first place.

mzbac avatar Jun 28 '25 14:06 mzbac