
fix kv cache issue with quantized_phi3 implementation

Open ljt019 opened this issue 7 months ago • 0 comments

The current implementation of the quantized_phi3 model does not clear its kv cache between distinct prompts. This leads to errors when attempting to generate text sequentially with the same model instance.

If you try to prompt the model more than once, you get a shape error like this:

cannot broadcast [12, 12] to [1, 40, 12, 124]

I double-checked quantized gemma3 to confirm that the cache is normally cleared between prompts, and it is. The fix is just a few lines of code: call the .reset method on the KvCache when pos is 0. I tested it locally and can now send multiple prompts sequentially to phi3/phi4 without issues :)
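For illustration, here is a minimal, self-contained sketch of the fix pattern described above. The `KvCache` and `Attention` types below are simplified stand-ins, not candle's actual structs (the real cache holds tensors, not a token count); the only detail taken from the issue is the idea of calling `.reset()` on the cache whenever the position is 0, i.e. at the start of a fresh prompt.

```rust
// Simplified stand-in for a key/value cache; the real candle KvCache
// stores key/value tensors, but the reset-at-position-zero logic is
// the same.
struct KvCache {
    // Number of cached entries from previous forward passes.
    cached_tokens: usize,
}

impl KvCache {
    fn new() -> Self {
        Self { cached_tokens: 0 }
    }

    // Drop all cached entries (mirrors the .reset method mentioned
    // in the issue).
    fn reset(&mut self) {
        self.cached_tokens = 0;
    }

    fn append(&mut self, n: usize) {
        self.cached_tokens += n;
    }
}

struct Attention {
    kv_cache: KvCache,
}

impl Attention {
    // `index_pos` is the position of the first token in this forward
    // pass; 0 means a new prompt, so any stale cache from a previous
    // prompt must be cleared before appending.
    fn forward(&mut self, tokens: usize, index_pos: usize) -> usize {
        if index_pos == 0 {
            self.kv_cache.reset();
        }
        self.kv_cache.append(tokens);
        self.kv_cache.cached_tokens
    }
}

fn main() {
    let mut attn = Attention { kv_cache: KvCache::new() };
    // First prompt: 12 tokens, then one generated token at position 12.
    assert_eq!(attn.forward(12, 0), 12);
    assert_eq!(attn.forward(1, 12), 13);
    // Second prompt: without the reset, the new 12 tokens would be
    // appended to the stale 13-entry cache, producing the kind of
    // broadcast/shape mismatch reported above.
    assert_eq!(attn.forward(12, 0), 12);
    println!("ok");
}
```

Without the `index_pos == 0` check, the second prompt's attention would run against key/value entries left over from the first prompt, which is exactly the mismatch the broadcast error reports.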

ljt019 commented May 01 '25 21:05