mlx-examples
RotatingKVCache: Problem when reusing cache between multiple generations
The design of RotatingKVCache assumes that text generation starts with a long prompt, then continues token by token, and then ends.
But chat apps usually work like this:
cache = create_cache();
while (true) {
    new_messages = wait_input_from_user();
    generate(new_messages, model, cache);
}
This runs multiple generations with the same cache. When RotatingKVCache is used in such apps, it returns wrong results after the first generate call.
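To make the failure mode concrete, here is a toy sketch in plain Python. It is not the mlx-examples code, and the real cause inside RotatingKVCache may differ: `ToyRotatingCache`, `restore_order`, and `chat` are hypothetical names, and the cache holds token ids rather than key/value tensors. It only shows how a multi-token update that assumes the buffer is still in temporal order can go wrong once single-token updates have rotated it.

```python
class ToyRotatingCache:
    """Hypothetical toy model of a rotating cache; stores token ids in a list."""

    def __init__(self, max_size, restore_order=False):
        self.max_size = max_size
        self.restore_order = restore_order  # assumed fix: un-rotate before a prompt
        self.buf = []  # physical storage
        self.idx = 0   # slot holding the oldest entry once the buffer is full

    def temporal_order(self):
        # Oldest-to-newest view of the physical buffer.
        return self.buf[self.idx:] + self.buf[:self.idx]

    def update(self, tokens):
        if len(tokens) > 1:
            # Prompt path: concatenate, then keep the newest max_size entries.
            # Without restore_order this treats buf as if it were never rotated.
            past = self.temporal_order() if self.restore_order else self.buf
            self.buf = (past + list(tokens))[-self.max_size:]
            self.idx = 0
        elif len(self.buf) < self.max_size:
            # Decode path while there is still room: append.
            self.buf.append(tokens[0])
        else:
            # Decode path when full: overwrite the oldest slot in place.
            self.buf[self.idx] = tokens[0]
            self.idx = (self.idx + 1) % self.max_size


def chat(cache):
    cache.update([1, 2, 3, 4])  # turn 1: prompt
    cache.update([5])           # turn 1: generated token
    cache.update([6])           # turn 1: generated token
    cache.update([7, 8])        # turn 2: new user message, same cache
    return cache.temporal_order()


print(chat(ToyRotatingCache(4)))                      # [3, 4, 7, 8]
print(chat(ToyRotatingCache(4, restore_order=True)))  # [5, 6, 7, 8]
```

In this toy model the second prompt keeps the stale tokens 3 and 4 and drops the more recent 5 and 6, because the buffer was rotated during the first turn. Restoring temporal order before the concatenation gives the expected window.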
That's a good point... we should probably fix it.