RotatingKVCache: Problem when reusing cache between multiple generations

Open zcbenz opened this issue 1 year ago • 1 comments

The design of RotatingKVCache assumes that text generation starts with a long prompt, then continues token by token, and then ends.

But in chat apps the program usually works like this:

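// Create the cache once and keep it for the whole conversation.
cache = create_cache();
while (true) {
  new_messages = wait_input_from_user();
  // Each turn calls generate again with the same cache.
  generate(new_messages, model, cache);
}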

This runs multiple generations against the same cache, and when RotatingKVCache is used in such apps it returns wrong results after the first generate call.
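To make the expected behaviour concrete, here is a minimal Python sketch of the invariant a reusable cache would need to satisfy: the cached window should depend only on the full token stream, not on how that stream was split across updates. `ToyWindowCache` and `run` are hypothetical names made up for this illustration. This is not the RotatingKVCache implementation, and it stores plain integers rather than key/value tensors.

```python
from collections import deque


class ToyWindowCache:
    """Hypothetical stand-in for a sliding-window cache (not RotatingKVCache).

    It keeps only the most recent `max_size` items, whatever the chunk sizes.
    """

    def __init__(self, max_size):
        # A deque with maxlen drops the oldest items automatically on overflow.
        self.items = deque(maxlen=max_size)

    def update(self, new_items):
        # Accept one or many items per call, at any point in the cache's life.
        self.items.extend(new_items)
        return list(self.items)


def run(chunks, max_size):
    """Feed every chunk into one shared cache and return its final contents."""
    cache = ToyWindowCache(max_size)
    for chunk in chunks:
        cache.update(chunk)
    return list(cache.items)


stream = list(range(10))

# One long prompt processed in a single update.
one_shot = run([stream], max_size=4)

# Chat-like usage: a prompt, single-token steps, then another multi-token
# message and more single-token steps, all on the same cache.
chat_like = run([stream[:5], [5], [6], stream[7:9], [9]], max_size=4)

# Both usage patterns should leave the same window in the cache.
assert one_shot == chat_like
```

A check of this shape, run against the real cache with key/value tensors, could serve as a regression test for reuse across generations.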

Sep 26 '24 08:09 zcbenz

That's a good point... we should probably fix it.

Sep 28 '24 13:09 awni