Generation with KV-cache enabled vs. disabled gives different results
We would expect that the only difference between enabling and disabling the KV-cache for a model during generation is decoding speed; however, in experiments where we comment out the `with device: model.setup_caches()` call in our generate.py recipe, the output is garbage.
Needs more investigation.
You might need to change `incremental_decode` in the generation function?
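For reference, here is a minimal sketch of the two decoding paths as I understand them; the names (`generate`, `setup_caches` arguments, `input_pos`) follow a torchtune-style interface but are placeholders, not the actual generate.py recipe. The point is that the incremental path only makes sense if the caches were actually set up.

```python
import torch

@torch.no_grad()
def generate(model, prompt, max_new_tokens, use_cache=True):
    # prompt: [batch, prompt_len] token ids
    tokens = prompt.clone()
    if use_cache:
        # Allocate the KV-caches (and, in torchtune, the causal mask) up front.
        model.setup_caches(batch_size=tokens.size(0), dtype=torch.float32)

    for _ in range(max_new_tokens):
        if use_cache:
            # Incremental decode: feed only the newest token; earlier
            # keys/values are read back from the cache.
            input_pos = torch.tensor([tokens.size(1) - 1], device=tokens.device)
            logits = model(tokens[:, -1:], input_pos=input_pos)
        else:
            # No cache: the full sequence must be re-fed every step.
            # Keeping the incremental path while commenting out
            # setup_caches() gives the model a one-token context,
            # which would explain the garbage output.
            logits = model(tokens)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens
```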
@joecummings ~~I'm guessing this is because the causal mask is created in `setup_caches()` here, so without calling this function we're attending to all tokens, resulting in garbage outputs. Maybe we should move this mask initialization into `__init__`?~~
Never mind, this line already takes care of the causal mask if it's missing.
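For anyone following along, my understanding of that fallback is roughly the following; this is an illustrative sketch using PyTorch's `scaled_dot_product_attention`, not the exact attention code in the repo:

```python
import torch.nn.functional as F

def attend(q, k, v, mask=None):
    if mask is None:
        # No mask was registered by setup_caches(), so fall back to an
        # implicit causal mask instead of attending to all tokens.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # During incremental decoding, the precomputed mask row for the
    # current position is passed in explicitly.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```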