
Generate with KV-cache enabled vs. not enabled gives different results

Open joecummings opened this issue 1 year ago • 2 comments

We would expect that the only difference between enabling a KV cache for a model during generation and not enabling it is the speed of decoding; however, in experiments where we comment out the `with device: model.setup_caches()` call in our generate.py recipe, the output is garbage.

Needs more investigation.
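As a starting point for that investigation, the expected equivalence can be checked in isolation, outside the recipe. The sketch below (an assumption about the setup, not torchtune's actual attention code) compares a full-sequence causal attention pass against incremental decoding with a manually maintained key/value cache; with correct masking the two should match to numerical tolerance:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, dim = 6, 8
# (batch, heads, seq, head_dim) — toy projections standing in for a real attention layer
q = torch.randn(1, 1, seq_len, dim)
k = torch.randn(1, 1, seq_len, dim)
v = torch.randn(1, 1, seq_len, dim)

# Full-sequence pass with a causal mask, as in non-cached generation.
full = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Incremental decoding: at step t, the new query attends only to the
# cached keys/values for positions 0..t. No mask is needed because the
# "cache" never contains future positions.
steps = []
for t in range(seq_len):
    out = F.scaled_dot_product_attention(
        q[:, :, t : t + 1], k[:, :, : t + 1], v[:, :, : t + 1]
    )
    steps.append(out)
incremental = torch.cat(steps, dim=2)

print(torch.allclose(full, incremental, atol=1e-6))  # True
```

If a model fails this kind of parity check, the masking (or cache position bookkeeping) in one of the two paths is the likely culprit.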

joecummings avatar May 10 '24 19:05 joecummings

You might need to change `incremental_decode` in the generation function?

rohan-varma avatar May 10 '24 21:05 rohan-varma

@joecummings ~~I'm guessing this is because the causal mask is created in setup_caches() here, so without calling this function we're attending to all tokens, resulting in garbage outputs. Maybe we should move this mask initialization into __init__?~~

Nevermind, this line takes care of the causal mask if it's missing.
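Even though the hypothesis was retracted, it illustrates the failure mode worth ruling out: if no causal mask is applied in the non-cached path, every position attends to future tokens and outputs diverge. A minimal sketch (using `scaled_dot_product_attention` directly, not torchtune's layers):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q = torch.randn(1, 1, 5, 8)
k = torch.randn(1, 1, 5, 8)
v = torch.randn(1, 1, 5, 8)

causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
unmasked = F.scaled_dot_product_attention(q, k, v)  # attends to future tokens too

# Only the final position sees the same context either way; every earlier
# position's output is corrupted by attention to future tokens.
print(torch.allclose(causal[:, :, -1], unmasked[:, :, -1]))  # True
print(torch.allclose(causal, unmasked))  # False
```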

calvinpelletier avatar May 15 '24 19:05 calvinpelletier