Generation with KV-cache enabled vs. disabled gives different results
We would expect that the only difference between enabling and disabling the KV-cache for a model during generation is decoding speed; however, in experiments where we comment out the `with device: model.setup_caches()` call in our generate.py recipe, the output is garbage.
Needs more investigation.
You might need to change `incremental_decode` in the generation function?
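For reference, here is a minimal sketch of the two decoding paths as I understand them; the names (`generate`, `setup_caches` arguments, `input_pos`) follow a torchtune-style interface but are placeholders, not the actual generate.py recipe. The point is that the incremental path only makes sense if the caches were actually set up.

```python
import torch

@torch.no_grad()
def generate(model, prompt, max_new_tokens, use_cache=True):
    # prompt: [batch, prompt_len] token ids
    tokens = prompt.clone()
    if use_cache:
        # Allocate the KV-caches (and, in torchtune, the causal mask) up front.
        model.setup_caches(batch_size=tokens.size(0), dtype=torch.float32)

    for _ in range(max_new_tokens):
        if use_cache:
            # Incremental decode: feed only the newest token; earlier
            # keys/values are read back from the cache.
            input_pos = torch.tensor([tokens.size(1) - 1], device=tokens.device)
            logits = model(tokens[:, -1:], input_pos=input_pos)
        else:
            # No cache: the full sequence must be re-fed every step.
            # Keeping the incremental path while commenting out
            # setup_caches() gives the model a one-token context,
            # which would explain the garbage output.
            logits = model(tokens)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
    return tokens
```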
@joecummings ~~I'm guessing this is because the causal mask is created in `setup_caches()` here, so without calling this function we're attending to all tokens, resulting in garbage outputs. Maybe we should move this mask initialization into `__init__`?~~
Never mind, this line already takes care of the causal mask if it's missing.
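For anyone following along, my understanding of that fallback is roughly the following; this is an illustrative sketch using PyTorch's `scaled_dot_product_attention`, not the exact attention code in the repo:

```python
import torch.nn.functional as F

def attend(q, k, v, mask=None):
    if mask is None:
        # No mask was registered by setup_caches(), so fall back to an
        # implicit causal mask instead of attending to all tokens.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # During incremental decoding, the precomputed mask row for the
    # current position is passed in explicitly.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```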