jq-wei
In particular, after prefilling there is an attention loop over seq_len - self.max_capacity_prompt + 1 tokens. What is this loop for? After it finishes, decoding starts, but it seems to use the full KV cache.
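To make the question concrete, here is a minimal sketch of the token count being asked about. This is an assumption about the flow, not the repository's actual code: if the bulk prefill covers the prompt up to the cache budget, the remaining tokens are attended one at a time, and `prompt_loop_length` (a hypothetical helper) gives how many single-token steps that loop runs before decoding begins.

```python
def prompt_loop_length(seq_len: int, max_capacity_prompt: int) -> int:
    """Number of per-token attention steps after the bulk prefill,
    assuming the loop covers seq_len - max_capacity_prompt + 1 tokens
    as described in the question."""
    if seq_len <= max_capacity_prompt:
        # The whole prompt fits within the cache budget: no extra loop.
        return 0
    return seq_len - max_capacity_prompt + 1


# Example: a 4096-token prompt with a 1024-entry cache budget would
# run the loop for 4096 - 1024 + 1 = 3073 single-token steps.
print(prompt_loop_length(4096, 1024))
```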