`Position Emb` and `Chunk size`
Great job! I found two problems when trying to reproduce the paper's results.
- The paper explains that the same positional embedding is used for all context memory units. But in the code implementation there seems to be no positional embedding applied to the cached Ks at all? (See the first sketch at the end of this post for what I expected.)
- Why is `chunk size` needed? The proposed method does the attention block by block, which (I think) shouldn't cause OOM errors even without the chunking trick in decoding (see the second sketch below). But I found it fails to process 100K text without setting `chunk size`, while using `flash attn` is totally fine in such circumstances.
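
To make the first point concrete, here is a minimal sketch of what I understood from the paper: every key inside a retrieved memory unit would share one position id before rotary embedding. The `apply_rope` helper, the shared position of 0, and the tensor sizes are my own assumptions for illustration, not the repo's code.

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, pos, dim, base=10000.0):
    # x: (seq, dim); pos: (seq,) integer position ids
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos = torch.cat((angles.cos(), angles.cos()), dim=-1)
    sin = torch.cat((angles.sin(), angles.sin()), dim=-1)
    return x * cos + rotate_half(x) * sin

# cached keys of one retrieved memory unit: 128 tokens, head_dim 64 (made-up sizes)
unit_keys = torch.randn(128, 64)

# what I expected from the paper: one shared position id for the whole unit
shared_pos = torch.zeros(128, dtype=torch.long)
keys_expected = apply_rope(unit_keys, shared_pos, dim=64)

# what the code seems to do: cached Ks are used as-is, with no RoPE applied
keys_in_code = unit_keys
```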
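
And for the second point, here is a toy example of why I would expect block-by-block attention to stay memory-bounded even without `flash attn`: with a standard online-softmax accumulation, the full score matrix over the ~100K cached tokens is never materialized, only one block at a time. The `blockwise_attention` function and the block size of 4096 are placeholders of mine, not InfLLM's actual decoding path.

```python
import torch

def blockwise_attention(q, k, v, block=4096):
    # q: (q_len, d); k, v: (kv_len, d)
    scale = q.shape[-1] ** -0.5
    m = torch.full((q.shape[0], 1), float("-inf"))   # running max of logits
    l = torch.zeros(q.shape[0], 1)                   # running softmax denominator
    out = torch.zeros_like(q)
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                       # (q_len, block): the only big buffer
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        alpha = (m - m_new).exp()                    # rescale previous accumulators
        p = (s - m_new).exp()
        l = l * alpha + p.sum(dim=-1, keepdim=True)
        out = out * alpha + p @ vb
        m = m_new
    return out / l

q = torch.randn(1, 64)                # one decoding step
k = torch.randn(100_000, 64)          # ~100K cached tokens
v = torch.randn(100_000, 64)
o = blockwise_attention(q, k, v)      # peak memory ~ q_len * block, not q_len * 100K
```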