
[WIP] Rolling KV cache for autoregressive generation

Open · zucchini-nlp opened this pull request 1 month ago · 1 comment

What does this PR do?

Fixes https://github.com/huggingface/diffusers/issues/12600

Functionality-wise, the self-attention cache seems to work correctly; cross-attention caching still has to be added and verified. I added Krea to test the cache, though I am not yet getting the same output as the original model. From quick debugging, it looks related to the timesteps or RoPE embeddings. Opening a draft as a reminder to myself to give this feature higher priority.
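
For context, below is a minimal sketch of what a rolling (sliding-window) KV cache for self-attention could look like. The class name, `update` signature, and tensor layout are illustrative assumptions and do not reflect the actual implementation in this PR or the diffusers API.

```python
# Minimal, illustrative sketch of a rolling (sliding-window) KV cache for
# self-attention. Names and signatures are hypothetical, not the diffusers API.
import torch


class RollingKVCache:
    """Keeps at most `max_cache_len` tokens of key/value states per layer,
    discarding the oldest entries once the window is full."""

    def __init__(self, max_cache_len: int):
        self.max_cache_len = max_cache_len
        self.key_cache: list[torch.Tensor] = []    # one tensor per layer
        self.value_cache: list[torch.Tensor] = []

    def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx: int):
        # key_states / value_states: [batch, num_heads, new_tokens, head_dim]
        if layer_idx >= len(self.key_cache):
            # First call for this layer: initialize the buffers.
            self.key_cache.append(key_states)
            self.value_cache.append(value_states)
        else:
            # Append the new tokens along the sequence dimension.
            self.key_cache[layer_idx] = torch.cat([self.key_cache[layer_idx], key_states], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value_states], dim=-2)

        # Roll the window: keep only the most recent `max_cache_len` tokens.
        self.key_cache[layer_idx] = self.key_cache[layer_idx][..., -self.max_cache_len:, :]
        self.value_cache[layer_idx] = self.value_cache[layer_idx][..., -self.max_cache_len:, :]
        return self.key_cache[layer_idx], self.value_cache[layer_idx]


# Example usage with made-up shapes:
cache = RollingKVCache(max_cache_len=512)
k = torch.randn(1, 8, 16, 64)  # batch, heads, new_tokens, head_dim
v = torch.randn(1, 8, 16, 64)
keys, values = cache.update(k, v, layer_idx=0)
```

Cross-attention would likely need separate handling: its key/value states come from the conditioning inputs and are typically computed once and reused across steps, so they can be cached whole rather than rolled.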

zucchini-nlp · Dec 02 '25 10:12

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.