
[Feature Request] Recurrent Depth Latent Reasoning

Open bitnom opened this issue 7 months ago • 0 comments

Recurrent-depth latent reasoning potentially has significant implications for the scaling performance of distributed inference, and possibly greater implications for distributing inference than a naive implementation would (an initial thought/guess; citation needed). Transformers already supports the model, with one caveat:

> The model requires its own KV-cache implementation `HuginnDynamicCache`, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.

but I have no idea whether this custom cache implies trade-offs or leaves potential unrealized.
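For context, a minimal sketch of what the Transformers path looks like, based on the huginn-0125 model card. The `num_steps` argument that sets the recurrence depth at inference time is taken from that card and should be treated as an assumption rather than a stable API:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code pulls in the model's own code, including HuginnDynamicCache,
# which tracks KV entries per recurrence step so later passes through the
# recurrent block don't overwrite earlier ones.
model = AutoModelForCausalLM.from_pretrained(
    "tomg-group-umd/huginn-0125",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")

input_ids = tokenizer("The capital of Westphalia is", return_tensors="pt").input_ids

# num_steps = recurrence depth; more steps = more latent "thinking" per token.
# (Argument name per the model card; an assumption, not a guaranteed interface.)
output = model.generate(input_ids, max_new_tokens=32, num_steps=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```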

Having recently read https://github.com/bigscience-workshop/petals/issues/483 and listened to the pod, I got curious about this. There are the obvious benefits, but I'm wondering more about distributing inference for a single request (sketched below). It's a pipe-dream until it isn't.
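To make the single-request idea concrete, here is a purely hypothetical, toy-level sketch (all names invented; this is not the hivemind or Petals API). Because every recurrence step reuses the same block weights, the latent state for one request could in principle hop between peers that each hold a replica of the block, rather than pipelining distinct layers:

```python
from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class Peer:
    """Stand-in for a remote worker holding a replica of the recurrent block."""
    name: str
    recurrent_block: Callable[[torch.Tensor], torch.Tensor]


def distributed_latent_steps(
    latent: torch.Tensor, peers: List[Peer], total_steps: int
) -> torch.Tensor:
    # Round-robin the latent state across peers; each peer runs one recurrence
    # step with identical weights, so any peer can pick up where another left off.
    for step in range(total_steps):
        latent = peers[step % len(peers)].recurrent_block(latent)
    return latent


# Toy usage: a Linear layer stands in for the real recurrent block.
block = torch.nn.Linear(8, 8)
peers = [Peer("peer-a", block), Peer("peer-b", block)]
result = distributed_latent_steps(torch.randn(1, 8), peers, total_steps=6)
print(result.shape)  # torch.Size([1, 8])
```

Whether this actually beats naive layer pipelining in practice is exactly the open question above.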

Papers

https://arxiv.org/abs/2502.05171

https://arxiv.org/abs/2402.14020

POC Model: https://huggingface.co/tomg-group-umd/huginn-0125

Code

https://github.com/seal-rg/recurrent-pretraining

https://github.com/gair-nlp/prox

Interview Pod: https://www.youtube.com/watch?v=dY90DXLi0vk

easy

bitnom · Mar 17 '25 21:03