[Feature Request] Recurrent Depth Latent Reasoning
Potentially significant implications for scaling the performance of distributed inference, and potentially greater implications for distributing inference than a naive implementation would suggest (an initial thought/guess; citation needed). Transformers already supports it, with one caveat:
> The model requires its own KV-cache implementation, `HuginnDynamicCache`, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.

but I have no idea whether that custom cache involves sacrifices or leaves potential unrealized.
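For context on why the custom cache exists: per the paper, the model is a prelude, a weight-shared core block applied a variable number of times, and a coda. With a cache keyed only by layer index, every pass through the core would clobber the previous pass's entries. A minimal sketch of the idea, assuming nothing about the real `HuginnDynamicCache` internals (`StepKeyedKVCache` and `recurrent_forward` are made-up names):

```python
import torch


class StepKeyedKVCache:
    """Toy cache keyed by (layer_idx, recurrence_step).

    Keying by layer index alone would overwrite entries on every pass
    through the recurrent block; adding the step index keeps each pass's
    keys/values alive. This is roughly the problem HuginnDynamicCache solves.
    """

    def __init__(self):
        self._store = {}  # (layer_idx, step) -> (keys, values)

    def update(self, layer_idx, step, k, v):
        slot = (layer_idx, step)
        if slot in self._store:
            # Append entries for new tokens along the sequence axis.
            old_k, old_v = self._store[slot]
            k = torch.cat([old_k, k], dim=-2)
            v = torch.cat([old_v, v], dim=-2)
        self._store[slot] = (k, v)
        return k, v


def recurrent_forward(prelude, core, coda, x, num_steps, cache):
    """prelude -> core repeated num_steps times -> coda."""
    h = prelude(x)
    for step in range(num_steps):
        # Same core weights every iteration, but a distinct cache slot per step.
        h = core(h, step=step, cache=cache)
    return coda(h)
```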
Having recently read https://github.com/bigscience-workshop/petals/issues/483 and listened to the pod, I got curious about this. There are the obvious benefits, but I'm wondering more about distributing inference for a single request (a rough sketch of what I mean follows). It's a pipe-dream until it isn't.
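Because the core block is weight-shared, every peer could host the same small block instead of a distinct pipeline stage, and the recurrence steps of one request could hop between peers. A purely hypothetical sketch (`remote_core` is an invented placeholder, not a hivemind API):

```python
from typing import Callable, Sequence

import torch


def distributed_recurrence(
    h: torch.Tensor,
    num_steps: int,
    peers: Sequence[str],
    remote_core: Callable[[str, torch.Tensor, int], torch.Tensor],
) -> torch.Tensor:
    """Round-robin one request's recurrence steps across peers.

    Steps are sequential (each depends on the last), so this spreads
    memory/compute and adds redundancy rather than parallelizing steps.
    """
    for step in range(num_steps):
        peer = peers[step % len(peers)]
        h = remote_core(peer, h, step)  # placeholder RPC call
    return h
```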
Papers
https://arxiv.org/abs/2502.05171
https://arxiv.org/abs/2402.14020
POC Model: https://huggingface.co/tomg-group-umd/huginn-0125
Code
https://github.com/seal-rg/recurrent-pretraining
https://github.com/gair-nlp/prox
Interview Pod: https://www.youtube.com/watch?v=dY90DXLi0vk