[Discussion] Appreciation for the elegant training mechanism (DEQ approximation + Deep Supervision)
First off, thank you for this fantastic paper and for open-sourcing the code. The architecture is incredibly elegant, and I believe the way the model learns is just as innovative as the hierarchical structure itself.
I was particularly struck by the interplay between two core ideas:
- The use of a one-step gradient approximation, theoretically grounded as a first-order approximation of the DEQ/IFT gradient, to avoid the computational and memory costs of BPTT. 🥇
- The deep supervision loop (the M-loop), where the model performs multiple forward/backward passes on the same example, starting each new session from the hidden state of the last. 👏🏼👏🏼👏🏼
On its own, the one-step gradient seems myopic; it can't easily assign credit for errors that occurred early in a long computational chain. However, it seems the deep supervision loop brilliantly solves this. By making the flawed hidden state from the end of session m-1 the starting point for session m, any "buried" errors are brought into the present, where the one-step gradient can finally see them and correct the underlying logic.
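For readers who think in code, here is a minimal, runnable toy of that interplay. Small linear layers stand in for HRM's actual transformer blocks, and all names and sizes (`f_L`, `f_H`, `head`, `N`, `T`, `M`) are illustrative assumptions on my part, not the authors' implementation:

```python
# Toy sketch: deep supervision (M segments) wrapped around the one-step
# gradient approximation. Illustrative stand-in, not the paper's code.
import torch
import torch.nn as nn

d, N, T, M = 32, 2, 4, 3          # hidden dim, H-cycles, L-steps per cycle, segments
f_L = nn.Linear(3 * d, d)          # low-level update: f_L([z_L, z_H, x])
f_H = nn.Linear(2 * d, d)          # high-level update: f_H([z_H, z_L])
head = nn.Linear(d, d)             # readout from the high-level state
params = [*f_L.parameters(), *f_H.parameters(), *head.parameters()]
opt = torch.optim.Adam(params, lr=1e-3)

x, y = torch.randn(8, d), torch.randn(8, d)        # dummy batch
z_H, z_L = torch.zeros(8, d), torch.zeros(8, d)    # states carried across segments

for m in range(M):  # deep supervision: M forward/backward passes, same example
    # Run all but the final update with autograd off: no BPTT graph is kept.
    with torch.no_grad():
        for i in range(N * T - 1):
            z_L = torch.tanh(f_L(torch.cat([z_L, z_H, x], -1)))
            if (i + 1) % T == 0:
                z_H = torch.tanh(f_H(torch.cat([z_H, z_L], -1)))
    # One-step gradient: backprop only through the final L/H update,
    # i.e. the first-order approximation of the DEQ/IFT gradient.
    z_L = torch.tanh(f_L(torch.cat([z_L, z_H, x], -1)))
    z_H = torch.tanh(f_H(torch.cat([z_H, z_L], -1)))
    loss = nn.functional.mse_loss(head(z_H), y)
    loss.backward()
    opt.step()
    opt.zero_grad()
    # Detach so segment m+1 starts from segment m's hidden state,
    # but gradients never flow across segment boundaries.
    z_H, z_L = z_H.detach(), z_L.detach()
```

The `detach()` at the end of each segment is exactly the mechanism described above: the state (with whatever errors it carries) survives into the next segment, while the gradient computation stays local and cheap.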
This feels like a foundational insight into how to train deep, recurrent reasoning systems efficiently. Was this interplay (using the deep supervision loop to directly compensate for the myopic nature of the one-step gradient) a deliberate design from the start, or an emergent property discovered during experimentation?
I was so inspired by this design that I wrote a detailed blog post attempting to explain these inner workings with analogies, aimed at helping a broader audience appreciate the depth of the work. I'm sharing it here in case it's useful for others trying to understand the model's mechanics.
Thanks again for the wonderful contribution to the field.
---
Although the current "completion level" of a task can be inferred from the high-level latent `z_H` (in simplified form, via a linear layer + activation [+ BCE loss]), this architectural choice of tightly coupling task-solving with task-completion estimation in the same network is clearly not ideal: it imposes significant limitations on the types of tasks the model can handle.
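For concreteness, here is a toy sketch of the coupling being described, under the simplification given above (linear + activation [+ BCE]); the names are my assumptions, not the repository's code:

```python
# Sketch of the coupling: the completion/halt estimate is read directly
# off z_H, the same latent state that carries the solution.
import torch
import torch.nn as nn

d = 32
q_head = nn.Linear(d, 1)                  # completion head: linear layer on z_H
z_H = torch.randn(8, d)                   # high-level latent from the solver
halt_prob = torch.sigmoid(q_head(z_H))    # activation -> "task done" probability

# Trained with a BCE-style objective against a completion target:
target_done = torch.ones(8, 1)
halt_loss = nn.functional.binary_cross_entropy(halt_prob, target_done)
```

Because `z_H` must simultaneously encode the evolving solution and be linearly readable as a progress signal, the two roles constrain each other inside one network.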
---
Although recurrence, attention mechanisms, and even hierarchical structure are important characteristics of highly intelligent models, they are clearly not sufficient. Beyond widely used routing strategies and explicit long/short-term memory, I think fundamental problem-solving patterns, such as a loop of (associate/search + try/execute + estimate), are difficult to bypass; a toy sketch of this pattern follows below.
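As a toy illustration of that looped pattern (the task here, reaching a target integer via +1/×2 moves, is invented purely for the example and has nothing to do with HRM itself):

```python
# Toy instance of the looped (associate/search + try/execute + estimate)
# pattern: repeatedly propose moves, execute them, and score the results.

def associate(state):
    """Associate/search step: propose candidate moves for the current state."""
    return [state + 1, state * 2]

def estimate(state, goal):
    """Estimate step: score how close a state is to the goal (higher = closer)."""
    return -abs(goal - state)

def solve(start, goal, budget=50):
    state = start
    for _ in range(budget):                 # the outer loop
        if state == goal:
            return state
        # Try/execute each candidate and keep the best one by the estimate.
        state = max(associate(state), key=lambda s: estimate(s, goal))
    return state

print(solve(1, 37))  # greedy search reaches 37 via 1 -> 2 -> 4 -> 8 -> ... -> 37
```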
Could MoR (Mixture-of-Recursions, a Transformer-based architecture) combined with a diffusion model be used to simulate HRM?