
Question about the role of Q-learning in ACT

Open p-nordmann opened this issue 5 months ago • 4 comments

Hello,

First, I would like to thank the authors for this amazing work, the ideas in it are very inspiring.

I had a question about something I didn't get in the paper: the role of Q learning. From what I understood, it is used to learn when to stop looping over the recurrent part of the network. However, I don't understand why it is necessary: if I understand correctly, the state is expected to converge toward a fixed point, so I guess that it is possible to halt when the residual's magnitude is close to zero, without having to learn anything related to stopping.
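To make concrete what I have in mind, here is a rough sketch (my own pseudo-code, not anything from the repo; `step`, `z`, and `x` are placeholder names) of halting on the residual of the recurrent state:

```python
# Rough sketch of a residual-based stopping rule: iterate the recurrent
# update and halt once the state stops changing. Assumes PyTorch tensors.
def run_until_fixed_point(step, z, x, tol=1e-4, max_steps=64):
    """step(z, x) -> next state; halt when the relative residual is small."""
    n_steps = 0
    for _ in range(max_steps):
        z_next = step(z, x)
        residual = (z_next - z).norm() / (z.norm() + 1e-8)  # relative change
        z = z_next
        n_steps += 1
        if residual < tol:  # halt purely on convergence, nothing is learned
            break
    return z, n_steps
```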

Is that correct or am I misunderstanding? Did you do ablations for this part?

Thank you in advance for your time, Pierre-Louis

p-nordmann avatar Aug 09 '25 10:08 p-nordmann

Quite an insightful question that gets to the heart of the model's design. You are right that one could use a simple fixed-point check, but here is why the learned Q-learning (ACT) approach is so powerful (a small sketch contrasting the two follows the list below):

  • It Avoids Premature Halting on Deceptive Problems: Many complex reasoning tasks have "deceptive local minima" - states where the model's internal state seems stable for a moment (a low residual), but it's actually just a temporary pause before a much deeper chain of logic is required. A simple residual check would halt here prematurely, while a learned ACT policy can learn that these specific "stable" states are traps and that it's worth continuing to think to achieve a better long-term reward.
  • It Allows for Strategic Backtracking and Hypothesis Testing: The system isn't just converging to one single answer. The M-loop (Thinking Sessions) is designed for the model to test different high-level hypotheses. The Q-learning policy learns the value of abandoning one line of reasoning (even if it looks "converged") to try another, more promising one. This is crucial for solving problems that require backtracking, which a simple fixed-point check cannot manage.
  • It Adapts to Task Difficulty More Robustly: While a residual check provides a stop signal, the ACT policy learns a much richer, context-dependent signal. It learns not just that the state has stopped changing, but whether this is a desirable state to stop in. This allows it to dynamically "think fast" on easy problems where convergence is quick and meaningful, and "think slow" on hard problems, pushing through those deceptive plateaus that would fool a simpler halting mechanism.
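To make the contrast concrete, here is a minimal sketch (my own illustration, not the paper's exact implementation; the two-way Q-head and all names are assumptions) of halting on a learned Q-value instead of on the residual:

```python
# Minimal sketch of a learned halting policy: a small Q-head reads the
# current state and scores "halt" vs "continue"; the loop stops when halting
# has higher predicted value, regardless of whether the residual is small.
import torch
import torch.nn as nn

class HaltingQHead(nn.Module):
    """Hypothetical two-way value head: outputs [Q_halt, Q_continue]."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.q = nn.Linear(hidden_dim, 2)

    def forward(self, z):
        return self.q(z)

def run_with_learned_halting(step, q_head, z, x, max_segments=16):
    """step(z, x) -> next state for one reasoning segment (unbatched here)."""
    n_segments = 0
    for _ in range(max_segments):
        z = step(z, x)
        q_halt, q_continue = q_head(z).unbind(-1)
        n_segments += 1
        # Halt only when the learned value of stopping exceeds continuing;
        # a state that merely looks "converged" can still be worth pushing past.
        if q_halt.item() > q_continue.item():
            break
    return z, n_segments
```

The key difference from a residual check is that this stopping signal is trained against task reward, so the model can learn to keep computing through deceptive plateaus and to abandon lines of reasoning that are stable but not actually desirable.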

I wrote a detailed blog post about the model using the Sudoku example... hope it helps: https://medium.com/@gedanken.thesis/the-loop-is-back-why-hrm-is-the-most-exciting-ai-architecture-in-years-7b8c4414c0b3

narvind2003 avatar Aug 10 '25 05:08 narvind2003

@narvind2003 Thanks for the LLM word salad.

davtoro avatar Aug 10 '25 15:08 davtoro

Apologies - just wanted to help. This is exciting for all of us, and I might have gotten carried away.

narvind2003 avatar Aug 11 '25 03:08 narvind2003

> Apologies - just wanted to help. This is exciting for all of us, and I might have gotten carried away.

No worries, happens to the best of us.

davtoro avatar Aug 11 '25 03:08 davtoro