Test with New Puzzles from Outside the Dataset?
It is really unbelievable that Sudoku can be solved instantly without coding or CoT (none used at all). However, I worry that the evaluation uses puzzles that are included in the training sets. To verify whether HRM really has this tremendous advantage, I think we should input Sudoku puzzles that are completely different from those in the datasets used by this project. I think it would be a real milestone if HRM could solve even a few of them, considering that no advanced LLM today can solve even a single Sudoku without coding.
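One way to check the train/test overlap concern is sketched below. It assumes each puzzle is a string of digits with `0` or `.` for blanks, which may not match this project's actual data format; `canonical_relabel` and `find_overlap` are hypothetical helper names, not part of the HRM repo. The sketch relabels digits by order of first appearance, so two puzzles that differ only by a digit permutation compare equal. It does not catch rotations, reflections, or row/column shuffles.

```python
def canonical_relabel(puzzle: str) -> str:
    """Relabel digits by order of first appearance; blanks ('0' or '.') become '.'."""
    mapping = {}
    out = []
    for ch in puzzle:
        if ch in "0.":
            out.append(".")
        else:
            if ch not in mapping:
                mapping[ch] = str(len(mapping) + 1)
            out.append(mapping[ch])
    return "".join(out)


def find_overlap(train_puzzles, test_puzzles):
    """Return the test puzzles whose canonical form also occurs in the training set."""
    seen = {canonical_relabel(p) for p in train_puzzles}
    return [p for p in test_puzzles if canonical_relabel(p) in seen]
```

An empty result from `find_overlap` only rules out exact and digit-permuted duplicates; a stricter check would also canonicalize the geometric symmetries.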
We've run the handcrafted Sudoku set from https://github.com/SakanaAI/Sudoku-Bench
The released 1000-example checkpoint achieved 92%. The following is the complete solution process for Sudoku-Bench.
OMG...what have you guys cooked! seriously!?!
In the cases where it failed, it ran for exactly 16 steps. This is due to a hard limit:
https://github.com/sapientinc/HRM/blob/4047578a02e5deba975c38a1f32392547e66c071/config/arch/hrm_v1.yaml#L7
I wonder whether it could have solved those as well if it had kept going for a few more steps.
Of course, brute force would eventually solve every puzzle; I am talking about 20 or 24 steps.
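The "a few more steps" question can be framed as a budgeted loop. The following is a generic sketch, not HRM's actual inference code: `step_fn`, `is_solved`, and `run_with_budget` are hypothetical names, and the toy usage in the note only stands in for a model that would need 20 steps.

```python
def run_with_budget(step_fn, state, is_solved, max_steps):
    """Apply step_fn up to max_steps times; return (state, steps_used or None)."""
    for step in range(1, max_steps + 1):
        state = step_fn(state)
        if is_solved(state):
            return state, step
    # Budget exhausted without reaching a solved state.
    return state, None
```

With a toy problem that needs 20 steps, `run_with_budget(lambda x: x + 1, 0, lambda x: x >= 20, 16)` hits the limit, while the same call with a budget of 24 finishes at step 20. Re-running only the failed puzzles with a larger budget would show whether the 16-step limit is what stops them.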
Thank you. You guys really made it. Magnificent work.
I updated the code to turn the logits into readable output. I have now confirmed that the model I trained on aug-1k solves Sudoku puzzles quickly and with very high accuracy. Really remarkable!
Now I believe we should see whether HRM can scale to natural language tasks.