Test with a New Puzzle out of the Dataset?

Open johnnyZeppelin opened this issue 5 months ago • 5 comments

It is really unbelievable that HRM solves Sudoku instantly, without writing code and without CoT (none at all). However, I worry that the evaluation uses puzzles that are included in the training sets. To verify that HRM really has this tremendous superiority, I think we should test it on Sudoku puzzles that are completely different from those in the datasets used by this project. It would be a real milestone if HRM could solve even a few of them, considering that no advanced LLM today can solve even a single Sudoku without coding.
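For anyone who wants to run this kind of check, here is a minimal sketch of filtering candidate puzzles down to ones that do not appear in the training data. It assumes puzzles are stored as 81-character strings with `.` or `0` for blanks, which may not match this project's dataset format. The names `canonical_form` and `held_out` are hypothetical. The check only normalizes digit relabeling; it does not catch other symmetries such as rotations or row and column swaps.

```python
def canonical_form(puzzle: str) -> str:
    """Relabel digits by order of first appearance.

    Two puzzles that differ only by a permutation of the digits 1-9
    map to the same canonical string. Blanks ('.' or '0') become '.'.
    """
    mapping = {}
    out = []
    for ch in puzzle:
        if ch in ".0":
            out.append(".")
        else:
            if ch not in mapping:
                mapping[ch] = str(len(mapping) + 1)
            out.append(mapping[ch])
    return "".join(out)


def held_out(candidates, training_puzzles):
    """Return the candidates whose canonical form is absent from the training set."""
    seen = {canonical_form(p) for p in training_puzzles}
    return [p for p in candidates if canonical_form(p) not in seen]
```

A puzzle that survives this filter is at least not a digit-relabeled copy of a training puzzle, which is a weaker guarantee than being "completely different".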

johnnyZeppelin avatar Aug 06 '25 09:08 johnnyZeppelin

We've run the handcrafted Sudoku set from https://github.com/SakanaAI/Sudoku-Bench

The released 1000-example checkpoint achieved 92%. The following is the complete solution process for Sudoku-Bench.

sudoku-nikoli.pdf

imoneoi avatar Aug 06 '25 13:08 imoneoi

OMG...what have you guys cooked! seriously!?!

narvind2003 avatar Aug 06 '25 16:08 narvind2003

In the cases where it failed, it ran for exactly 16 steps. This is due to a hard limit:

https://github.com/sapientinc/HRM/blob/4047578a02e5deba975c38a1f32392547e66c071/config/arch/hrm_v1.yaml#L7

I wonder whether it could have solved those as well if it had kept going for a few more steps.

Of course, brute force solves all puzzles; I am talking about 20 or 24 steps.
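To illustrate why every failed case would stop at exactly the cap, here is a toy sketch of a loop with a step limit. This is not HRM's code: `run_with_step_limit` and the countdown `step` function are hypothetical, and the default of 16 only mirrors the number mentioned above. A run that never signals a halt always ends at `max_steps`, so raising the cap is the only way to see whether it would have finished later.

```python
from typing import Callable, Tuple, TypeVar

S = TypeVar("S")


def run_with_step_limit(
    step: Callable[[S], Tuple[S, bool]],
    state: S,
    max_steps: int = 16,  # assumed cap, mirroring the limit discussed above
) -> Tuple[S, int, bool]:
    """Call `step` until it signals a halt or `max_steps` is reached.

    Returns (final_state, steps_used, halted).
    """
    halted = False
    steps_used = 0
    for steps_used in range(1, max_steps + 1):
        state, halted = step(state)
        if halted:
            break
    return state, steps_used, halted


def countdown(n: int) -> Tuple[int, bool]:
    """Toy step function: needs n steps in total before it halts."""
    n -= 1
    return n, n <= 0
```

With this toy, a problem that needs 20 steps is cut off unsolved at 16 but halts on its own when the cap is 24. Whether the failed Sudoku cases behave like that is exactly the open question.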

kroggen avatar Aug 06 '25 18:08 kroggen

Thank you. You guys really made it. Magnificent work.

johnnyZeppelin avatar Aug 09 '25 01:08 johnnyZeppelin

I converted the logits into readable results. I have now confirmed that the model I trained on aug-1k solves Sudoku puzzles quickly and with tremendous accuracy. Really remarkable!
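For reference, a minimal sketch of what turning logits into a readable grid can look like. It assumes the model emits one score vector per cell (81 in total) and that token id = digit + `token_offset`, with anything outside 1-9 shown as a blank. Both the layout and the offset are assumptions, not taken from the repository, and `decode_grid` is a hypothetical name.

```python
def decode_grid(logits, token_offset=1):
    """Turn 81 per-cell score vectors into 9 printable rows.

    Each cell takes the highest-scoring token (argmax). The token is
    mapped to a digit by subtracting `token_offset` (an assumed encoding);
    values outside 1-9 are printed as '.'.
    """
    if len(logits) != 81:
        raise ValueError("expected 81 cells, got %d" % len(logits))
    cells = []
    for scores in logits:
        token = max(range(len(scores)), key=scores.__getitem__)
        digit = token - token_offset
        cells.append(str(digit) if 1 <= digit <= 9 else ".")
    return ["".join(cells[r * 9:(r + 1) * 9]) for r in range(9)]
```

Printing the result with `print("\n".join(decode_grid(logits)))` gives a 9x9 block that can be compared against the known solution by eye.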

Now I believe we should see whether HRM can scale to natural language tasks.

johnnyZeppelin avatar Aug 09 '25 01:08 johnnyZeppelin