Reproducibility results for Sudoku-Extreme and ARC-AGI-1
We at @HigherOrderCO decided to reproduce the HRM results because they are very interesting, especially when comparing the total compute time against other models/architectures (such as LLMs).
First, we chose to run the small Sudoku-Extreme 9x9 experiment. We used one H200 GPU, and training took approximately one hour.
The training process was exactly the one described in the README, with:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=2000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0
and evaluation:
OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ loose-caracara/step_26040
As evaluation results, we got:
- 45.8% accuracy (about 10 percentage points below the 55% reported in the paper)
- perfect halting accuracy
- 27275266 parameters
Then we started a run to reproduce the ARC-AGI-1 experiment. We used 8 H200 GPUs and the run took roughly 24 hours.
Built the dataset with:
python dataset/build_arc_dataset.py
And the training with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 pretrain.py
Finally, the evaluation with:
OMP_NUM_THREADS=8 torchrun --nproc-per-node 8 evaluate.py checkpoint=<CHECKPOINT_PATH>
We got the results:
- ~25% accuracy (about 15 percentage points below the reported 40%)
- 58% halting accuracy
- 27276290 params
With this we successfully reproduced the HRM experiment. The only question that remains from my end is why we got about 10 points less on Sudoku and 15 points less on ARC. Since I saw a tweet from someone here saying the compute time for ARC was 50~200 hours (setup not shared, not sure which GPUs), I assume they ran the training longer and/or slightly changed the setup.
Anyway, it's certainly interesting that the model gets ~25% with 960 examples and 24 hours of training time.
Thanks for your reproduction run! The evaluate.py script does not handle majority voting and is only 1-shot, so it comes out about 15 points lower. ARC-AGI allows 2 attempts per task. Could you run arc_eval.ipynb to do majority voting and check the pass@2 result? That should be close to our final reported result.
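For anyone curious what that evaluation looks like, here is a minimal, hypothetical sketch of pass@2 with majority voting (not the actual arc_eval.ipynb code): candidate grids per task are tallied, the two most common grids are submitted as the two attempts, and the task counts as solved if either matches the ground truth.

```python
# Hypothetical sketch of pass@2 with majority voting (not arc_eval.ipynb).
# `predictions[task_id]` is assumed to be a list of candidate output grids
# (e.g. one per augmented view, mapped back to the original orientation),
# each grid a tuple of tuples so it is hashable; `solutions[task_id]` is the
# ground-truth grid.
from collections import Counter

def pass_at_2(predictions, solutions):
    solved = 0
    for task_id, candidates in predictions.items():
        votes = Counter(candidates)                        # majority voting
        top2 = [grid for grid, _ in votes.most_common(2)]  # two allowed attempts
        if solutions[task_id] in top2:
            solved += 1
    return solved / len(predictions)
```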
Besides, for Sudoku the 1000-example training set is small, so expect a bit of variance. Try training a little longer and early-stopping just before overfitting. We observed a standard deviation of about 2%.
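As a rough illustration of that early-stopping advice, here is a hedged sketch that keeps the checkpoint with the best eval exact_accuracy rather than the last one (`train_step_fn`, `evaluate_fn`, and `save_checkpoint_fn` are hypothetical hooks, not functions from this repository):

```python
# Hedged sketch of "train a bit longer, early stop before overfitting".
def train_with_best_checkpoint(train_step_fn, evaluate_fn, save_checkpoint_fn,
                               total_steps=26041, eval_every=2000):
    best_acc = 0.0
    for step in range(1, total_steps + 1):
        train_step_fn(step)
        if step % eval_every == 0 or step == total_steps:
            acc = evaluate_fn()["exact_accuracy"]
            if acc > best_acc:           # keep the best checkpoint, not the last
                best_acc = acc
                save_checkpoint_fn(step)
            # if eval accuracy starts dropping, the earlier snapshot is the one to keep
    return best_acc
```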
I made 2 runs of Sudoku-Extreme 1k on an RTX 5090 and the results match:
- 1st run: 57.49%
- 2nd run: 55.37%
First run:
$ WANDB_MODE=offline OMP_NUM_THREADS=96 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=20000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0 > training-1.log 2>&1 &
disown
...
$ cat training-1.log
100%|██████████| 26041/26041 [1:24:57<00:00, 5.15it/s]wandb:
wandb:
wandb: Run history:
wandb: num_params ▁
wandb: train/accuracy ▁▁▁▆▆▆▆▆▆▆▆▆▆▆▆▆▆▇▇▆█▇▇▇▇▆▇██▇██████▇███
wandb: train/count ▁███████████████████████████████████████
wandb: train/exact_accuracy ▁▁▁▁▁▁▁▁▁▁▂▁▂▃▂▃▂▄▃▄▄▄▄▄▃▆▃▃▆▆▆▅▆▆██▇▇▇█
wandb: train/lm_loss ██▅▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▃▄▄▃▃▄▃▃▃▃▁▃▃▂
wandb: train/lr ▁▁▄█████████████████████████████████████
wandb: train/q_continue_loss ▁▁▁▁▁▁▂▂▃▃▄▃▅▅█▆▆▆▇▆▇▆▇▇▇▆▆▆▆▄▆▅▆▅▅▅▅▄▄▄
wandb: train/q_halt_accuracy ▁▁▁█████████████████████████████████████
wandb: train/q_halt_loss ▂▁▁▁▁▁▄▃▃▂▃▂▂▄▃▂▃▁▂▂▄▃▃▂▄▃▃▂▃▃▄▃▂█▂▂▃▂▅▄
wandb: train/steps ▁▁▁▁▁██▇▇▇█▆▇█▇▇▇▇▇▇█▆▆▆█▆▅▅██▅▄▄▄▄▅▅▄▃▇
wandb:
wandb: Run summary:
wandb: num_params 27275266
wandb: train/accuracy 0.95997
wandb: train/count 1
wandb: train/exact_accuracy 0.86364
wandb: train/lm_loss 0.42158
wandb: train/lr 0.0001
wandb: train/q_continue_loss 0.20734
wandb: train/q_halt_accuracy 1
wandb: train/q_halt_loss 0.02121
wandb: train/steps 8.39394
Eval:
$ OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000\ ACT-torch/HierarchicalReasoningModel_ACTV1\ splendid-centipede/step_26041
Starting evaluation
{'all': {'accuracy': np.float32(0.8414718), 'exact_accuracy': np.float32(0.57494336), 'lm_loss': np.float32(0.3906889), 'q_halt_accuracy': np.float32(0.9982686), 'q_halt_loss': np.float32(0.015446215), 'steps': np.float32(16.0)}}
Run 2:
$ WANDB_MODE=offline OMP_NUM_THREADS=96 torchrun --nproc-per-node 1 pretrain.py data_path=data/sudoku-extreme-1k-aug-1000 epochs=20000 eval_interval=20000 lr=1e-4 puzzle_emb_lr=1e-4 weight_decay=1.0 puzzle_emb_weight_decay=1.0 > training-2.log 2>&1 &
disown
...
$ cat training-2.log
100%|██████████| 26041/26041 [1:24:37<00:00, 5.15it/s]wandb:
wandb:
wandb: Run history:
wandb: num_params ▁
wandb: train/accuracy ▁▁▆▆▆▆▆▆▆▆▆▇▆▇▇▇▇▇▇▇▇█▇▇██▇▆▇███████████
wandb: train/count █▁▁▁▁███████████████████████████████████
wandb: train/exact_accuracy ▁▁▁▁▁▁▁▂▂▄▂▃▃▃▄▄▅▄▄▄▅▆▅▅▅▅▇▅▅▅▇▆▆▆▃▇▇███
wandb: train/lm_loss █▅▅▅▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁
wandb: train/lr ▁▇██████████████████████████████████████
wandb: train/q_continue_loss ▁▁▁▂▁▁▂▂▃▃▂▂▃▃▃▃▄▃▅▅▆▅▇█▆█▃▇█▅▅▇▅▇▅▆▅▅▇▇
wandb: train/q_halt_accuracy ▁▁▁▇████████████████████████████████████
wandb: train/q_halt_loss ▁▁▁▁▆▄██▃▅▄▃▄▂▅▅▄▄▇▅▃▄▄▆█▂▆▇▃▇▅▇▃▆▄▇▅▄▆▅
wandb: train/steps ▁▁▁██████▇▇▇█▇▇▅▆▆▆█▆▅▆▅█▅▅▅▅▄▇▅▅▄▅▅▄▄▄█
wandb:
wandb: Run summary:
wandb: num_params 27275266
wandb: train/accuracy 0.95764
wandb: train/count 1
wandb: train/exact_accuracy 0.85345
wandb: train/lm_loss 0.4965
wandb: train/lr 0.0001
wandb: train/q_continue_loss 0.29144
wandb: train/q_halt_accuracy 1
wandb: train/q_halt_loss 0.02581
wandb: train/steps 6.18966
Eval:
$ OMP_NUM_THREADS=1 torchrun --nproc-per-node 1 evaluate.py checkpoint=checkpoints/Sudoku-extreme-1k-aug-1000-ACT-torch/HierarchicalReasoningModel_ACTV1_run2/step_26041
Starting evaluation
{'all': {'accuracy': np.float32(0.83546376), 'exact_accuracy': np.float32(0.5537861), 'lm_loss': np.float32(0.40457007), 'q_halt_accuracy': np.float32(0.9962913), 'q_halt_loss': np.float32(0.031623095), 'steps': np.float32(16.0)}}
People online are now saying that the HRM paper has data leaks, so it would be good if people could replicate the results with better train-test splits: https://github.com/sapientinc/HRM/issues/45
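As one way to build such a split, here is a hedged sketch that groups augmentations by their base puzzle before splitting, so no augmented variant of a held-out puzzle leaks into training (`base_id_of` is a hypothetical helper mapping an example to the puzzle it was augmented from, not part of the repo):

```python
# Hedged sketch of a leak-free split: split at the base-puzzle level,
# then assign every augmentation to the same side as its base puzzle.
import random

def split_without_leakage(examples, base_id_of, test_fraction=0.2, seed=0):
    groups = {}
    for ex in examples:
        groups.setdefault(base_id_of(ex), []).append(ex)

    base_ids = sorted(groups)
    random.Random(seed).shuffle(base_ids)
    n_test = int(len(base_ids) * test_fraction)
    test_ids = set(base_ids[:n_test])

    train = [ex for bid in base_ids if bid not in test_ids for ex in groups[bid]]
    test = [ex for bid in test_ids for ex in groups[bid]]
    return train, test
```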
Comments from Gabriel Mongaras' viewership
They trained and tested both on augmentations of the ARC puzzles, which is basically leaking. I tried both this and removing the H modules and accumulating losses over the cycles, and did not find any improvement over a vanilla transformer (I tested on a noisy time-series dataset). Adding recurrence in the depth direction does not seem to improve the model's capability, which is disappointing. Edit: I think adding recurrent structures to transformer models might still be useful for reasoning-type problems; it just didn't work for my use case.
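For readers wondering what "accumulating losses over the cycles" means, here is a minimal toy sketch (an assumed setup, not the HRM code and not the commenter's exact experiment): one transformer block is reused for several recurrent cycles and a cross-entropy loss term is added at every cycle.

```python
# Toy sketch of recurrent depth with accumulated per-cycle losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentDepthToy(nn.Module):
    def __init__(self, vocab_size=11, d_model=64, n_heads=4, n_cycles=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)
        self.n_cycles = n_cycles

    def forward(self, tokens, targets=None):
        h = self.embed(tokens)
        total_loss, logits = 0.0, None
        for _ in range(self.n_cycles):
            h = self.block(h)            # same weights reused each cycle
            logits = self.head(h)
            if targets is not None:      # accumulate one loss term per cycle
                total_loss = total_loss + F.cross_entropy(
                    logits.flatten(0, 1), targets.flatten())
        return logits, total_loss
```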
Here is a tip I can share (sorry the whole thing can't be shared). We needed to gather better metrics, and the evaluate() function in pretrain.py has a design flaw that makes the train-to-eval time ratio very low: it's almost 1:99 per batch, with evaluation eating the processing time, but it flips to 99:1 in training's favor if you stop running inference on the ~1000 puzzles the model can't solve. Batch size matters! They do warn about this, but before any testing is done, the code should be analyzed as well. With regular, cheaper evaluations you can get a sense of progression that is otherwise masked, since the model is quickly being force-fed puzzles with no check of how well it's doing. We don't do more than about 14 seconds of evaluation on average. My other suggestion: keep wandb disabled.
```python
# Very aggressive early stopping - prioritize training time over evaluation completeness
if puzzles_checked >= 64:  # Check after 2 batches minimum
    failure_rate = failures / puzzles_checked
    step = train_state.step

    # Training-focused thresholds - stop early unless model is clearly working well
    if step < 500:
        max_failure_rate = 0.70  # 70% failures acceptable - very early training
    elif step < 2000:
        max_failure_rate = 0.60  # 60% failures acceptable - early training
    elif step < 5000:
        max_failure_rate = 0.50  # 50% failures acceptable - mid training
    elif step < 20000:
        max_failure_rate = 0.40  # 40% failures acceptable - later training
    else:
        max_failure_rate = 0.30  # 30% failures acceptable - advanced training
```
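For completeness, here is a hedged sketch of how such a threshold might be used to cut an evaluation loop short (a hypothetical loop; `eval_batches`, `solve_batch`, and the threshold argument are stand-ins, not the repo's evaluate()):

```python
# Hedged sketch: stop evaluating once the failure rate exceeds the threshold.
def quick_eval(eval_batches, solve_batch, max_failure_rate):
    puzzles_checked, failures = 0, 0
    for batch in eval_batches:
        solved, total = solve_batch(batch)   # puzzles solved in this batch
        puzzles_checked += total
        failures += total - solved
        if puzzles_checked >= 64 and failures / puzzles_checked > max_failure_rate:
            break                            # model clearly not ready: stop evaluating
    return 1.0 - failures / max(puzzles_checked, 1)
```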
The model is pretty powerful; there are going to be vast use cases and a clear reduction in training time. Training will be transformed.