The evaluation scores of allenai/OLMo-2-0425-1B-SFT cannot be reproduced
Your open-source work is very nice. Could you share the hyperparameters that reproduce OLMo-2-0425-1B-Instruct in the documentation? I ran into problems reproducing the evaluation scores, especially on math, and I tried many learning rates. The following parameters were used:
--num_nodes 4 \
--gpus 8 \
--mixed_precision bf16 \
--model_name_or_path allenai/OLMo-2-0425-1B \
--tokenizer_name allenai/OLMo-2-0425-1B \
--use_slow_tokenizer False \
--use_flash_attn \
--max_seq_length 4096 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0.0 \
--num_train_epochs 2 \
--reduce_loss sum \
--model_revision main \
--dataset_mixer_list allenai/tulu-3-sft-olmo-2-mixture-0225 1.0 \
--add_bos \
--seed 123 \
--push_to_hub False \
--try_launch_beaker_eval_jobs False
The scores I get are as follows:
"all_primary_scores": [
  "gsm8k::tulu: 0.430629",
  "drop::llama3: 0.322801",
  "ifeval::tulu: 0.471349",
  "mmlu:rc::olmes: 0.377568",
]
The scores on the model card (https://huggingface.co/allenai/OLMo-2-0425-1B-Instruct) are as follows:
"all_primary_scores": [
  "gsm8k::tulu: 52.1",
  "drop::llama3: 33.8",
  "ifeval::tulu: 50.5",
  "mmlu:rc::olmes: 36.4",
]
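Note that the two lists are on different scales: the reproduced numbers are fractions, while the model card reports percentages. A quick shell check converting them for a like-for-like comparison (the converted figures are derived from the numbers above, not stated elsewhere in the thread):

# convert the reproduced fractions to percentages (model-card scores are already percentages)
python3 -c "print(round(0.430629 * 100, 1))"   # 43.1 vs. 52.1 reported for gsm8k::tulu
python3 -c "print(round(0.322801 * 100, 1))"   # 32.3 vs. 33.8 reported for drop::llama3
python3 -c "print(round(0.471349 * 100, 1))"   # 47.1 vs. 50.5 reported for ifeval::tulu
python3 -c "print(round(0.377568 * 100, 1))"   # 37.8 vs. 36.4 reported for mmlu:rc::olmes

On that scale, mmlu actually matches, and the clear gaps are gsm8k (43.1 vs. 52.1) and ifeval (47.1 vs. 50.5), which is consistent with the "especially math" observation above.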
Hey @RobinsonKO -- looking into this. For example, here is the sort of command we ran. Note I didn't check the exact hyperparameters; this is copied from one of the SFT experiments:
python mason.py \
--cluster ai2/augusta-google-1 \
--workspace ai2/olmo-instruct \
--priority high \
--image nathanl/open_instruct_auto --pure_docker_mode \
--preemptible \
--num_nodes 1 \
--budget ai2/oe-adapt \
--gpus 8 -- accelerate launch \
--mixed_precision bf16 \
--num_processes 8 \
--use_deepspeed \
--deepspeed_config_file configs/ds_configs/stage3_no_offloading_accelerate.conf \
--deepspeed_multinode_launcher standard \
open_instruct/finetune.py \
--exp_name olmo2_1b_sft \
--model_name_or_path allenai/OLMo-2-0425-1B \
--model_revision main \
--tokenizer_name allenai/OLMo-2-1124-7B \
--tokenizer_revision main \
--use_slow_tokenizer False \
--add_bos \
--dataset_mixer_list allenai/tulu-3-sft-olmo-2-mixture-0225 1.0 \
--use_flash_attn \
--max_seq_length 4096 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0.0 \
--num_train_epochs 2 \
--reduce_loss sum \
--report_to wandb \
--with_tracking \
--logging_steps 1 \
--seed 8
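Comparing the two commands, the effective batch size is actually identical, so the learning rate looks like the main difference. A quick sanity check with shell arithmetic over the flags above (the 128-sequence figure is derived, not stated anywhere in the thread):

# effective batch = nodes * gpus * per_device_train_batch_size * gradient_accumulation_steps
echo $((4 * 8 * 1 * 4))    # reproduction attempt above: 128
echo $((1 * 8 * 1 * 16))   # this reference command: 128

With the batch size held constant, the remaining deltas are the 10x higher learning rate (2e-4 vs. 2e-5) and the tokenizer (allenai/OLMo-2-0425-1B vs. allenai/OLMo-2-1124-7B).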
I'll improve the docs soon (like this: https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md)
How did you evaluate your models?
Thank you for the reply; I will try this setting and look forward to your docs. By the way, I use OLMES for evaluation.
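For concreteness, an OLMES run over the tasks above would look roughly like this. This is a sketch only: the oe-eval entry point and flag names are assumptions about the allenai/olmes CLI, so check its README for exact usage; the task names are taken from the scores listed earlier, and the model path is a placeholder:

pip install git+https://github.com/allenai/olmes.git
oe-eval \
  --model /path/to/your-sft-checkpoint \
  --task gsm8k::tulu drop::llama3 ifeval::tulu mmlu:rc::olmes \
  --output-dir olmes-results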
Can you say more, @RobinsonKO? I bet the OLMES repo has drifted a bit from our setup (you can see it's not updated often), so it'll be hard to track down. I am adding the exact commands used to train the models in #703.