The evaluation scores of allenai/OLMo-2-0425-1B-SFT cannot be reproduced
Your open-source work is very nice. Could you share the hyperparameters that reproduce OLMo-2-0425-1B-Instruct in the documentation? I ran into problems reproducing the evaluation scores, especially on math, and I tried many learning rates. The following parameters were used:
--num_nodes 4 \
--gpus 8 \
--mixed_precision bf16 \
--model_name_or_path allenai/OLMo-2-0425-1B \
--tokenizer_name allenai/OLMo-2-0425-1B \
--use_slow_tokenizer False \
--use_flash_attn \
--max_seq_length 4096 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 4 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0.0 \
--num_train_epochs 2 \
--reduce_loss sum \
--model_revision main \
--dataset_mixer_list allenai/tulu-3-sft-olmo-2-mixture-0225 1.0 \
--add_bos \
--seed 123 \
--push_to_hub False \
--try_launch_beaker_eval_jobs False
The scores I get are as follows:
"all_primary_scores": [
  "gsm8k::tulu: 0.430629",
  "drop::llama3: 0.322801",
  "ifeval::tulu: 0.471349",
  "mmlu:rc::olmes: 0.377568",
]
The scores on the model card (https://huggingface.co/allenai/OLMo-2-0425-1B-Instruct) are as follows:
"all_primary_scores": [
  "gsm8k::tulu: 52.1",
  "drop::llama3: 33.8",
  "ifeval::tulu: 50.5",
  "mmlu:rc::olmes: 36.4",
]
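Note that the two lists are on different scales: the reproduced numbers are fractions, while the model card reports percentages. A quick shell check converting them for a like-for-like comparison (the converted figures are derived from the numbers above, not stated elsewhere in the thread):

# convert the reproduced fractions to percentages (model-card scores are already percentages)
python3 -c "print(round(0.430629 * 100, 1))"   # 43.1 vs. 52.1 reported for gsm8k::tulu
python3 -c "print(round(0.322801 * 100, 1))"   # 32.3 vs. 33.8 reported for drop::llama3
python3 -c "print(round(0.471349 * 100, 1))"   # 47.1 vs. 50.5 reported for ifeval::tulu
python3 -c "print(round(0.377568 * 100, 1))"   # 37.8 vs. 36.4 reported for mmlu:rc::olmes

On that scale, mmlu actually matches, and the clear gaps are gsm8k (43.1 vs. 52.1) and ifeval (47.1 vs. 50.5), which is consistent with the "especially math" observation above.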
Hey @RobinsonKO -- looking into this. For example, here is the sort of command we ran. Note I didn't check the exact hyperparameters; this is copied from one of the SFT experiments:
python mason.py \
--cluster ai2/augusta-google-1 \
--workspace ai2/olmo-instruct \
--priority high \
--image nathanl/open_instruct_auto --pure_docker_mode \
--preemptible \
--num_nodes 1 \
--budget ai2/oe-adapt \
--gpus 8 -- accelerate launch \
--mixed_precision bf16 \
--num_processes 8 \
--use_deepspeed \
--deepspeed_config_file configs/ds_configs/stage3_no_offloading_accelerate.conf \
--deepspeed_multinode_launcher standard \
open_instruct/finetune.py \
--exp_name olmo2_1b_sft \
--model_name_or_path allenai/OLMo-2-0425-1B \
--model_revision main \
--tokenizer_name allenai/OLMo-2-1124-7B \
--tokenizer_revision main \
--use_slow_tokenizer False \
--add_bos \
--dataset_mixer_list allenai/tulu-3-sft-olmo-2-mixture-0225 1.0 \
--use_flash_attn \
--max_seq_length 4096 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--learning_rate 2e-5 \
--lr_scheduler_type linear \
--warmup_ratio 0.03 \
--weight_decay 0.0 \
--num_train_epochs 2 \
--reduce_loss sum \
--report_to wandb \
--with_tracking \
--logging_steps 1 \
--seed 8
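Comparing the two commands, the effective batch size is actually identical, so the learning rate looks like the main difference. A quick sanity check with shell arithmetic over the flags above (the 128-sequence figure is derived, not stated anywhere in the thread):

# effective batch = nodes * gpus * per_device_train_batch_size * gradient_accumulation_steps
echo $((4 * 8 * 1 * 4))    # reproduction attempt above: 128
echo $((1 * 8 * 1 * 16))   # this reference command: 128

With the batch size held constant, the remaining deltas are the 10x higher learning rate (2e-4 vs. 2e-5) and the tokenizer (allenai/OLMo-2-0425-1B vs. allenai/OLMo-2-1124-7B).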
I'll improve the docs soon (like this: https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md)
How did you evaluate your models?
Thank you for the reply; I will try this setting and look forward to your docs. By the way, I use OLMES for evaluation.
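For concreteness, an OLMES run over the tasks above would look roughly like this. This is a sketch only: the oe-eval entry point and flag names are assumptions about the allenai/olmes CLI, so check its README for exact usage; the task names are taken from the scores listed earlier, and the model path is a placeholder:

pip install git+https://github.com/allenai/olmes.git
oe-eval \
  --model /path/to/your-sft-checkpoint \
  --task gsm8k::tulu drop::llama3 ifeval::tulu mmlu:rc::olmes \
  --output-dir olmes-results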
Can you say more, @RobinsonKO? I bet the OLMES repo has drifted a bit from our setup (you can see it's not updated often), so it'll be hard to track down. I am adding the exact commands used to train the models in #703.