Initial loss increased from 10 (v0.3.0) to 60 (v0.4.0)!
🐛 Describe the bug
There is a significant discrepancy in the initial loss values between different versions of OLMo, and between runs with and without the step-738020 checkpoint. This suggests a problem with model initialization or checkpoint handling in version 0.4.0. I believe the following results are reproducible; this bug has already cost me a week.
Task:
- Training from scratch / fine-tuning on BIoMed
Results:
- olmo v0.4.0, w/ step-738020 ckpt -- initial loss is 71
- olmo v0.4.0, w/o step-738020 ckpt -- initial loss is 32
- olmo v0.3.0, w/ step-738020 ckpt -- initial loss is 2
- olmo v0.3.0, w/o step-738020 ckpt -- initial loss is 11
Versions
Built from source:
- olmo v0.4.0
- olmo v0.3.0
This happens not only with the BIoMed data; I see the same results with the data you provide (a quick way to inspect that file is sketched below):
- https://olmo-data.org/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy
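Here is a minimal sketch for inspecting that file, assuming it is a flat array of uint16 token IDs (the format OLMo's memory-mapped dataset reads) and assuming a 2048-token sequence length; if the file actually carries a standard .npy header, `np.load(..., mmap_mode="r")` should be used instead:

```python
# Minimal sketch: peek at the preprocessed token file linked above.
# Assumption: a flat array of uint16 token IDs, downloaded locally as
# part-000-00000.npy. If it has a real .npy header, use
# np.load("part-000-00000.npy", mmap_mode="r") instead of np.memmap.
import numpy as np

data = np.memmap("part-000-00000.npy", dtype=np.uint16, mode="r")
print(data.shape, data[:16])  # total token count and a peek at the first IDs

seq_len = 2048  # assumed OLMo-1B training sequence length
batch = np.asarray(data[: 4 * seq_len]).reshape(4, seq_len)  # a tiny batch of 4 sequences
```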
@Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?
I run OLMo with the following command, without modifying the source code, so the default checkpoint-loading code of v0.3.0 and v0.4.0 is used.
```
torchrun --nproc_per_node=4 --master_port=29216 OLMo/scripts/train.py config/bio/OLMo-1B.yaml \
    --save_overwrite \
    --reset_trainer_state \
    --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/g4g72enr/step738020-unsharded/
```
> @Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?
You can refer directly to the results listed above.
Using or not using a pretrained checkpoint, as well as the version of OLMo, can result in different initial loss values.
Loosely speaking, v0.3.0 produces correct loss values, while the loss values in v0.4.0 are incorrect. In v0.4.0, using the pretrained checkpoint results in an even higher loss, which is clearly an error. There seems to be an issue with the loss calculation in the v0.4.0 code. Because this version changed so much, it's difficult for me to compare the two. Could you please take a look?
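For reference, here is a rough sanity check (an estimate of my own, assuming the ~50k-token gpt-neox-20b vocabulary with the padded embedding size from the OLMo-1B config): a randomly initialized model should start near the entropy of a uniform distribution over the vocabulary, ln(V) ≈ 10.8, which matches the ~11 seen without a checkpoint in v0.3.0, and a properly loaded pretrained checkpoint should be far lower (the ~2 seen in v0.3.0). Losses of 32 or 71 are far above both.

```python
# Rough expectation for the initial loss.
# Assumption: gpt-neox-20b vocabulary, padded to ~50k as in the OLMo-1B config.
import math

vocab_size = 50_304  # assumed (padded) OLMo-1B vocabulary size
print(round(math.log(vocab_size), 2))  # ~10.83: what a random init should score
# A correctly loaded pretrained checkpoint should score far below this (~2 here),
# so initial losses of 32 or 71 point to corrupted model state, not merely
# missing weights.
```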
Since you are building from source, it's possible that you were affected by the bug that was fixed in https://github.com/allenai/OLMo/pull/680. Could you pull the commit and see if that fixes your issue?
I am seeing your issue locally now and it is not fixed by #680. I am investigating.
Thank you very much! I think this might be a rather urgent bug since it leads to training errors. Reverting to v0.3.0 works for me for now.
Upon further investigation, the instances of bad loss we observed outside of #680 were due to a bad setup (a broken container or an incorrect config).
In particular, I started a run from a checkpoint while passing --force_save_unsharded --dry_run so that the model would be loaded and immediately saved without any training. Then I ran scripts/compare_model_state.py on the original checkpoint and the newly saved one and saw that they were different, which suggested that something was corrupting model state before training even started. When I did the same thing in a healthy container, I saw no difference between the two checkpoints.
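If you want to run a similar check without the script, here is a minimal sketch of the same comparison. It assumes the unsharded checkpoint layout with a plain `model.pt` state dict in each directory, and the paths in the usage comment are hypothetical:

```python
# Minimal sketch of the checkpoint comparison described above.
# Assumption: unsharded checkpoints that each contain a plain `model.pt`
# state dict of parameter tensors.
import torch

def diff_checkpoints(dir_a: str, dir_b: str) -> None:
    state_a = torch.load(f"{dir_a}/model.pt", map_location="cpu")
    state_b = torch.load(f"{dir_b}/model.pt", map_location="cpu")
    only_one = state_a.keys() ^ state_b.keys()
    if only_one:
        print("keys present in only one checkpoint:", sorted(only_one))
    for name in sorted(state_a.keys() & state_b.keys()):
        if not torch.equal(state_a[name], state_b[name]):
            print("parameter differs:", name)

# Original checkpoint vs. the one re-saved by --force_save_unsharded --dry_run;
# in a healthy setup these should be identical (hypothetical paths).
# diff_checkpoints("step738020-unsharded", "my_run/step738020-unsharded")
```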
If you find out what's causing the issue for you in 0.4.0, please let us know. We will also update here if we run into the issue again.
I'll close this, but feel free to re-open if you find you still have the problem.