Initial loss increased from 10 (v0.3.0) to 60 (v0.4.0)!
🐛 Describe the bug
There is a significant discrepancy in the initial loss values between different versions of OLMo, and between runs with and without the step-738020 checkpoint. This suggests a problem with model initialization or checkpoint handling in version 0.4.0. I believe the following results are reproducible; this bug has already cost me a week.
Task:
- Training from scratch / fine-tuning on BIoMed
Results:
- olmo v0.4.0, w/ step-738020 ckpt -- initial loss is 71
- olmo v0.4.0, w/o step-738020 ckpt -- initial loss is 32
- olmo v0.3.0, w/ step-738020 ckpt -- initial loss is 2
- olmo v0.3.0, w/o step-738020 ckpt -- initial loss is 11
Versions
Built from source:
- olmo v0.4.0
- olmo v0.3.0
This happens not only with the BIoMed data; I see the same results with the data you provide (a quick way to inspect that file is sketched below):
- https://olmo-data.org/preprocessed/olmo-mix/v1_5/gpt-neox-20b-pii-special/part-000-00000.npy
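Here is a minimal sketch for inspecting that file, assuming it is a flat array of uint16 token IDs (the format OLMo's memory-mapped dataset reads) and assuming a 2048-token sequence length; if the file actually carries a standard .npy header, `np.load(..., mmap_mode="r")` should be used instead:

```python
# Minimal sketch: peek at the preprocessed token file linked above.
# Assumption: a flat array of uint16 token IDs, downloaded locally as
# part-000-00000.npy. If it has a real .npy header, use
# np.load("part-000-00000.npy", mmap_mode="r") instead of np.memmap.
import numpy as np

data = np.memmap("part-000-00000.npy", dtype=np.uint16, mode="r")
print(data.shape, data[:16])  # total token count and a peek at the first IDs

seq_len = 2048  # assumed OLMo-1B training sequence length
batch = np.asarray(data[: 4 * seq_len]).reshape(4, seq_len)  # a tiny batch of 4 sequences
```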
@Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?
I run OLMo with the following command, without modifying the source code, so the default checkpoint-loading code of v0.3.0 and v0.4.0 is used.
```
torchrun --nproc_per_node=4 --master_port=29216 OLMo/scripts/train.py config/bio/OLMo-1B.yaml \
    --save_overwrite \
    --reset_trainer_state \
    --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/g4g72enr/step738020-unsharded/
```
> @Xuekai-Zhu Can you say more on what you mean by the "presence or absence" of that checkpoint? And can you share the code you're using for loading?
You can refer directly to the results listed above.
Using or not using a pretrained checkpoint, as well as the version of OLMo, can result in different initial loss values.
Loosely speaking, v0.3.0 produces correct loss values, while the loss values in v0.4.0 are incorrect. In v0.4.0, using the pretrained checkpoint results in an even higher loss, which is clearly an error. There seems to be an issue with the loss calculation in the v0.4.0 code. Because this version changed so much, it's difficult for me to compare the two. Could you please take a look?
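For reference, here is a rough sanity check (an estimate of my own, assuming the ~50k-token gpt-neox-20b vocabulary with the padded embedding size from the OLMo-1B config): a randomly initialized model should start near the entropy of a uniform distribution over the vocabulary, ln(V) ≈ 10.8, which matches the ~11 seen without a checkpoint in v0.3.0, and a properly loaded pretrained checkpoint should be far lower (the ~2 seen in v0.3.0). Losses of 32 or 71 are far above both.

```python
# Rough expectation for the initial loss.
# Assumption: gpt-neox-20b vocabulary, padded to ~50k as in the OLMo-1B config.
import math

vocab_size = 50_304  # assumed (padded) OLMo-1B vocabulary size
print(round(math.log(vocab_size), 2))  # ~10.83: what a random init should score
# A correctly loaded pretrained checkpoint should score far below this (~2 here),
# so initial losses of 32 or 71 point to corrupted model state, not merely
# missing weights.
```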
Since you are building from source, it's possible that you were affected by the bug that was fixed in https://github.com/allenai/OLMo/pull/680. Could you pull the commit and see if that fixes your issue?
I am seeing your issue locally now and it is not fixed by #680. I am investigating.
Thank you very much! I think this might be a rather urgent bug since it leads to training errors. Reverting to v0.3.0 works for me for now.
Upon further investigation, the instances of bad loss we observed outside of #680 were due to a bad setup (a broken container or an incorrect config).
In particular, I started a run from a checkpoint while passing --force_save_unsharded --dry_run so that the model would be loaded and immediately saved without any training. Then I ran scripts/compare_model_state.py on the original checkpoint and the newly saved one and saw that they were different, which suggested that something was corrupting model state before training even started. When I did the same thing in a healthy container, I saw no difference between the two checkpoints.
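If you want to run a similar check without the script, here is a minimal sketch of the same comparison. It assumes the unsharded checkpoint layout with a plain `model.pt` state dict in each directory, and the paths in the usage comment are hypothetical:

```python
# Minimal sketch of the checkpoint comparison described above.
# Assumption: unsharded checkpoints that each contain a plain `model.pt`
# state dict of parameter tensors.
import torch

def diff_checkpoints(dir_a: str, dir_b: str) -> None:
    state_a = torch.load(f"{dir_a}/model.pt", map_location="cpu")
    state_b = torch.load(f"{dir_b}/model.pt", map_location="cpu")
    only_one = state_a.keys() ^ state_b.keys()
    if only_one:
        print("keys present in only one checkpoint:", sorted(only_one))
    for name in sorted(state_a.keys() & state_b.keys()):
        if not torch.equal(state_a[name], state_b[name]):
            print("parameter differs:", name)

# Original checkpoint vs. the one re-saved by --force_save_unsharded --dry_run;
# in a healthy setup these should be identical (hypothetical paths).
# diff_checkpoints("step738020-unsharded", "my_run/step738020-unsharded")
```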
If you find out what's causing the issue for you in 0.4.0, please let us know. We will also update here if we run into the issue again.
I'll close this, but feel free to re-open if you find you still have the problem.