
How to run pre-training?

Open BlessedTatonka opened this issue 1 year ago • 7 comments

Thank you for the great work! I have a question regarding pre-training. Could you please clarify which YAML configuration file should be used to achieve a pre-training setup similar to ModernBERT's, but for a different language? I noticed that the yamls folder doesn't seem to contain a specific file for this purpose. The only related script I found is generate_eval_config.py, which, if I understand correctly, generates a YAML configuration using ModernBERT's training params. Is my understanding correct, or am I missing something?

BlessedTatonka avatar Dec 23 '24 10:12 BlessedTatonka

Was wondering about this, too!

yzimmermann avatar Dec 25 '24 12:12 yzimmermann

same question

chaofan520 avatar Dec 26 '24 02:12 chaofan520

same question. Are there any examples provided?

GithubX-F avatar Jan 08 '25 05:01 GithubX-F

Hello, Sorry for the delayed response. We plan to write proper guides for running the pretraining in the next few days; we have been a bit short on time lately. In the meantime, I have dropped the configs for the first step of pretraining (warmup+stable phases) here if you want to give it a shot before we clean everything up. The ones for context extension and decay will be added shortly.
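
For anyone orienting themselves before those configs land: a Composer-style pretraining YAML for the warmup+stable phase might look roughly like the sketch below. This is a hypothetical illustration, not the repo's actual schema; the key names (`data_local`, `flex_bert`, the scheduler/optimizer names) are assumptions modeled on MosaicBERT-style configs, and the numbers echo the ModernBERT-base paper settings (30% masking, StableAdamW, trapezoidal LR schedule), so compare against the real configs once they are linked.

```yaml
# Hypothetical warmup+stable pretraining config. Key names are illustrative
# guesses at a Composer-style schema, not the repo's actual fields.
data_local: ./data/my-language-mds   # folder of MDS shards for your language
tokenizer_name: bert-base-uncased    # swap in a tokenizer for your language
max_seq_len: 1024                    # pre-extension context length
mlm_probability: 0.3                 # ModernBERT uses a 30% masking rate

model:
  name: flex_bert                    # assumed name in the repo's model registry

train_loader:
  name: text
  dataset:
    local: ${data_local}
    split: train
    streaming: false                 # see the streaming note below

scheduler:
  name: warmup_stable_decay          # trapezoidal LR: warmup + long stable phase
  t_warmup: 3000000000tok            # Composer accepts token-based durations

optimizer:
  name: decoupled_stableadamw        # StableAdamW, per the paper
  lr: 8.0e-4                         # base-model peak LR from the paper

global_train_batch_size: 4608        # base-model batch size from the paper
max_duration: 1700000000000tok       # ~1.7T tokens of warmup+stable training
```

If the repo follows the usual Composer setup, a run would then be launched with the Composer launcher, something like `composer main.py your_config.yaml`; check the readme for the actual entry point.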

Note that the data path should point to a data folder in the MDS format; you have an example with the C4 dataset here. Also note that, to use streaming: True, you might need to decompress the data using this script. Disabling streaming makes pretraining faster and solves an issue with uneven GPU memory allocation (see #85).
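
For getting data into the MDS format, the `mosaicml-streaming` package's `MDSWriter` is the standard tool; below is a minimal, hypothetical conversion sketch (the dataset choice, column layout, and output path are placeholders, and the repo's own C4 example remains the authoritative reference). Writing with `compression=None` should sidestep the decompression step mentioned above.

```python
# Hypothetical sketch: convert a Hugging Face text corpus into MDS shards.
# Dataset choice and paths are placeholders; adapt to your language's corpus.
from datasets import load_dataset
from streaming import MDSWriter

# Stream the source corpus so the whole dataset is never held in memory.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)

# One UTF-8 text column per sample, matching a simple text dataloader.
columns = {"text": "str"}

# compression=None writes uncompressed shards, so no separate decompression
# pass is needed before enabling `streaming: True` in the training config.
with MDSWriter(out="./data/c4-mds/train", columns=columns, compression=None) as writer:
    for sample in dataset:
        writer.write({"text": sample["text"]})
```

The training config's data path would then point at the folder containing these shards (e.g. `./data/c4-mds`).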

Again, sorry for the delay and hopefully we'll have better documentation soon.

NohTow avatar Jan 09 '25 09:01 NohTow

@NohTow Hi there! Any update on a proper step-by-step guide for pretraining?

BramVanroy avatar Feb 04 '25 13:02 BramVanroy

Hello,

Until we update the readmes and merge the configs, the above comment is the closest thing to a step-by-step guide. I agree that this is not optimal for now, and I apologize again for the delay, but could you specify what information you are lacking w.r.t. the comment so we can add it to the readme? Thanks!

Edit: actually, I forgot, but #183, which adds a bit of documentation to the main readme, has been merged. So besides merging the configs, is there anything you are missing?

NohTow avatar Feb 04 '25 14:02 NohTow

Hi @NohTow, following your guide, I am encountering an issue in pre-training. Could you help with #199?

ebrarkiziloglu avatar Feb 13 '25 09:02 ebrarkiziloglu