
MosaicBERT: pretraining configuration for models > 128 seq. length

Open stefan-it opened this issue 1 year ago • 5 comments

Hi MosaicML team,

Many thanks for releasing the code and models for MosaicBERT! I highly appreciate the effort you put into modernizing the BERT architecture.

I am interested in pretraining MosaicBERT myself, so I have some questions :)

  • I am interested in the pretraining configuration for the model with 512 sequence length. Additionally, do you have hardware recommendations and an approximate time to pretrain MosaicBERT at 512 seq. length? Did you use the phase 1 + phase 2 "trick" from the original BERT recipe, i.e. pretraining at 128 seq. length and then training for fewer steps at 512? That way, the existing MosaicBERT with 128 seq. length could be "recycled" (a rough sketch of what I have in mind follows this list).
  • I'm also interested in which implementation is recommended, e.g. a tagged/specific commit or the upcoming PR #440.
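
Just to make the "recycling" question concrete, something like the following is what I have in mind, using Composer's `Trainer`. This is only a sketch: `build_model`, `build_dataloader`, and the checkpoint path are placeholders, not actual names from this repo.

```python
# Hypothetical phase 1 -> phase 2 continuation sketch (helper names and paths are placeholders).
from composer import Trainer

# Phase 2: reuse the weights of a model pretrained at seq. length 128
# and continue MLM pretraining on sequences of length 512.
model = build_model(max_seq_len=512)                  # placeholder helper
train_dataloader = build_dataloader(max_seq_len=512)  # placeholder helper

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="20000ba",                                    # fewer steps than phase 1
    load_path="mosaic-bert-base-seqlen-128/latest-rank0.pt",   # placeholder checkpoint path
    load_weights_only=True,   # keep the phase 1 weights, reset optimizer/schedule state
)
trainer.fit()
```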

Many thanks in advance!

Stefan

stefan-it avatar Jan 03 '24 11:01 stefan-it

@stefan-it - I tried the commit on main, ran into a number of errors, and was pointed to #440, so I am planning to base my work on that unless I hear otherwise.

Taytay avatar Jan 04 '24 17:01 Taytay

Hi @stefan-it, we did not experiment with training at sequence length 128 and then switching to 512 (as in the original BERT paper by Devlin et al., 2018). In our experiments, training MosaicBERT-Base at sequence length 512 with batch size 4096 for 70,000 steps took roughly 30 hours on 8 A100 80 GB GPUs (see below).

It might take us a few more days to merge the FA2 PR #440, but do let us know if you run into any issues!

[image attachment]
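
For reference, here is a minimal sketch of how those hyperparameters might be expressed as config overrides. The key names (`max_seq_len`, `global_train_batch_size`, `max_duration`) and the YAML filename are assumptions modeled on typical MosaicML training configs, not copied from a specific file in this repo.

```python
# Illustrative overrides for the run described above (key names and the config
# filename are assumptions, not verified against this repo).
from omegaconf import OmegaConf

overrides = OmegaConf.create({
    "max_seq_len": 512,               # pretrain directly at 512, no 128 -> 512 phase
    "global_train_batch_size": 4096,  # tokens per step = 4096 * 512
    "max_duration": "70000ba",        # 70,000 optimizer steps ("batches")
    # Observed wall-clock in the run above: ~30 h on 8x A100 80 GB.
})

base = OmegaConf.load("mosaic-bert-base-uncased.yaml")  # placeholder config path
cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg))
```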

jacobfulano avatar Jan 05 '24 20:01 jacobfulano

Hi @jacobfulano, do you also have an estimate for how long it will take to pre-train MosaicBERT-Large on a sequence length of 512 with batch size 4096 for 70,000 steps?

mmarius avatar Jan 06 '24 10:01 mmarius

Hi @mmarius, we did not specifically train MosaicBERT-Large at sequence length 512 with batch size 4096 for 70,000 steps. However, my estimate would be roughly 4x the time it takes to train MosaicBERT-Large at sequence length 128 with batch size 4096 for 70,000 steps (~27.2 hours), so roughly 108 hours on 8 A100 80 GB GPUs.
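
Spelled out, the back-of-the-envelope estimate assumes wall-clock time scales roughly linearly with tokens processed at a fixed batch size and step count:

```python
# Rough scaling estimate: 4x longer sequences -> ~4x wall-clock time,
# assuming time scales roughly linearly with tokens processed.
hours_large_seq128 = 27.2    # measured: MosaicBERT-Large, seq len 128, 70,000 steps
seq_len_ratio = 512 / 128    # 4x more tokens per sample
estimated_hours_seq512 = hours_large_seq128 * seq_len_ratio
print(f"~{estimated_hours_seq512:.1f} hours on 8x A100 80 GB")  # ~108.8 hours, i.e. roughly 108
```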

jacobfulano avatar Jan 08 '24 22:01 jacobfulano

If you are going any larger than that, I would recommend looking at mosaicml/llm-foundry, which should have support for training encoders/embedding models soon.

jacobfulano avatar Jan 08 '24 22:01 jacobfulano