
MosaicBERT: pretraining configuration for models > 128 seq. length

Open stefan-it opened this issue 1 year ago • 5 comments

Hi MosaicML team,

Many thanks for releasing the code and models for MosaicBERT! I highly appreciate the effort you put into modernizing the BERT architecture.

I am interested in pretraining MosaicBERT myself, so I have some questions :)

  • I am interested in the pretraining configuration for the model with 512 sequence length. Additionally, do you have hardware recommendations and an approximate time to pretrain MosaicBERT at 512 seq. length? Did you use the phase 1 + phase 2 "trick" from the original BERT recipe, i.e. pretraining at 128 seq. length and then training for fewer steps at 512? That way, the existing MosaicBERT with 128 seq. length could be "recycled" (a rough sketch of what I have in mind follows this list).
  • I'm also interested in which implementation is recommended, e.g. a tagged/specific commit or the upcoming PR #440.
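
Just to make the "recycling" question concrete, something like the following is what I have in mind, using Composer's `Trainer`. This is only a sketch: `build_model`, `build_dataloader`, and the checkpoint path are placeholders, not actual names from this repo.

```python
# Hypothetical phase 1 -> phase 2 continuation sketch (helper names and paths are placeholders).
from composer import Trainer

# Phase 2: reuse the weights of a model pretrained at seq. length 128
# and continue MLM pretraining on sequences of length 512.
model = build_model(max_seq_len=512)                  # placeholder helper
train_dataloader = build_dataloader(max_seq_len=512)  # placeholder helper

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="20000ba",                                    # fewer steps than phase 1
    load_path="mosaic-bert-base-seqlen-128/latest-rank0.pt",   # placeholder checkpoint path
    load_weights_only=True,   # keep the phase 1 weights, reset optimizer/schedule state
)
trainer.fit()
```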

Many thanks in advance!

Stefan

stefan-it avatar Jan 03 '24 11:01 stefan-it

@stefan-it - I tried the commit on main, ran into a number of errors, and was pointed to #440, so I am planning to base my work on that unless I hear otherwise.

Taytay avatar Jan 04 '24 17:01 Taytay

Hi @stefan-it, we did not experiment with training at sequence length 128 and then switching to 512 (as in the original BERT paper by Devlin et al., 2018). In our experiments, training MosaicBERT-Base at sequence length 512 with batch size 4096 for 70,000 steps took roughly 30 hours on 8 A100 80 GB GPUs (see below).

It might take us a few more days to merge the FA2 PR #440, but do let us know if you run into any issues!

[image attachment]
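
For reference, here is a minimal sketch of how those hyperparameters might be expressed as config overrides. The key names (`max_seq_len`, `global_train_batch_size`, `max_duration`) and the YAML filename are assumptions modeled on typical MosaicML training configs, not copied from a specific file in this repo.

```python
# Illustrative overrides for the run described above (key names and the config
# filename are assumptions, not verified against this repo).
from omegaconf import OmegaConf

overrides = OmegaConf.create({
    "max_seq_len": 512,               # pretrain directly at 512, no 128 -> 512 phase
    "global_train_batch_size": 4096,  # tokens per step = 4096 * 512
    "max_duration": "70000ba",        # 70,000 optimizer steps ("batches")
    # Observed wall-clock in the run above: ~30 h on 8x A100 80 GB.
})

base = OmegaConf.load("mosaic-bert-base-uncased.yaml")  # placeholder config path
cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg))
```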

jacobfulano avatar Jan 05 '24 20:01 jacobfulano

Hi @jacobfulano, do you also have an estimate for how long it will take to pre-train MosaicBERT-Large on a sequence length of 512 with batch size 4096 for 70,000 steps?

mmarius avatar Jan 06 '24 10:01 mmarius

Hi @mmarius, we did not specifically train MosaicBERT-Large at sequence length 512 with batch size 4096 for 70,000 steps. However, my estimate would be roughly 4x the time it takes to train MosaicBERT-Large at sequence length 128 with batch size 4096 for 70,000 steps (~27.2 hours), so roughly 108 hours on 8 A100 80 GB GPUs.
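
Spelled out, the back-of-the-envelope estimate assumes wall-clock time scales roughly linearly with tokens processed at a fixed batch size and step count:

```python
# Rough scaling estimate: 4x longer sequences -> ~4x wall-clock time,
# assuming time scales roughly linearly with tokens processed.
hours_large_seq128 = 27.2    # measured: MosaicBERT-Large, seq len 128, 70,000 steps
seq_len_ratio = 512 / 128    # 4x more tokens per sample
estimated_hours_seq512 = hours_large_seq128 * seq_len_ratio
print(f"~{estimated_hours_seq512:.1f} hours on 8x A100 80 GB")  # ~108.8 hours, i.e. roughly 108
```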

jacobfulano avatar Jan 08 '24 22:01 jacobfulano

If you are going any larger than that, I would recommend looking at mosaicml/llm-foundry, which should have support for training encoders/embedding models soon.

jacobfulano avatar Jan 08 '24 22:01 jacobfulano