
Is there explicitly instruction-following data in the version of Dolma used to train OLMo v1?

Open john-hewitt opened this issue 7 months ago • 3 comments

Hi everyone,

I'm working on a research project on instruction following, and it would be amazing to have a language model with a guarantee that no explicitly instruction-following data (e.g., from LIMA, Alpaca, etc.) was used during pretraining.

Some thoughts:

  • I don't have the disk space to build Dolma. Alas!
  • I realize that in Dolma v1.7, FLAN is explicitly included, so that's out.
  • I say "explicitly" instruction-following data because lots of naturally occurring web data has instruction-response-like formats (Stack Overflow, etc.) -- that's fine; I'm just worried about the increasingly common practice of mixing explicit instruction-following SFT data into the pretraining process.
  • I know there's an n-gram viewer at https://wimbd.apps.allenai.org/about, and it says the TULU-style <|assistant|> n-gram shows up around 20M times, but it returns the identical count for "assistant" without the formatting, so I imagine it strips the special formatting before indexing -- which makes it unhelpful here (see the sketch after this list).
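
In case it's useful context, here's a minimal sketch of the kind of spot-check I have in mind: streaming a prefix of the corpus rather than downloading it, which sidesteps the disk-space issue above. The dataset ID `allenai/dolma`, the `text` field, and the marker list are my assumptions, not verified against the actual release (it may also need a config name like a version string):

```python
# Hypothetical sketch: stream part of Dolma from the HF Hub and count
# chat-template markers, without materializing the corpus on disk.
# The dataset ID "allenai/dolma" and the "text" field are assumptions
# about how the release is laid out; adjust to the real schema.
from datasets import load_dataset

# Strings that explicit instruction-following SFT data tends to carry verbatim.
MARKERS = ["<|assistant|>", "<|user|>", "### Instruction:", "### Response:"]

stream = load_dataset("allenai/dolma", split="train", streaming=True)

counts = {m: 0 for m in MARKERS}
for i, doc in enumerate(stream):
    text = doc.get("text", "")
    for m in MARKERS:
        if m in text:
            counts[m] += 1
    if i >= 1_000_000:  # spot-check a prefix, not the whole corpus
        break

print(counts)
```

Obviously a prefix scan like this can't prove absence, but nonzero counts on verbatim template strings would at least be a strong signal of leakage.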

I realize data can leak in, so the answer is probably not "definitely not," but does anyone know if the answer is at least "not intentionally"?

See the corresponding OLMo issue; I wasn't sure how much information sharing there would be between the two repos: https://github.com/allenai/OLMo/issues/658

Thanks!

john-hewitt · Jul 15 '24 23:07