Is there explicitly instruction-following data in the version of Dolma used to train Olmo v1?
Hi everyone,
I'm working on a research project relating to instruction following, and it would be amazing to have a language model with a guarantee that no explicitly instruction-following data (e.g., from LIMA, Alpaca, etc.) was used during pretraining.
Some thoughts:
- I don't have the disk space to build Dolma. Alas!
- I realize that in Dolma v1.7, FLAN is explicitly included, so that's out.
- I say "explicitly" instruction-following data because lots of naturally occurring web data has instruction-response-like formats (Stack Overflow, etc.) -- that's fine; I'm just worried about the increasingly common practice of mixing explicit "instruction following SFT data" into the pretraining mix.
- I know there's an n-gram viewer at https://wimbd.apps.allenai.org/about, and it says the TULU-style `<|assistant|>` n-gram shows up around 20M times, but it returns the identical count for `assistant` without the formatting, so I imagine the viewer strips the formatting and this check isn't useful (rough sketch of what I'd check locally is below).
I realize data can leak in, so the answer probably isn't "definitely not," but does anyone know if the answer is at least "not intentionally"?
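Partly related to the disk-space point above: here's a rough, untested sketch of the kind of check I'd like to run myself, streaming one shard at a time rather than building the whole dataset. The shard URL is a placeholder, and the marker list (beyond the `<|assistant|>` token mentioned above) is just my guess at what explicit SFT formatting would look like -- nothing here is confirmed to be in Dolma.

```python
import gzip

import requests

# Placeholder shard URL -- substitute a real one from the official Dolma URL
# lists (e.g., the file lists linked from the allenai/dolma repo).
SHARD_URL = "https://example.com/dolma/some-shard.json.gz"

# `<|assistant|>` is the TULU-style token; the others are my guesses at common
# SFT template strings, not anything confirmed to be in the dataset.
MARKERS = ["<|assistant|>", "<|user|>", "### Instruction:", "### Response:"]


def count_markers(url: str) -> dict:
    """Stream a gzipped JSON Lines shard and count exact marker occurrences."""
    counts = {m: 0 for m in MARKERS}
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        # resp.raw yields the still-gzipped bytes; decompress on the fly so
        # nothing has to be written to disk.
        with gzip.GzipFile(fileobj=resp.raw) as gz:
            for raw_line in gz:  # one JSON document per line
                line = raw_line.decode("utf-8", errors="replace")
                for marker in MARKERS:
                    # Counting on the raw JSON line is good enough here, since
                    # none of these markers get escaped by standard JSON.
                    counts[marker] += line.count(marker)
    return counts


if __name__ == "__main__":
    for marker, n in count_markers(SHARD_URL).items():
        print(f"{marker!r}: {n}")
```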
See the corresponding OLMo issue; I wasn't sure how much information sharing there would be between the two repos: https://github.com/allenai/OLMo/issues/658
Thanks!