
DCLM: dedup, model-based filtering, eval

Open shcheklein opened this issue 1 year ago • 0 comments

Context: https://arxiv.org/pdf/2406.11794 and https://www.datacomp.ai/dclm/

DCLM download is covered. Next steps are:

  • [ ] research how they do deduplication
  • [ ] apply model-based filtering
  • [ ] find a quick way to run the eval loop: filter -> eval -> build a better filtering model -> eval. I think that is the key here: a quick way to run some operations on a data sample and see whether they help. (Alternatively, apply them at scale, at least at the small size, and make sure that is fast enough.)
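
A rough sketch of what the sample-level loop could look like, in plain Python with no datachain calls. Everything here is an assumption made for illustration: `dedup_exact`, `filter_top_fraction`, `toy_score` and `toy_eval` are hypothetical names, exact-hash dedup is a stand-in rather than whatever DCLM actually does, and the scoring function is a placeholder for a real quality classifier.

```python
import hashlib
from typing import Callable, List


def dedup_exact(docs: List[str]) -> List[str]:
    """Drop docs whose case- and whitespace-normalized text was already seen.

    Stand-in only: exact-match dedup, not DCLM's method.
    """
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def filter_top_fraction(
    docs: List[str], score: Callable[[str], float], keep: float = 0.5
) -> List[str]:
    """Keep the highest-scoring `keep` fraction of docs (at least one)."""
    if not docs:
        return []
    ranked = sorted(docs, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]


def toy_score(doc: str) -> float:
    """Placeholder quality score: share of distinct words in the doc."""
    words = doc.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def toy_eval(docs: List[str]) -> float:
    """Placeholder eval: mean score of the kept docs (a real eval would train and benchmark)."""
    return sum(toy_score(d) for d in docs) / len(docs) if docs else 0.0


sample = [
    "The quick brown fox jumps over the lazy dog",
    "the quick  brown fox jumps over the lazy dog",  # near-identical copy
    "buy buy buy buy now now now",
    "Bloom filters trade memory for a small false positive rate",
]

unique = dedup_exact(sample)                               # dedup
kept = filter_top_fraction(unique, toy_score, keep=0.5)    # filter
print(len(sample), len(unique), len(kept))
print(toy_eval(unique), toy_eval(kept))                    # eval before / after
```

The point is the shape of the loop, not the components: swap `toy_score` for a trained classifier and `toy_eval` for a real benchmark, rerun on the same sample, and compare.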

shcheklein, Nov 29 '24 00:11