DCLM: dedup, model-based filtering, eval
Context: https://arxiv.org/pdf/2406.11794 and https://www.datacomp.ai/dclm/
DCLM download is covered. Next steps are:
- [ ] research how they do deduplication
- [ ] apply model-based filtering
- [ ] find a quick way to do eval. The loop filter -> eval -> create a better model for filtering -> eval is the key here, I think: a quick way to run some operations on a data sample and see if that helps. (Alternatively, apply them at scale, at least on the small size, and make sure it is fast enough.)
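
To make the filter -> eval loop above concrete, here is a minimal sketch in plain Python of one iteration on a data sample. Everything in it is an assumption for illustration, not the DCLM pipeline or the datachain API: the record shape (`{"id", "text"}`), the helper names, exact-hash dedup (a stand-in until the actual dedup method is researched), and `toy_quality_score` (a placeholder for a real model-based scorer). The eval step is passed in as a callable so it can be swapped for a real benchmark later.

```python
import hashlib
import random
from typing import Callable

# Hypothetical record shape for this sketch: {"id": str, "text": str}
Doc = dict


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return " ".join(text.lower().split())


def dedup_exact(docs: list[Doc]) -> list[Doc]:
    """Exact dedup on normalized text (a stand-in, not DCLM's method)."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        key = hashlib.sha1(normalize(doc["text"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def toy_quality_score(text: str) -> float:
    """Placeholder for a model-based score: share of purely alphabetic words."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w.isalpha() for w in words) / len(words)


def filter_by_score(
    docs: list[Doc], score_fn: Callable[[str], float], threshold: float
) -> list[Doc]:
    """Keep documents whose score is at or above the threshold."""
    return [d for d in docs if score_fn(d["text"]) >= threshold]


def run_iteration(
    docs: list[Doc],
    score_fn: Callable[[str], float],
    threshold: float,
    eval_fn: Callable[[list[Doc]], float],
    sample_size: int,
    seed: int = 0,
) -> dict:
    """One pass of sample -> dedup -> filter -> eval, returning counts and the eval result."""
    rng = random.Random(seed)  # fixed seed so runs are comparable
    sample = rng.sample(docs, min(sample_size, len(docs)))
    deduped = dedup_exact(sample)
    kept = filter_by_score(deduped, score_fn, threshold)
    return {
        "sampled": len(sample),
        "deduped": len(deduped),
        "kept": len(kept),
        "eval": eval_fn(kept),
    }


if __name__ == "__main__":
    docs = [
        {"id": "a", "text": "Hello world"},
        {"id": "b", "text": "hello   WORLD"},
        {"id": "c", "text": "1234 5678 $$$"},
    ]
    # eval_fn here is a trivial placeholder; a real one would train and benchmark a small model.
    print(run_iteration(docs, toy_quality_score, 0.5, eval_fn=len, sample_size=10))
```

Keeping `score_fn` and `eval_fn` as parameters means each round of "create a better filtering model" only swaps one argument, and the fixed seed keeps the sample identical across rounds so the eval numbers stay comparable.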