dolma
Inquiry about Web Pipeline Availability
I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research" and I am very interested in exploring it further. However, it seems that the pipeline is still in preparation. Is there any information on when the "Web Pipeline" might be released for public use?
Hi @codefly13 - all of it is already available in the dolma toolkit (i.e. this repo). Please let me know if you're looking for something different.
@dumitrac I'm interested in this as well. I'd like to use the Dolma toolkit to filter Common Crawl (CC) data, which I assume is what @codefly13 was trying to do too. However, I don't see an example of how to do this in the repo, and the following pipeline is marked as WIP: https://github.com/allenai/dolma/tree/main/sources/cc_warc
I'm very new to Dolma so there's a good chance I'm just missing something. Would appreciate some pointers. Thanks!