ray
[Data] Add WarcDatasource for reading WARC/ARC files
Description
Add a Datasource for reading data from WARC/ARC files.
Use case
For cleaning LLM pre-training data, Ray Data is nearly the only distributed solution (Dask appears to be less suitable for this task). However, I noticed that Ray Data doesn't have a convenient way to read data from Common Crawl. Adding a Datasource for reading WARC/ARC data would be helpful.
As a reference, I extracted the WARC-reading code from Hugging Face's newly released DataTrove framework and wrapped it in a Datasource.
Unfortunately, I don't have the time to handle the integration and CI pipeline.
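To make the request concrete, here is a minimal sketch of the per-record parsing that such a Datasource would wrap. It is not the DataTrove code and not the code in the linked PR. It uses only the Python standard library, the names `iter_warc_records` and `parse_warc_bytes` are hypothetical, and the output columns are just one possible row layout. It handles plain and gzip-compressed WARC only (no ARC). A real implementation would more likely rely on a dedicated WARC library such as `warcio`.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator, List, Union

Row = Dict[str, Union[str, bytes]]


def iter_warc_records(stream: BinaryIO) -> Iterator[Row]:
    """Yield one row per WARC record from an uncompressed binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The record body is exactly Content-Length bytes long.
        length = int(headers.get("content-length", "0"))
        payload = stream.read(length)

        yield {
            "warc_type": headers.get("warc-type", ""),
            "url": headers.get("warc-target-uri", ""),
            "date": headers.get("warc-date", ""),
            "record_id": headers.get("warc-record-id", ""),
            "content": payload,
        }


def parse_warc_bytes(data: bytes) -> List[Row]:
    """Parse an in-memory WARC file, decompressing it first if it is gzipped."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        # gzip.decompress also handles multi-member (per-record) gzip files.
        data = gzip.decompress(data)
    return list(iter_warc_records(io.BytesIO(data)))
```

Until a dedicated Datasource exists, a function like this could probably be combined with the existing `ray.data.read_binary_files(...)` and `flat_map(...)`, passing each row's `"bytes"` value to the parser. That approach loads each file fully into memory, which is one reason a streaming Datasource would be preferable for Common Crawl-sized archives.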
#45536