
[Data] Add WarcDatasource for reading WARC/ARC files

Open ryan-minato opened this issue 1 year ago • 1 comment

Description

Add a Datasource for reading data from WARC/ARC files.
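
To make the request concrete, here is a minimal sketch of the record-level parsing such a Datasource would need. It is not the code from the referenced PR and not an existing Ray Data API: `iter_warc_records` and `_make_record` are hypothetical names, only the standard library is used, and only the basic WARC record layout (version line, headers, blank line, `Content-Length` bytes of content) is handled. ARC files and HTTP-level parsing are left out.

```python
import gzip
import io
from typing import BinaryIO, Dict, Iterator


def iter_warc_records(stream: BinaryIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per WARC record from a decompressed binary stream.

    Hypothetical helper for illustration; not part of Ray Data.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Header block: "Name: value" lines terminated by an empty line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", "0"))
        content = stream.read(length)
        yield {
            "warc_type": headers.get("WARC-Type"),
            "target_uri": headers.get("WARC-Target-URI"),
            "content": content,
        }


def _make_record(uri: str, payload: bytes) -> bytes:
    """Build a tiny synthetic WARC record for the demo below."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: resource\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + payload + b"\r\n\r\n"


# Two records, each gzipped separately and concatenated into one stream.
sample = gzip.compress(_make_record("http://example.com/a", b"hello"))
sample += gzip.compress(_make_record("http://example.com/b", b"wor\r\nld"))

with gzip.open(io.BytesIO(sample), "rb") as f:
    records = list(iter_warc_records(f))

for record in records:
    print(record["warc_type"], record["target_uri"], len(record["content"]))
```

Each yielded dict maps naturally onto one dataset row. A real Datasource would additionally need to plug into Ray Data's file-reading and parallelism machinery, which this sketch does not attempt.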

Use case

When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution (Dask appears to be less suitable for this task). However, I noticed that Ray Data doesn't have a convenient way to read data from Common Crawl, which publishes its crawls as WARC files. Adding a Datasource for reading WARC/ARC data would be helpful.

ryan-minato avatar May 24 '24 02:05 ryan-minato

I have extracted the WARC-reading part of Hugging Face's newly released DataTrove framework and built a Datasource around it as a reference.

Unfortunately, I don't have the time to handle the integration and CI pipeline.

#45536

ryan-minato avatar May 24 '24 02:05 ryan-minato