Add WarcDatasource for reading WARC/ARC files

Open ryan-minato opened this issue 1 year ago • 0 comments

This PR adds a WarcDatasource for reading WARC/ARC files, such as the archives published by Common Crawl.

Why are these changes needed?

When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution (Dask appears to be less suitable for this task). However, I noticed that Ray Data doesn't have a convenient way to access data from Common Crawl. Adding a Datasource for reading WARC/ARC data would be helpful.
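For context, here is a minimal sketch of what reading WARC records involves, using only the Python standard library. It is not the implementation in this PR: `iter_warc_records` is a hypothetical helper name, and the sketch assumes an uncompressed stream where every record carries a `Content-Length` header.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration only; not part of Ray Data.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separating consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Header block: "Name: value" lines terminated by an empty line.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Two tiny hand-written records, used only to exercise the parser.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: 3\r\n"
    b"\r\n"
    b"abc\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], payload)
```

Common Crawl distributes its WARC files gzip-compressed, so a real reader would first wrap the file in a decompressing stream (for example with `gzip.open`) and would typically rely on a dedicated WARC parsing library rather than hand-rolled parsing like this.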

Related issue number

#45535

ryan-minato · May 24 '24 02:05