ray
Add WarcDatasource for reading WARC/ARC files
This PR adds a WarcDatasource for reading WARC/ARC files, so that Ray Data can read files from Common Crawl.
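For reviewers unfamiliar with the format, here is a minimal sketch of what reading a WARC file involves. It is not the code in this PR and does not use any Ray API: `iter_warc_records` is a hypothetical helper written with the standard library only, and it assumes an uncompressed WARC stream in which each record is a version line, `Name: value` headers, a blank line, and a body of `Content-Length` bytes.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; not part of Ray or of this PR.
    """
    while True:
        # Skip the blank lines that separate records, then read the version line.
        line = stream.readline()
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected version line: {line!r}")

        # Read "Name: value" header lines up to the blank line that ends them.
        headers: Dict[str, str] = {}
        for raw in iter(stream.readline, b""):
            if raw in (b"\r\n", b"\n"):
                break
            name, _, value = raw.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The record body is exactly Content-Length bytes long.
        length = int(headers["Content-Length"])
        yield headers, stream.read(length)


# Two hand-written records standing in for a real WARC file.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: warcinfo\r\n"
    b"Content-Length: 4\r\n"
    b"\r\n"
    b"info"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, payload in records:
    print(headers["WARC-Type"], len(payload))
```

Common Crawl distributes its WARC files gzip-compressed; under the same assumptions, wrapping the file with `gzip.open(path, "rb")` before passing it to the helper should be enough to read them. ARC files use a different record header layout and are not covered by this sketch.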
Why are these changes needed?
When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution available (Dask appears to be less suitable for this task). However, Ray Data doesn't have a convenient way to access data from Common Crawl. Adding a Datasource for reading WARC/ARC data would fill that gap.
Related issue number
#45535