RedPajama-Data
Specifying arxiv dates
Hi there,
Thanks for making this code available. I am trying to use the arxiv downloader, but I am only interested in downloading papers from a certain date range. Any tips on how to approach this?
Many thanks
Hi @matthieumeeus
The arxiv data on their S3 bucket follows the format arXiv_src_<month>_<chunk>.tar (e.g., arXiv_src_1206_004.tar corresponds to chunk 4 of the month 2012-06). If month-level granularity is fine enough, you can run e.g.
python run_download.py --aws_config aws_config.ini --workers 1 --target_dir $DATA_DIR --setup
which will produce a file with all the listings in it. You can either filter this generated file, or directly change the code here to only include the yymm tags that you are interested in.
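
For the first option, a minimal sketch of filtering the listings by date range could look like the following. It is not part of the repository: the function names are made up for illustration, and it assumes the generated file has one entry per line containing a name of the form arXiv_src_<yymm>_<chunk>.tar, so please check that against the file you actually get. It also assumes that two-digit years from 91 upwards mean 19xx and everything else means 20xx.

```python
import re

# Matches names such as arXiv_src_1206_004.tar and captures yy, mm and the chunk.
_CHUNK_RE = re.compile(r"arXiv_src_(\d{2})(\d{2})_(\d+)\.tar$")


def yymm_to_year_month(yy, mm):
    """Turn a two-digit year and a month into a sortable (year, month) tuple.

    Assumption: yy >= 91 is read as 19yy, anything lower as 20yy.
    """
    year = 1900 + yy if yy >= 91 else 2000 + yy
    return (year, mm)


def filter_listing(lines, start, end):
    """Keep entries whose yymm tag lies in [start, end] (inclusive).

    `start` and `end` are (year, month) tuples, e.g. (2012, 6).
    Lines that do not end in an arXiv_src_*.tar name are skipped.
    """
    kept = []
    for line in lines:
        entry = line.strip()
        match = _CHUNK_RE.search(entry)
        if match is None:
            continue
        year_month = yymm_to_year_month(int(match.group(1)), int(match.group(2)))
        if start <= year_month <= end:
            kept.append(entry)
    return kept
```

You would read the generated listings file into a list of lines, call e.g. filter_listing(lines, (2012, 1), (2012, 6)), and write the result back out before running the download.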
I hope this helps!