arxiv-sanity-preserver
Support fetching from kaggle dataset
@Randl
This pull request adds support for fetching metadata and PDFs from the Kaggle arXiv dataset, which avoids overloading the arXiv API and allows a faster bootstrap.
gsutil ls gs://arxiv-dataset/arxiv/arxiv/pdf
lists the PDFs in the dataset's bucket; all of them (>2 TB of data) can then be downloaded, for example with gsutil cp.
Additional changes handle subfolders in the PDF folder and speed up processing for a large number of db entries.
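As a rough illustration of the subfolder handling mentioned above, the sketch below walks a PDF directory recursively instead of assuming a flat layout. The helper name iter_pdfs and the example directory layout are hypothetical and are not taken from the pull request's code.

```python
from pathlib import Path
from typing import Iterator


def iter_pdfs(pdf_root: str) -> Iterator[Path]:
    """Yield every .pdf file under pdf_root, including nested subfolders.

    Hypothetical helper for illustration; the pull request may do this differently.
    """
    root = Path(pdf_root)
    # rglob descends into subdirectories, e.g. pdf/2003/2003.00001v1.pdf
    for path in sorted(root.rglob("*.pdf")):
        if path.is_file():
            yield path
```

Calling list(iter_pdfs("pdf")) would then return the PDFs from both the top-level folder and any per-month subfolders.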
Question: Does this support newer arxiv papers that are not in the Kaggle dataset?
The standard fetching is still available; you'll just need to download only the couple of hundred newest papers from the API.
This is interesting, I didn't see this earlier and developed something on my own too. But what you have done is definitely a lot neater.
My only immediate concern while developing it was that the fields were slightly different and I needed some transformations. As much as possible, I fitted it into the db.p dictionary structure to minimise any downstream bugs.
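To illustrate the kind of transformation described above, here is a minimal sketch and not the code from either implementation. It assumes the Kaggle metadata records carry the fields id, title, abstract, categories, versions and authors_parsed (with each author stored as [last, first, suffix]), and that the db.p entries follow a feedparser-like layout with _rawid, _version, title, summary, authors and tags. The function name kaggle_to_entry is hypothetical.

```python
from typing import Any, Dict


def kaggle_to_entry(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one Kaggle metadata record to a feedparser-like entry (assumed layout)."""
    # Fall back to a single placeholder version if the record has none.
    versions = record.get("versions") or [{"version": "v1", "created": ""}]
    latest = versions[-1]
    authors = [
        # Assumption: authors_parsed items are [last, first, suffix].
        {"name": f"{parts[1]} {parts[0]}".strip()}
        for parts in record.get("authors_parsed", [])
    ]
    return {
        "_rawid": record["id"],
        "_version": int(latest["version"].lstrip("v")),
        # Collapse line breaks and repeated spaces inside the title.
        "title": " ".join(record["title"].split()),
        "summary": record["abstract"].strip(),
        "authors": authors,
        "tags": [{"term": c} for c in record["categories"].split()],
        # Dates are copied as-is; converting them to the API's format is left out.
        "published": versions[0]["created"],
        "updated": latest["created"],
    }
```

Keeping the output keyed the same way as the API-derived entries is what lets the downstream scripts run unchanged; any field the downstream code reads but this sketch omits would need to be added.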
But I'm looking forward to the review and to seeing whether this can become part of the main branch. The rate limiters and blocks on arXiv are quite painful.
It seems that @karpathy is rewriting it (from scratch?): https://www.reddit.com/r/MachineLearning/comments/obne9p/d_is_arxivsanity_down_what_people_use_these_days/h3q422o Hopefully this will be one of the features of the new version.
This is a great initiative. One of the best projects I've come across that solves a practical need.