
Support fetching from kaggle dataset

Open Randl opened this issue 3 years ago • 5 comments

This pull request adds support for fetching metadata and PDFs from the Kaggle arXiv dataset, which avoids overloading the arXiv API and allows a faster bootstrap. The PDFs can be listed with gsutil ls gs://arxiv-dataset/arxiv/arxiv/pdf and downloaded from the same bucket (>2 TB of data in total).

Additional changes handle subfolders in the pdf folder and speed up processing when the db holds a large number of entries.
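
As a rough illustration of the subfolder handling, here is a minimal sketch that walks a nested pdf folder. It is not the code from this pull request; the helper name iter_pdf_paths and the folder layout are assumptions made for the example.

```python
import os


def iter_pdf_paths(pdf_root):
    """Yield the path of every PDF under pdf_root, including nested subfolders.

    Hypothetical helper for illustration only, not part of arxiv-sanity-preserver.
    """
    for dirpath, _dirnames, filenames in os.walk(pdf_root):
        for name in sorted(filenames):
            # Skip anything that is not a PDF (notes, partial downloads, etc.).
            if name.lower().endswith(".pdf"):
                yield os.path.join(dirpath, name)
```

Because this is a generator, the caller can process files one at a time instead of building a list of every path first, which matters when the folder holds the full dataset.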

Randl avatar May 18 '21 09:05 Randl

@Randl

Question: Does this support newer arxiv papers that are not in the Kaggle dataset?

Devetec avatar Jul 05 '21 03:07 Devetec

The standard fetching is still available; you'll just need to download only the newest couple of hundred papers from the API.

Randl avatar Jul 05 '21 04:07 Randl

This is interesting, I didn't see this earlier and developed something on my own too. But what you have done is definitely a lot neater.

My only immediate concern while developing it was that the fields were slightly different and I needed some transformations. As much as possible, I fitted it into the db.p dictionary structure to minimise any downstream bugs.
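
To give an idea of the kind of transformation I mean, here is a minimal sketch, not my actual code. It assumes the Kaggle metadata is a JSON-lines file whose records carry id, title, abstract, categories, versions, authors_parsed and update_date, and that db.p entries use feedparser-style keys such as _rawid, _version, summary, authors and tags. The function names are made up for the example.

```python
import json


def kaggle_record_to_entry(record):
    """Map one Kaggle metadata record to a db.p-style entry (assumed layout)."""
    versions = record.get("versions") or []
    # Version labels are assumed to look like "v1", "v2", ...; fall back to v1.
    latest = versions[-1]["version"] if versions else "v1"
    authors = []
    # authors_parsed is assumed to hold [last, first, suffix] triples.
    for last, first, suffix in record.get("authors_parsed", []):
        name = " ".join(part for part in (first, last, suffix) if part)
        authors.append({"name": name})
    return {
        "_rawid": record["id"],
        "_version": int(latest.lstrip("v")),
        "title": record["title"].strip(),
        "summary": record["abstract"].strip(),
        "authors": authors,
        # Categories are assumed to be a single space-separated string.
        "tags": [{"term": term} for term in record["categories"].split()],
        "updated": record.get("update_date"),
    }


def load_entries(lines):
    """Build a dict keyed by raw arXiv id from an iterable of JSON lines."""
    db = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = kaggle_record_to_entry(json.loads(line))
        db[entry["_rawid"]] = entry
    return db
```

Passing an open file handle to load_entries reads the snapshot one line at a time, so the whole file never has to sit in memory at once.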

I'm looking forward to a review of whether this can become part of the main branch. The rate limits and blocks on arXiv are quite painful.

lance10t avatar Jul 05 '21 07:07 lance10t

It seems that @karpathy is rewriting it (from scratch?): https://www.reddit.com/r/MachineLearning/comments/obne9p/d_is_arxivsanity_down_what_people_use_these_days/h3q422o Hopefully this will be one of the features of the new version.

Randl avatar Jul 05 '21 08:07 Randl

This is a great initiative. One of the best projects I've come across that solves a practical need.

lance10t avatar Jul 05 '21 08:07 lance10t