
Arxiv: Research papers

Open SkanderHellal opened this issue 1 year ago • 3 comments

I would like to contribute to the project by extracting data from Arxiv.

I would like to extract titles and abstracts, or other metadata that might be helpful.

I think extracting the full text of each paper is not straightforward: we cannot control the text length, and we would need OCR or other techniques to get the text out. Therefore, I would like to start with extracting titles and abstracts, and then consider whether figures and tables can also be extracted in a further step.
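A minimal sketch of the title-and-abstract step, assuming the public arXiv API (`http://export.arxiv.org/api/query`) and its Atom response format. The helper names `parse_feed` and `fetch`, and the returned field names, are hypothetical and only meant to illustrate the idea; nothing here is an agreed design for the project.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import Dict, List

# Assumption: the arXiv API returns an Atom feed in this namespace.
ATOM = "{http://www.w3.org/2005/Atom}"
API_URL = "http://export.arxiv.org/api/query"


def _text(entry: ET.Element, tag: str) -> str:
    # Collapse the hard line wraps that appear inside text fields.
    return " ".join((entry.findtext(ATOM + tag) or "").split())


def parse_feed(atom_xml) -> List[Dict[str, str]]:
    """Extract id, title and abstract from an Atom feed (str or bytes)."""
    root = ET.fromstring(atom_xml)
    return [
        {
            "id": _text(entry, "id"),
            "title": _text(entry, "title"),
            "abstract": _text(entry, "summary"),
        }
        for entry in root.iter(ATOM + "entry")
    ]


def fetch(query: str, max_results: int = 5) -> List[Dict[str, str]]:
    """Query the API (network access required) and parse the result."""
    params = urllib.parse.urlencode(
        {"search_query": query, "start": 0, "max_results": max_results}
    )
    with urllib.request.urlopen(API_URL + "?" + params, timeout=30) as resp:
        return parse_feed(resp.read())
```

For example, `fetch("cat:cs.CL")` would return a handful of records from one category. Because the abstract comes back as plain text, no OCR is needed for this step.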

What do you think of the approach?

SkanderHellal avatar Mar 15 '23 11:03 SkanderHellal

I think this might be a good idea. Here's something I put together for OCRing random PDF files for Mick last week:

https://github.com/bitplane/ocr-pdf

It's a bit raw and hasn't been used in anger, but it's for a similar idea. Feel free to use the code :)

bitplane avatar Mar 15 '23 18:03 bitplane

This might be a possible duplicate: https://github.com/LAION-AI/Open-Assistant/issues/1927

Also note that tools like pdfplumber or textract can be used for this task.

CloseChoice avatar Mar 26 '23 19:03 CloseChoice

Related: https://www.kaggle.com/datasets/Cornell-University/arxiv
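If the dataset is used instead of scraping, titles and abstracts can be streamed straight from the metadata file. This is a rough sketch, assuming the download is a JSON Lines file (one JSON object per line) with `id`, `title` and `abstract` keys; the function name `iter_records` and the file name in the usage note are illustrative assumptions, not something confirmed in this thread.

```python
import json
from typing import Dict, Iterator


def iter_records(path: str) -> Iterator[Dict[str, str]]:
    """Yield id, title and abstract from a JSON Lines metadata file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            yield {
                "id": row.get("id", ""),
                # Collapse line wraps and repeated spaces in text fields.
                "title": " ".join(row.get("title", "").split()),
                "abstract": " ".join(row.get("abstract", "").split()),
            }
```

Reading line by line keeps memory use flat even for a multi-gigabyte file, e.g. `for rec in iter_records("arxiv-metadata-oai-snapshot.json"): ...`.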

Miserlou avatar Apr 12 '23 21:04 Miserlou