Open-Assistant Arxiv: Research papers


Open SkanderHellal opened this issue 2 years ago • 3 comments

I would like to contribute to the project by extracting data from Arxiv.

I would like to extract titles and abstracts, plus any other metadata that might be helpful.

I think extracting the full text of each research paper is not straightforward: we cannot control the text length, and we would need OCR or other techniques to get the text out. Therefore, I would like to start by extracting titles and abstracts, and then consider whether we can also extract figures and tables in a further step.
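To make the first step concrete, here is a minimal sketch of how titles and abstracts could be pulled out. It assumes the data comes from the public arXiv API at `http://export.arxiv.org/api/query`, which returns an Atom feed; the issue itself does not fix a data source, so that choice is an assumption. The function names (`fetch_arxiv_feed`, `parse_arxiv_feed`) and the record layout are hypothetical, picked only for illustration.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: metadata is read from the public arXiv API, which returns Atom XML.
ARXIV_API_URL = "http://export.arxiv.org/api/query"
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def fetch_arxiv_feed(search_query, start=0, max_results=10):
    """Download one page of results as an Atom XML string (needs network access)."""
    params = urllib.parse.urlencode(
        {"search_query": search_query, "start": start, "max_results": max_results}
    )
    with urllib.request.urlopen(f"{ARXIV_API_URL}?{params}", timeout=30) as response:
        return response.read().decode("utf-8")


def _clean_text(entry, path):
    """Return the text at `path`, with line breaks and repeated spaces collapsed."""
    value = entry.findtext(path, default="", namespaces=ATOM_NS)
    return " ".join(value.split())


def parse_arxiv_feed(feed_xml):
    """Turn an Atom feed string into a list of metadata records (hypothetical layout)."""
    root = ET.fromstring(feed_xml)
    records = []
    for entry in root.findall("atom:entry", ATOM_NS):
        authors = [
            " ".join(name.text.split())
            for name in entry.findall("atom:author/atom:name", ATOM_NS)
            if name.text
        ]
        records.append(
            {
                "id": _clean_text(entry, "atom:id"),
                "title": _clean_text(entry, "atom:title"),
                "abstract": _clean_text(entry, "atom:summary"),
                "authors": authors,
            }
        )
    return records
```

For example, `parse_arxiv_feed(fetch_arxiv_feed("cat:cs.CL", max_results=5))` would return up to five records for the `cs.CL` category. Keeping the parser separate from the download step means it can be tested on saved feed files without hitting the API.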

What do you think of this approach?