
arxiv math effort

Open futurisold opened this issue 2 years ago • 4 comments

The idea is to start a data collection effort on arXiv for mathematics. We should be able to parse the papers for equations and proofs. I don't yet know how feasible it is to ground the dataset in formal mathematics (e.g. CoC, Lean), which would address limitations that Minerva had. We should also debate data structures and data formats: I'm inclined towards storing both LaTeX and natural language (where applicable).
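To make "parse the papers for equations and proofs" a bit more concrete, here is a minimal sketch of how that might look when the LaTeX source is available. Everything in it is an assumption for illustration: the function name `extract_environments`, the list of environments in `MATH_ENVS`, and the sample text are all hypothetical, and a plain regex like this does not handle nested environments of the same name, custom macros, or inline `$...$` math.

```python
import re

# Hypothetical choice of display-math environments to look for.
MATH_ENVS = ("equation", "align", "gather", "multline")


def extract_environments(tex: str, names) -> list[tuple[str, str]]:
    """Return (environment name, body) pairs in order of appearance.

    Starred variants such as equation* are matched as well.
    """
    alternatives = "|".join(re.escape(n) for n in names)
    pattern = re.compile(
        r"\\begin\{(?P<name>(?:" + alternatives + r")\*?)\}"
        r"(?P<body>.*?)"                 # non-greedy: stop at the first matching \end
        r"\\end\{(?P=name)\}",
        re.DOTALL,                       # bodies usually span several lines
    )
    return [(m.group("name"), m.group("body").strip()) for m in pattern.finditer(tex)]


# Made-up sample input, not taken from any real paper.
sample = r"""
\begin{theorem}For all $n$, $n^2 \ge 0$.\end{theorem}
\begin{proof}
Squares of real numbers are non-negative.
\end{proof}
\begin{equation*}
a^2 + b^2 = c^2
\end{equation*}
"""

found = extract_environments(sample, MATH_ENVS + ("proof",))
for name, body in found:
    print(f"{name}: {body}")
```

Keeping the environment name next to each body means proofs and equations can later be stored or filtered separately, whatever format we settle on.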

futurisold avatar Feb 05 '23 00:02 futurisold

@yk, please move this into backlog with the relevant labels and everything if you consider it useful.

futurisold avatar Feb 05 '23 00:02 futurisold

This is definitely a useful approach. YK or someone else with access will likely add it to the backlog soon, but in the meantime feel free to look into it if you're interested!

olliestanley avatar Feb 05 '23 11:02 olliestanley

Hello,

Any ideas on which tools could be used to parse the papers? Are they good enough in terms of performance?

SkanderHellal avatar Feb 05 '23 12:02 SkanderHellal

@SkanderHellal, I'm not that concerned about performance. I'm more concerned about how we store the data. A classical data structure is an AST. However, it's not at all clear how we should map LaTeX to an AST. Another option is to follow the beaten track; what comes to mind is Meta's Galactica paper. Here's a relevant excerpt from the paper:

We use a modified version of the GROBID library for converting PDFs to text, as well as obtaining titles, authors and citations. Where mathematical LaTeX is available, for example in arXiv, we make sure to combine the GROBID results with LaTeX source to recover mathematical content. The final paper documents are stored in a markdown format, as opposed to full LaTeX. We use markdown as the standard format for all documents in the corpus to support knowledge blending between sources.
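For comparison with the markdown route, here is a rough sketch of what the AST option could look like for a single formula. It is only one possible mapping, not an established one: the `Node` class, the token pattern, and the rule that a command absorbs the brace groups that immediately follow it are all assumptions made for illustration. It ignores optional `[...]` arguments, environments, and sub/superscript structure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # "command", "group", or "text"
    value: str = ""                # command name or raw text
    children: list = field(default_factory=list)


# Commands (\frac), escaped single characters (\,), braces, or runs of other text.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[{}]|[^\\{}\s]+")


def parse(src: str) -> list:
    """Parse a LaTeX math string into a list of top-level nodes."""
    tokens = TOKEN.findall(src)
    pos = 0

    def parse_items(inside_group: bool) -> list:
        nonlocal pos
        items = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "}":
                if not inside_group:
                    raise ValueError("unmatched '}'")
                pos += 1           # consume the closing brace
                return items
            pos += 1
            if tok == "{":
                items.append(Node("group", children=parse_items(True)))
            elif tok.startswith("\\"):
                node = Node("command", tok[1:])
                # Assumption: brace groups right after a command are its arguments.
                while pos < len(tokens) and tokens[pos] == "{":
                    pos += 1
                    node.children.append(Node("group", children=parse_items(True)))
                items.append(node)
            else:
                items.append(Node("text", tok))
        if inside_group:
            raise ValueError("unmatched '{'")
        return items

    return parse_items(False)


tree = parse(r"\frac{a+b}{2} = x")
```

Here `tree[0]` is a `frac` command node with two group children, followed by text nodes for `=` and `x`. Even this toy version shows the difficulty: without knowing each macro's arity, the parser can only guess which groups belong to which command.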

futurisold avatar Feb 05 '23 15:02 futurisold

Closing old data issue.

andreaskoepf avatar Jun 14 '23 09:06 andreaskoepf