Bulk dump of ar5iv?
Hi,
TL;DR: is it possible to get a bulk dump of ar5iv?
I'm with Stanford's Center for Research on Foundation Models, and we're looking into improving the state of data cleaning for training large language models (and maybe vision-language models).
One of the ways we think open LLMs are behind closed ones is that the data preprocessing and cleaning is way behind what's done behind closed doors. One important source of data is arXiv, especially for scientific/math applications. Unfortunately, people typically just take the raw LaTeX and maybe do some very heuristic/incomplete things (for example, see the now fairly widely used RedPajama version of arXiv). In our opinion, using a better tool (like yours!) that standardizes the output (to HTML or Markdown or something) would be much better.
We could of course download arXiv and run latexml over the whole thing, but that feels like a lot of effort to go to when you've already done it! I think we could do a requester-pays arrangement, if that's helpful.
Thanks!
Hi @dlwh
I have been trying to address this as far back as 2017, but it is really hard to gain visibility for datasets in small research labs. Luckily, ar5iv and now arXiv's official HTML view solve that, and we can get issue requests like yours!
This week we are just finishing a rerun over the latest arXiv with latexml v0.8.8, which will then get rsynced into ar5iv. I am in active internal talks to get those results distributed as a dataset again, and this time I plan to pin an announcement on the ar5iv front page as well.
We actually used to do dataset packaging actively in 2017-2020, when the project name was arXMLiv; see the resource page here.
Would such an arrangement suit your needs? The process is a bit overly careful, to ensure all stakeholders are properly respected, including the official article licensing, for which we have a restrictive research-only license and a manual step to gain access.
And yes, as you say, downloading arXiv and running latexml over it is something anyone can do on their own, but you will burn roughly 5.4 CPU years to get there with our current settings. I assume that is what Meta did when they worked on Nougat.
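For a sense of what such a do-it-yourself run involves, here is a minimal sketch of a conversion loop. It is an illustration only, not the ar5iv/CorTeX pipeline: the directory names are hypothetical, and the latexmlc flags shown are common ones rather than the exact ar5iv settings.

```python
# Minimal sketch of a DIY conversion loop (not the ar5iv/CorTeX pipeline).
# Assumes latexmlc is installed and each paper has been unpacked to a .tex file.
import subprocess
from pathlib import Path

SRC = Path("arxiv-src")    # hypothetical directory of unpacked TeX sources
OUT = Path("arxiv-html")
OUT.mkdir(exist_ok=True)

for tex in sorted(SRC.glob("*.tex")):
    dest = OUT / (tex.stem + ".html")
    subprocess.run(
        ["latexmlc", "--format=html5", "--timeout=600",
         f"--destination={dest}", str(tex)],
        check=False,  # individual conversions can fail; keep going
    )
```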
Also, this technically belongs in the ar5iv repository, but since I can't transfer it over, I will just leave it open for now and close when the dataset is public. Hopefully in April.
the data preprocessing and cleaning is way behind what's done behind closed doors.
I have an archived repository that did preprocessing over latexml's HTML called llamapun - and that name predates the recent llama models by a decade! :>
I've been refreshing the essential pieces of that for LLM preprocessing in a private repository, which I'll try to make public as well some time this year. But I also think it is more than welcome to have multiple different preprocessing toolkits over the HTML, since there are so many possible directions in which one can go. Sharing more for awareness than reuse at this point.
Thanks! Sorry about not noticing the ar5iv repo!
The arXMLiv dump is a great starting point for us, thank you!
Research-only should be OK for us. If possible, we'd ultimately like to get to a point where we could put something on HF, which could be behind a license gate if you wanted.
Please let us know if we can help in any way, including just expressing our interest to powers-that-be.
And thanks for all the pointers!
Hi @dginev. I recently did a similar run to convert whole TeX sources to XML/HTML format via LaTeXML.
Here are the statistics from my compilation run (shared as a screenshot).
I saw your team also reran the TeX sources up to 2024-02 via LaTeXML (the arXMLiv datasets).
I am now working hard on the Error/Fail part. The key point is to remove unnecessary and unhelpful LaTeX commands (like the standardize_\alpha and standardize_\beta steps I did), so that I can bypass the undefined-macro errors in LaTeXML.
Unfortunately, the TeX sources in the Error/Fail part cannot be handled by a simple TeX cleaner. Do you have any ideas about preprocessing/formatting/standardizing the TeX sources?
For example, I find that in some cases the error happens on undefined math macros like $\EuScript{P}$, which would not affect the overall layout of the output. Maybe it would be better to have an option to bypass math parsing entirely (keeping it as $xxxxxx$ in plain text) and then render the math on the front end.
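For context, the kind of TeX-side patch being described could look roughly like the sketch below: collect the macros LaTeXML reports as undefined and prepend fallback definitions before rerunning. This is not the tooling discussed above; the \EuScript-to-\mathcal stand-in and the file name are assumptions for illustration.

```python
# Minimal sketch (not the tooling discussed above): prepend fallback
# definitions for macros that LaTeXML reported as undefined, so the
# conversion can proceed. The \EuScript -> \mathcal stand-in is an assumption.
from pathlib import Path

FALLBACK_PREAMBLE = "\n".join([
    r"\providecommand{\EuScript}[1]{\mathcal{#1}}",  # assumed approximation
    # ...one \providecommand per macro collected from the error logs...
]) + "\n"

def patch_tex_source(path: Path) -> None:
    """Insert fallback macro definitions just before \\begin{document}."""
    text = path.read_text(encoding="utf-8", errors="replace")
    marker = r"\begin{document}"
    if marker in text and FALLBACK_PREAMBLE not in text:
        path.write_text(text.replace(marker, FALLBACK_PREAMBLE + marker, 1),
                        encoding="utf-8")

patch_tex_source(Path("paper.tex"))  # hypothetical file name
```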
@veya2ztn I am sorry to hear there is yet another run over arXiv with latexml; I know they are not cheap to perform.
The screenshot you've shared is from the CorTeX build page of our runs that build ar5iv. Indeed, the numbers are current up to the end of March 2024. And to the original topic of this issue: those assets are now packed and are almost ready for reuse (more on that soon).
For reducing the errors, the only "serious" answer I have is that this is a very big part of why I myself work on LaTeXML. Improvements to the conversion tool, which processes the document with an emulation of the TeX internals, are much more reliable than deploying a regex-based cleaner tool. As you can see from the CorTeX error report, there is a large breadth of possible failures - undefined macros are just one of many.
To answer with my "pragmatic" hat on, it really depends on how you are planning on using the data. If you want to be safe, you can exclude any document with the Error severity entirely. On the slightly more permissive side, you can include those documents, chunk them into segments you find relevant (e.g. class ltx_para paragraphs), and then exclude any paragraphs which have the ltx_ERROR markup deposited inside; a minimal sketch of that filter follows below. I had coded a variant of this approach in the past for my own preprocessing, seen here. In my experience it is both more reliable and more manageable to do HTML filters than TeX filters. Good luck!
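To make the permissive option concrete, here is a minimal sketch of such a paragraph-level HTML filter. It is not the linked llamapun code; it assumes BeautifulSoup and a hypothetical input file name.

```python
# Minimal sketch of the permissive option above: keep ltx_para blocks from a
# LaTeXML-produced HTML file, dropping any block containing ltx_ERROR markup.
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

def clean_paragraphs(html_path: str) -> list[str]:
    with open(html_path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "html.parser")
    kept = []
    for para in soup.find_all(class_="ltx_para"):
        # Skip paragraphs where LaTeXML deposited an error marker.
        if para.find(class_="ltx_ERROR"):
            continue
        kept.append(para.get_text(" ", strip=True))
    return kept

for text in clean_paragraphs("2403.00001.html"):  # hypothetical file name
    print(text)
```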
I'm glad to hear that a packaged arXiv HTML source is about to be released, and I'm really looking forward to it.
Thank you also for your suggestion. It seems there's no better way for me to improve the pass rate of my TeX submissions from my end. Therefore, I will devote more effort to handling the non-TeX parts.
Hi everyone,
We have a dataset uploaded, and a homepage is now available: https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
Since we also have the unusual affordance of an open GitHub issue, I am sharing the link here first. Feel free to request a download link (you'd need to submit a license agreement which is friendly to research use). If all goes well, we can close the issue soon, and I can advertise more widely.
My gratitude to @veya2ztn for testing, as he was the first downloader here. Seeing that our download process is healthy, I have announced the dataset release, and will close this issue.
The official link for ar5iv-04.2024 will continue to be:
https://sigmathling.kwarc.info/resources/ar5iv-dataset-2024/
Enjoy!
Thank you so much! I really appreciate it!