Alternatives to GROBID (PDF parsing)
Are there any alternatives to GROBID and would there be any major advantages in using them?
Alternatives (feel free to add new entries)
- https://github.com/pdfminer/pdfminer.six
- https://github.com/mstamy2/PyPDF2
- https://github.com/pymupdf/PyMuPDF
Other links
Comments
If we go for a pure Python solution, there might be no need for an intermediary format (i.e. TEI XML for GROBID).
I think that having a benchmark of various possible solutions is a good idea.
I also agree that using GROBID creates some complications:
- the output is an intermediary format
- you need to `docker pull` a GROBID image to run the server, but how do we track the version of the GROBID server running?
- instead of directly calling a function, we need to send requests to a server, which may be seen as an unnecessary complication
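On the version-tracking question, one option is to ask the running server directly. Below is a minimal sketch, assuming the server listens on GROBID's default port 8070 and exposes a plain-text version endpoint at `/api/version`, as I understand the GROBID REST docs to describe. The function name and the `fetch` parameter are my own, added only so the logic can be exercised without a live server.

```python
from urllib.request import urlopen


def grobid_version(base_url="http://localhost:8070", fetch=None):
    """Return the version string reported by a running GROBID server.

    `fetch` is a hypothetical hook (URL -> response text) used for testing;
    by default the URL is fetched over HTTP with urllib.
    """
    # Assumption: the version endpoint lives at <base_url>/api/version.
    url = base_url.rstrip("/") + "/api/version"
    if fetch is None:
        def fetch(u):
            with urlopen(u, timeout=10) as resp:
                return resp.read().decode("utf-8")
    # Strip surrounding whitespace/newlines from the plain-text response.
    return fetch(url).strip()
```

We could log this value next to every converted article so that results stay traceable to a server version.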
But maybe for the moment we can wait until we see some failure cases of GROBID on our articles before thinking about alternatives. In the end, GROBID seems to be a well-established solution, used e.g. by the creators of CORD-19.
What do you think, @jankrepl?
Also, I had a look at the paper they used in that blog post for their benchmark: https://schoolshooters.info/sites/default/files/2014-NaBITA-Whitepaper-Text-with-Graphics.pdf
I think it looks a bit simple (was it written in Google Docs/Word and then saved as PDF?) compared to other two-column articles with lots of figures and tables generated with LaTeX like the ones we have to deal with.
So when we run this benchmark, I think we should also test on different kinds of papers.
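To make the benchmark idea concrete, here is a rough sketch of a harness that runs several extractors over several PDFs and records timing, output length, and failures. Everything here is an assumption for illustration: the names `run_benchmark` and `pdfminer_extract` are mine, and output length is only a crude proxy for quality. The only real library call is `pdfminer.high_level.extract_text`, imported lazily so the harness itself does not require pdfminer.six.

```python
import time
from typing import Callable, Dict, Iterable, List


def pdfminer_extract(path: str) -> str:
    """Example extractor using pdfminer.six (must be installed to call this)."""
    # Imported lazily so the harness runs even without pdfminer.six.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def run_benchmark(
    extractors: Dict[str, Callable[[str], str]],
    pdf_paths: Iterable[str],
) -> List[dict]:
    """Run every extractor on every PDF and collect one result row per pair."""
    rows = []
    for path in pdf_paths:
        for name, extract in extractors.items():
            start = time.perf_counter()
            try:
                text, error = extract(path), None
            except Exception as exc:  # record failures instead of aborting the run
                text, error = "", repr(exc)
            rows.append({
                "pdf": path,
                "tool": name,
                "seconds": time.perf_counter() - start,
                "n_chars": len(text),
                "error": error,
            })
    return rows
```

Usage would be something like `run_benchmark({"pdfminer": pdfminer_extract}, paths)`, where `paths` mixes simple single-column documents with two-column LaTeX articles containing figures and tables.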
Small side note related to this: GROBID saves the version used to convert the PDF to TEI XML in the XML file itself (see here).
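For reference, reading that version back could look roughly like the sketch below. It assumes the version is stored as the `version` attribute of an `<application ident="GROBID">` element inside `<appInfo>` in the TEI header, which is what I see in GROBID output. The sample document is a trimmed illustration I wrote, not real output, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by GROBID's TEI XML output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, trimmed sample for illustration only (not real GROBID output).
SAMPLE_TEI = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><encodingDesc>'
    '<appInfo><application version="0.6.0" ident="GROBID"/></appInfo>'
    "</encodingDesc></teiHeader></TEI>"
)


def grobid_version_from_tei(tei_xml: str):
    """Return the GROBID version recorded in a TEI document, or None."""
    root = ET.fromstring(tei_xml)
    # Look for <application ident="GROBID" version="..."> under <appInfo>.
    for app in root.iterfind(".//tei:appInfo/tei:application", TEI_NS):
        if app.get("ident", "").upper() == "GROBID":
            return app.get("version")
    return None


print(grobid_version_from_tei(SAMPLE_TEI))
```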
As an alternative to GROBID, there is the solution here, developed in the context of OpenMinTeD.
The extracted text could be accessed through document_text here.