Search Alternatives to GROBID (PDF parsing)

Are there any alternatives to GROBID and would there be any major advantages in using them?

Alternatives (feel free to add new entries)

https://github.com/pdfminer/pdfminer.six
https://github.com/mstamy2/PyPDF2
https://github.com/pymupdf/PyMuPDF

Comments

If we go for a pure Python solution, there might be no need for an intermediary format (i.e. TEI XML for GROBID).

Oct 20 '21 12:10 jankrepl

I think that having a benchmark of various possible solutions is a good idea.

I also agree that using GROBID creates some complications:

the output is an intermediary format
you need to docker pull a GROBID image to run the server – but how do we track the version of the GROBID server running?
instead of directly calling a function, we need to send requests to a server, which may be seen as an unnecessary complication

But for the moment, maybe we can wait until we see some failure cases of GROBID on our articles before thinking about alternatives. After all, GROBID seems to be a well-established solution, used e.g. by the creators of CORD-19. What do you think, @jankrepl?

Oct 22 '21 12:10 FrancescoCasalegno

Also, I had a look at the paper they used in that blog post for their benchmark: https://schoolshooters.info/sites/default/files/2014-NaBITA-Whitepaper-Text-with-Graphics.pdf

I think it looks a bit simple (was it written in Google Docs/Word and then saved as PDF?) compared to the two-column articles with lots of figures and tables generated with LaTeX that we have to deal with.

So when we run this benchmark, I think we should also test on different kinds of papers.
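
To make the benchmark idea a bit more concrete, here is a minimal sketch of what a harness could look like. Everything in it is an assumption for illustration: `similarity`, `benchmark`, the parser callables and the reference texts are hypothetical names, and a character-level similarity ratio is just one possible metric, not something we have agreed on. Each parser is passed in as a plain function from a PDF path to extracted text, so GROBID, pdfminer.six, PyMuPDF, etc. could all be wrapped the same way.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def similarity(extracted: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring whitespace differences."""
    def normalize(text: str) -> str:
        return " ".join(text.split())

    return SequenceMatcher(None, normalize(extracted), normalize(reference)).ratio()


def benchmark(
    parsers: Dict[str, Callable[[str], str]],
    documents: Dict[str, str],
) -> Dict[str, float]:
    """Average similarity per parser.

    parsers:   hypothetical mapping, parser name -> function(pdf_path) -> text
    documents: hypothetical mapping, pdf_path -> hand-checked reference text
    """
    if not documents:
        raise ValueError("at least one document is needed")
    scores = {}
    for name, parse in parsers.items():
        per_doc = [similarity(parse(path), ref) for path, ref in documents.items()]
        scores[name] = sum(per_doc) / len(per_doc)
    return scores
```

The `documents` mapping is where the different kinds of papers would go (single-column Word exports, two-column LaTeX articles with figures and tables, and so on), so that a parser cannot score well by handling only the easy layout.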

Oct 22 '21 12:10 FrancescoCasalegno

Small side note related to this: GROBID saves the version used to convert the PDF to TEI XML in the XML file itself (see here).
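
As a rough sketch of how that version could be read back, assuming it is stored as a `version` attribute on an `application` element with `ident="GROBID"` inside `appInfo` in the TEI header (the exact path should be checked against real output). The function name `grobid_version` and the sample document are made up for illustration, and the version number in the sample is arbitrary.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def grobid_version(tei_xml: str) -> Optional[str]:
    """Return the GROBID version recorded in a TEI document, or None if absent.

    Assumes the version is stored on <application ident="GROBID" version="...">
    under <appInfo>; this layout is an assumption, not verified here.
    """
    root = ET.fromstring(tei_xml)
    for app in root.iterfind(".//tei:appInfo/tei:application", TEI_NS):
        if app.get("ident") == "GROBID":
            return app.get("version")
    return None


# Hand-written sample for illustration only, not real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <appInfo>
        <application version="0.7.0" ident="GROBID"/>
      </appInfo>
    </encodingDesc>
  </teiHeader>
</TEI>"""

print(grobid_version(SAMPLE_TEI))
```

Something like this could be run over every converted file to check that all of them were produced by the same server version.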

Oct 22 '21 14:10 EmilieDel

As an alternative to GROBID, there is the solution here, developed in the context of OpenMinTeD.

The extracted text could be accessed through document_text here.

Oct 22 '21 17:10 pafonta
