Alternatives to GROBID (PDF parsing)

Open jankrepl opened this issue 4 years ago • 4 comments

Are there any alternatives to GROBID and would there be any major advantages in using them?

Alternatives (feel free to add new entries)

  • https://github.com/pdfminer/pdfminer.six
  • https://github.com/mstamy2/PyPDF2
  • https://github.com/pymupdf/PyMuPDF

Other links

Comments

If we go for a pure Python solution, there might be no need for an intermediary format (i.e. TEI XML for GROBID).

jankrepl avatar Oct 20 '21 12:10 jankrepl

I think that having a benchmark of various possible solutions is a good idea.

I also agree that using GROBID creates some complications:

  • the output is an intermediary format
  • you need to docker pull a GROBID image to run the server – but how do we track the version of the GROBID server running?
  • instead of directly calling a function, we need to send requests to a server, which may be seen as an unnecessary complication

But maybe for the moment we can wait until we see some failure cases of GROBID on our articles before thinking about alternatives. In the end, GROBID seems to be a well-established solution, used e.g. by the creator of CORD-19. What do you think, @jankrepl?

FrancescoCasalegno avatar Oct 22 '21 12:10 FrancescoCasalegno

Also, I had a look at the paper they used in that blog post for their benchmark: https://schoolshooters.info/sites/default/files/2014-NaBITA-Whitepaper-Text-with-Graphics.pdf

I think it looks a bit simple (was it written in Google Docs/Word and then saved as PDF?) compared to other two-column articles with lots of figures and tables generated with LaTeX like the ones we have to deal with.

So when we run this benchmark, I think we should also test on different kinds of papers.
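A benchmark like the one discussed above could be sketched roughly as follows. This is only an illustration: `benchmark_parsers`, the stand-in parsers, and the file names are hypothetical, and the metrics recorded (wall-clock time and number of extracted characters) are an assumption about what would be worth comparing. Real candidates (GROBID, pdfminer.six, PyPDF2, PyMuPDF) would each be wrapped in a function that takes a PDF path and returns the extracted text.

```python
import time
from typing import Callable, Dict, Iterable, List


def benchmark_parsers(
    parsers: Dict[str, Callable[[str], str]],
    pdf_paths: Iterable[str],
) -> List[dict]:
    """Run every parser on every PDF and record timing and output size."""
    results = []
    for path in pdf_paths:
        for name, parse in parsers.items():
            start = time.perf_counter()
            try:
                text = parse(path)
                error = None
            except Exception as exc:
                # A failed parse is recorded as a data point instead of aborting the run.
                text = ""
                error = repr(exc)
            results.append(
                {
                    "parser": name,
                    "pdf": path,
                    "seconds": time.perf_counter() - start,
                    "n_chars": len(text),
                    "error": error,
                }
            )
    return results


# Hypothetical stand-ins so the sketch runs without any PDF library installed.
def fake_parser(path: str) -> str:
    return f"text of {path}"


def broken_parser(path: str) -> str:
    raise ValueError("cannot parse")


results = benchmark_parsers(
    {"fake": fake_parser, "broken": broken_parser},
    ["one_column_word.pdf", "two_column_latex.pdf"],  # hypothetical file names
)
for row in results:
    print(row["parser"], row["pdf"], row["n_chars"], row["error"])
```

Including both simple Word-style PDFs and two-column LaTeX articles in `pdf_paths` would make the differences between parsers visible per document type.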

FrancescoCasalegno avatar Oct 22 '21 12:10 FrancescoCasalegno

Small side note related to this: GROBID saves the version it used to convert the PDF to TEI XML in the XML file itself (see here).
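As a rough sketch of how that version could be read back, assuming the TEI header contains an `application` element with `ident="GROBID"` and a `version` attribute inside `appInfo`: the helper name `grobid_version` is hypothetical, and the sample document and its version number are made up for illustration, not real GROBID output.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# TEI documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def grobid_version(tei_xml: str) -> Optional[str]:
    """Return the GROBID version recorded in a TEI header, or None if absent."""
    root = ET.fromstring(tei_xml)
    for app in root.iterfind(".//tei:appInfo/tei:application", TEI_NS):
        if app.get("ident") == "GROBID":
            return app.get("version")
    return None


# Illustrative sample only; the layout and version value are assumptions.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <encodingDesc>
      <appInfo>
        <application version="0.7.0" ident="GROBID" when="2021-10-22T14:10+0000">
          <desc>GROBID</desc>
        </application>
      </appInfo>
    </encodingDesc>
  </teiHeader>
</TEI>"""

print(grobid_version(SAMPLE_TEI))  # prints 0.7.0
```

Storing this value next to each parsed article would be one way to keep track of which server version produced it.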

EmilieDel avatar Oct 22 '21 14:10 EmilieDel

As an alternative to GROBID, there is the solution here, developed in the context of OpenMinTeD.

The extracted text can be accessed through document_text here.

pafonta avatar Oct 22 '21 17:10 pafonta