
PDF composed of images of scans causes grobid to fail (add error handling)

Open cindywu opened this issue 6 years ago • 4 comments

schemenauer1994.pdf

cindywu avatar Dec 03 '19 20:12 cindywu

This looks like a grobid issue:

MultiXml::ParseError (1:1: FATAL: Start tag expected, '<' not found):

dennyluan avatar Dec 07 '19 08:12 dennyluan

This is caused by https://github.com/kermitt2/grobid/issues/132: the PDF file is composed of images of scans and has no text to parse, so grobid fails.

dennyluan avatar Dec 08 '19 21:12 dennyluan

For all PDFs that are images of scans, we could show an error message saying that we do not yet support parsing text from images of scans.
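A minimal sketch of what that error handling might look like, assuming the grobid response body is available as a string before it reaches the XML parser. The names `UnsupportedPdfError` and `ensure_xml_response!` are hypothetical and not from the codebase. The check assumes a failed response is one whose body does not start with `<`, which is what the `MultiXml::ParseError` above reports at position 1:1.

```ruby
# Hypothetical error class for PDFs that yield no parseable text.
class UnsupportedPdfError < StandardError
  def initialize(msg = "We do not yet support parsing text from PDFs that are images of scans.")
    super
  end
end

# Hypothetical guard, called before handing the body to the XML parser.
# `body` is assumed to be the raw response string returned by grobid.
def ensure_xml_response!(body)
  # A usable response should begin with an XML start tag.
  return body if body.to_s.lstrip.start_with?("<")

  # Otherwise, surface a readable error instead of a parser crash.
  raise UnsupportedPdfError
end
```

The caller could then rescue `UnsupportedPdfError` and show its message to the user. Rescuing `MultiXml::ParseError` around the existing parse call would be another option. The guard is used here only so the sketch runs without the gem installed.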

cindywu avatar Dec 08 '19 23:12 cindywu

https://linear.app/issue/JEL-36/grobid-fails-when-pdf-is-composed-of-images

cindywu avatar May 11 '20 00:05 cindywu