
Grobid consistently drops characters, e.g., "fi", "ff"

Open mfeblowitz opened this issue 2 years ago • 8 comments

I am using Grobid (0.6.1 and 0.7.0, Ubuntu 18.04) to extract the content of PDF files into HTML format. (Separately from Grobid, I further extract paragraph content for question answering.)

I have noticed several cases where a pair of characters is replaced by a space. The images below show the PDF document and the resulting extracted HTML content (direct output from Grobid) where the changes have occurred.

For example, character pairs in the extracted paragraphs, e.g., "fi" in "financial", are replaced by a space. One example is https://www.americanprogress.org/article/economic-impact-coronavirus-united-states-possible-economic-policy-responses/ - verified using the Grobid web app's TEI tab (so, independent of any code I've written).

See, for example, what happens to the original:

[Screenshot of the original PDF text: grobid_droppage_orig]

As reflected in the extracted HTML:

[Screenshot of the extracted HTML output, with the characters dropped: grobid_droppage]
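
For anyone who wants to reproduce the check outside the web app, here is a minimal sketch that sends a PDF to a local Grobid service and scans the returned TEI for obvious symptoms. The server address, "sample.pdf", and the " nancial" probe are placeholder assumptions, not part of the original report.

```python
# Sketch: reproduce the check against a local Grobid service instead of the web app.
# Assumes a Grobid server on localhost:8070 and a downloaded "sample.pdf" (placeholders).
import requests

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

with open("sample.pdf", "rb") as pdf:
    resp = requests.post(GROBID_URL, files={"input": pdf}, timeout=120)
resp.raise_for_status()
tei = resp.text

# Crude symptom check: replacement characters or a tell-tale missing-ligature gap.
if "\ufffd" in tei or " nancial" in tei:
    print("Extraction appears to have lost ligature characters")
```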

This is doing a number on our NLP/NLU processing of these documents.

Any suggested adjustments?

mfeblowitz, Feb 16 '22

Hello @mfeblowitz !

I guess in this case you are producing a PDF from the HTML page, correct? With which tool are you generating this PDF?

Apparently with Firefox/Linux, the generated PDF uses embedded fonts for the ligatures (ff, fi), so the Unicode for these glyphs is not the correct Unicode of these characters but an index to the glyph in the local fonts.

For instance, if you cut and paste the text from this PDF with a basic PDF viewer, or extract it with the pdftotext command line tool, you get:

can have disruptive e�ects on the economy.
... making it harder for U.S. �rms to �ll orders...
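
To check for this programmatically, here is a minimal sketch (assuming poppler-utils' pdftotext is on the PATH; "doc.pdf" stands in for the affected file) that counts how many characters come out as the replacement character U+FFFD:

```python
# Sketch: run poppler's pdftotext and count unmapped glyphs that come out as U+FFFD.
# "doc.pdf" is a placeholder for the affected file.
import subprocess

result = subprocess.run(
    ["pdftotext", "doc.pdf", "-"],   # "-" writes the extracted text to stdout
    capture_output=True, encoding="utf-8", errors="replace", check=True,
)
text = result.stdout
print(f"{text.count(chr(0xFFFD))} replacement characters "
      f"out of {len(text)} extracted characters")
```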

This is unfortunately very frequent in PDFs, particularly for scientific text with a lot of special characters that are not associated with the right Unicode code point but only with an embedded glyph. There's nothing magic in Grobid for the moment to "guess" the valid Unicode of embedded fonts. We started to work on a custom OCR solution in pdfalto to recover the right Unicode, but it's a lot of work.

What you could do is try to mitigate this issue at the level of the HTML-to-PDF tool, or try to change the font in the HTML before generating the PDF, so that the PDF contains a more standard font.

kermitt2, Feb 17 '22

Um, no. Sorry not to have been clear; I'm updating the description. I'm pulling the PDFs from the web and extracting from them, so I have no control over the production of the PDFs.

mfeblowitz, Feb 17 '22

Do you have an example of such a PDF? Where does it come from? Because this article seems to have originally been in HTML.

The problem applies similarly to native PDFs using embedded fonts for ligatures, but it's somewhat worse because there is no upstream solution, except using OCR, which might degrade other aspects of the document.

kermitt2, Feb 17 '22

Interesting... The PDF document (linked above) was the product of saving that web page to a PDF file. The contents are (mostly) binary, and pdftotext indeed revealed the same behavior. On a hunch, I tried "print to PDF" in Firefox rather than "export to PDF" or "print... save as PDF" in Safari. Firefox did the right thing. So I do have control over which source (of PDF documents) to use!

mfeblowitz, Feb 17 '22

Now, if only there were a way to be alerted when the ligature substitution might have occurred, so that excruciating manual examination of all processed documents would not be required...

mfeblowitz, Feb 17 '22

Hmm, check whether "fi" or "ff" occurs in the text or not? At least it would cover the ligature case, but the embedded-font issue can happen for many characters in general.

kermitt2, Feb 17 '22

That's the rub. To know whether the text should have those characters, you'd need a good extraction to compare against.

Or you'd need a comprehensive (huge) set of patterns to look for in the bad text: "e ect" for effect, " nance" or " nancial" for finance or financial, ...

Maybe use some machine learning to learn the patterns.

Or maybe some NLP to detect nonsense sentences...
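
As a concrete starting point, here is a minimal sketch of such a check, combining the "does 'fi'/'ff' appear at all" idea above with a few gap patterns. The patterns, the length threshold, and the function name are illustrative assumptions, not a vetted detector.

```python
import re

# Illustrative patterns only; a real deployment would need a much larger list
# or a dictionary / language-model based check.
GAP_PATTERNS = [
    r"\be\s?ect\b",    # "effect"      -> "e ect" or "eect"
    r"\bnancial\b",    # "financial"   -> " nancial"
    r"\bnance\b",      # "finance"     -> " nance"
    r"\bll orders\b",  # "fill orders" -> " ll orders"
]

def looks_ligature_damaged(text: str) -> bool:
    """Heuristic: flag extracted text that probably lost "fi"/"ff" ligatures."""
    if "\ufffd" in text:  # literal replacement characters, as in the pdftotext output
        return True
    if any(re.search(p, text) for p in GAP_PATTERNS):
        return True
    # The simple check suggested above: a longer English text that never
    # contains "fi" or "ff" at all is already suspicious.
    return len(text) > 2000 and "fi" not in text and "ff" not in text
```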

mfeblowitz, Feb 17 '22

For info, I've worked on an OCR quality scorer to detect documents with noise coming from terrible OCR (like OCR from the nineties), so that the document can be filtered out or re-OCRed with a modern OCR engine. It might be possible to apply it to your use case, as the nonsense text due to the destructive HTML-to-PDF conversion would likely lower the quality score of the converted document. It's based on a DL language model applied to chunks of an input document, then normalized with an XGBoost model.

https://github.com/science-miner/ocr_scorer
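
For readers who want a rough pre-filter before trying the full scorer, here is a toy sketch of the general idea only; it is not the ocr_scorer approach described above (which uses a deep-learning language model plus an XGBoost normalizer). The word-list path, chunk size, and 0.85 threshold are placeholder assumptions.

```python
# Toy illustration of a quality score: rate each chunk by the fraction of its
# alphabetic tokens found in a reference word list, then flag low-scoring documents.
import re
from statistics import mean

with open("words.txt", encoding="utf-8") as f:       # placeholder word list
    VOCAB = {w.strip().lower() for w in f if w.strip()}

def chunk_score(chunk: str) -> float:
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", chunk)]
    if not tokens:
        return 1.0
    return sum(t in VOCAB for t in tokens) / len(tokens)

def document_score(text: str, chunk_size: int = 1000) -> float:
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return mean(chunk_score(c) for c in chunks) if chunks else 1.0

if __name__ == "__main__":
    with open("extracted.txt", encoding="utf-8") as f:  # placeholder extracted text
        score = document_score(f.read())
    print(f"quality score: {score:.3f}", "(suspicious)" if score < 0.85 else "")
```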

kermitt2, Apr 13 '22