Review PDF parsing strategy

Open pudo opened this issue 4 years ago • 8 comments

What's broken?

  • We're seeing incorrect text extraction from some documents, especially those containing Arabic text.
  • Text extracted from images isn't placed at the right location in the surrounding text.
  • We have to maintain our own PDF binary bindings.
  • We extract the images in a document to files first, then run OCR on them. We don't really need to put them on disk.

What are our options?

  • Continue with pdflib
  • Try out pdfreader - https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-start
  • Try out pdfminer.six

pudo avatar Aug 04 '20 19:08 pudo

As far as I can tell, pdflib does not enable the proper extraction of text from a PDF with a multi-column layout, which is quite frequent for documents designed to be printed on paper, such as official documents from administrative authorities.

The output of pdflib is of the form:

line_1_col_A    line_1_col_B
line_2_col_A    line_2_col_B
...
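To make the interleaving concrete, here is a minimal sketch of a naive post-processing step that splits such output back into columns. It assumes the column gap survives as a run of two or more whitespace characters, which real documents do not guarantee; `unweave_columns` is a hypothetical helper, not part of pdflib or pdfminer.six.

```python
import re


def unweave_columns(text, gap=r"\s{2,}"):
    """Regroup row-major text into column-major text.

    Each input line is split on a wide whitespace gap; the i-th cell of
    every line is appended to the i-th column.
    """
    columns = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        for index, cell in enumerate(re.split(gap, stripped)):
            while len(columns) <= index:
                columns.append([])
            columns[index].append(cell)
    # Emit all of column A, then all of column B, and so on.
    return "\n".join("\n".join(column) for column in columns)


sample = "line_1_col_A    line_1_col_B\nline_2_col_A    line_2_col_B"
print(unweave_columns(sample))
```

This only works when the gap is preserved and every row has the same number of cells, which is why layout analysis inside the parser is the more robust route.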

@pudo suggested I take another look at pdfminer.six. It supports multi-column layouts out of the box (see https://github.com/pdfminer/pdfminer.six/issues/276 ). The parameters for layout analysis (https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L32) are exposed. In my test cases, I only needed to adjust the value of char_margin to recover the correct text structure. I cannot run thorough tests on edge cases, but would it be useful if I prepared a minimal PR that replaces pdflib with pdfminer.six?
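For reference, a minimal sketch of passing a custom char_margin to pdfminer.six through its high-level API. The function name `extract_text_with_margin` is hypothetical, and the value that recovers the right structure has to be found per document set; the comment above does not say which value was used. In pdfminer.six, two characters closer together than this margin (relative to the character width) are treated as part of the same line.

```python
# Default value of LAParams.char_margin in pdfminer.six.
DEFAULT_CHAR_MARGIN = 2.0


def extract_text_with_margin(pdf_path, char_margin=DEFAULT_CHAR_MARGIN):
    """Extract text from a PDF with a custom char_margin for layout analysis."""
    # Imported lazily so this module still loads where pdfminer.six
    # is not installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    return extract_text(pdf_path, laparams=LAParams(char_margin=char_margin))
```

Usage would look like `extract_text_with_margin("report.pdf", char_margin=1.0)`, where both the file name and the value are placeholders.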

moreymat avatar Sep 04 '20 15:09 moreymat

I'd be very, very excited to see that PR, especially with the prospect of somewhat better plain-text construction, perhaps even placing the OCR output correctly in the surrounding text. Obviously that isn't a goal for the first shot, but for me it's a reason to seriously consider pdfminer in the medium term.

pudo avatar Sep 07 '20 08:09 pudo

I haven't tested Aleph yet, though I keep considering it for a project to analyze Arabic historical data. Within this project, we ran into the problem of extracting Arabic text from PDFs. None of these libraries extracts Arabic correctly. However, we were able to make some changes to PDFMiner to fix these issues. The issues are as follows:

  • Since PDF parsing proceeds LTR, converting the output to RTL afterwards is not enough: it fixes everything except the Arabic ligatures. We fixed this by creating subclasses in PDFMiner to handle them. I think this issue exists in all of the libraries mentioned, but I'm not sure.
  • The other problem, which cannot be resolved automatically, is the CMAP for Arabic text. I found that a lot of PDF files have missing or messy CMAPs. The CMAP should link each glyph to its Unicode character. This has to be fixed manually for each affected file, and therefore before the ingestion process.
  • Right now, we are working on fixing the order of text blocks. That should not be an issue if only searching is required, as PDFMiner groups text lines correctly.
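Since CMAP problems have to be caught before ingestion, one rough way to flag suspect files is sketched below. It relies on pdfminer.six emitting placeholders of the form (cid:N) for glyphs it cannot map to Unicode. The helper names and the 0.2 threshold are assumptions for illustration, and the check only catches missing mappings, not mappings that exist but point to the wrong character.

```python
import re

# Placeholder pdfminer.six writes for a glyph with no Unicode mapping.
CID_PLACEHOLDER = re.compile(r"\(cid:\d+\)")


def unmapped_glyph_ratio(text):
    """Return the share of glyphs that came out as (cid:N) placeholders."""
    placeholders = CID_PLACEHOLDER.findall(text)
    remainder = CID_PLACEHOLDER.sub("", text)
    # Count every non-whitespace character as one mapped glyph.
    mapped = sum(1 for ch in remainder if not ch.isspace())
    total = mapped + len(placeholders)
    return len(placeholders) / total if total else 0.0


def needs_manual_cmap_fix(text, threshold=0.2):
    """Flag extracted text whose placeholder ratio exceeds the threshold."""
    return unmapped_glyph_ratio(text) > threshold
```

A file flagged this way could be set aside for manual repair instead of being indexed with unreadable text.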

The problem with PDFMiner is that it's slow, as it is pure Python. As for its use with Aleph, it would be great to make the PDF extraction step extensible, like the other ingest steps.

I will finish the PDFMiner fixes and share the repository.

mkhashoggi avatar Sep 24 '20 14:09 mkhashoggi

@mkhashoggi Wow, that is incredible context for you to document. Thank you SO MUCH for taking the time to write it up for us. I would love to see the repository of PDFMiner overrides that you mention. I also am a committer on that project and could maybe help to upstream some of it.

pudo avatar Sep 24 '20 15:09 pudo

Thanks @pudo, I will share the repo before the end of next week. I appreciate your support.

mkhashoggi avatar Sep 25 '20 09:09 mkhashoggi

Hello,

Here is the repo with the branch that supports RTL languages. It integrates with python-bidi to rearrange characters at the text-line level to support bidirectional languages. https://github.com/mkhashoggi/pdfminer.six/tree/supports_rtl

I'm adding all the features we have already worked on, such as detecting sub/superscripts, merging combining accents with their base characters, RTL top-to-bottom ordering, and finally detecting tables (I'm using the camelot algorithm, but purely in Python; the results were identical in accuracy to camelot and more than 10x faster. Maybe this is useful for Aleph to convert PDF tables to structured entities). I will start discussing this on the PDFMiner page.
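As an illustration of the last step of accent merging only, the sketch below composes a base character and a following combining mark using Unicode NFC normalization from the standard library. This is not the code from the branch: it assumes the accent has already been placed directly after its base character in the text stream, which is the hard, layout-dependent part that the branch has to solve.

```python
import unicodedata


def merge_combining_marks(text):
    """Compose base characters with the combining marks that follow them."""
    return unicodedata.normalize("NFC", text)


# "e" followed by COMBINING ACUTE ACCENT becomes a single precomposed character.
decomposed = "e\u0301"
composed = merge_combining_marks(decomposed)
print(len(decomposed), len(composed))
```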

mkhashoggi avatar Oct 01 '20 10:10 mkhashoggi

@vinayak-mehta Do you know how well camelot performs with RTL languages? Have you run any comparison with pdfminer.six? Thanks!

arky avatar Apr 03 '21 17:04 arky

I just checked the Arabic PDF test that's in Camelot right now, and it looks like it's broken probably because of the messy CMAP problem that @mkhashoggi mentioned, along with it not being able to differentiate an RTL language.

I need to check out https://github.com/mkhashoggi/pdfminer.six/tree/supports_rtl but based on @mkhashoggi's messages above, it looks like it would be a good candidate for addition into pdfminer.six!

@mkhashoggi I'm curious about the 10x table detection speedup that you mention! Where is the code for that? Would you like to contribute it to Camelot? :)

vinayak-mehta avatar Apr 04 '21 21:04 vinayak-mehta