ingest-file
Review PDF parsing strategy
What's broken?
- We're seeing incorrect text extraction out of some documents, especially those containing Arabic text.
- Text extracted from images isn't being placed at the right position in the surrounding text
- We have to maintain our own PDF binary bindings
- We extract the images in a document to files first, then run OCR on them. We don't really need to write them to disk.
What are our options?
- Continue with pdflib
- Try out pdfreader - https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-start
- Try out pdfminer.six
As far as I can tell, pdflib does not properly extract text from a PDF with a multi-column layout, which is quite common in documents designed to be printed on paper, such as official documents from administrative authorities.
The output of pdflib is of the form:
line_1_col_A line_1_col_B
line_2_col_A line_2_col_B
...
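To make the problem concrete, here is a minimal sketch in plain Python (no PDF library involved). The two functions and the sample data are hypothetical and only mimic the two reading orders: row-wise reading reproduces the interleaved output shown above, while column-wise reading gives the order a human reader expects.

```python
# Illustration only: two columns of a page, each a list of text lines.
col_a = ["line_1_col_A", "line_2_col_A"]
col_b = ["line_1_col_B", "line_2_col_B"]


def read_row_wise(left, right):
    """Join lines that share the same vertical position (interleaved columns)."""
    return [f"{a} {b}" for a, b in zip(left, right)]


def read_column_wise(left, right):
    """Read the whole left column first, then the right column."""
    return list(left) + list(right)


if __name__ == "__main__":
    print("\n".join(read_row_wise(col_a, col_b)))
    print("---")
    print("\n".join(read_column_wise(col_a, col_b)))
```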
@pudo suggested I take another look at pdfminer.six.
It does support multi-column layouts out of the box (see https://github.com/pdfminer/pdfminer.six/issues/276).
The parameters for layout analysis (https://github.com/pdfminer/pdfminer.six/blob/master/pdfminer/layout.py#L32) are exposed.
In my test cases, I only needed to adjust the value for char_margin to recover the correct text structure.
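For reference, a minimal sketch of how char_margin can be passed to pdfminer.six through `LAParams` and `extract_text`. The helper names and the value 1.0 are assumptions for illustration, not the value used in my tests; the library default for char_margin is 2.0, and the right value likely depends on the document.

```python
def layout_options(char_margin=2.0):
    """Build keyword arguments for pdfminer.layout.LAParams.

    Hypothetical helper: only char_margin is overridden, every other
    layout parameter keeps the library default.
    """
    return {"char_margin": char_margin}


def extract_text_with_margin(pdf_path, char_margin=1.0):
    """Extract text from a PDF using a custom char_margin (placeholder value)."""
    # Imported here so that layout_options() works without pdfminer.six installed.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    laparams = LAParams(**layout_options(char_margin))
    return extract_text(pdf_path, laparams=laparams)
```

Calling `extract_text_with_margin("some_file.pdf", char_margin=1.0)` with a few different values on a known multi-column document should show quickly whether the columns come out separated.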
I cannot perform thorough tests on edge cases, but would it be useful if I prepared a minimal PR that replaces pdflib with pdfminer.six?
I'd be very excited to see that PR, especially with a view to getting somewhat better plain-text construction, perhaps even placing OCR results correctly in the surrounding text. Obviously that isn't a goal for the first shot, but for me it's a reason to seriously consider pdfminer in the medium term.
I haven't tested Aleph yet, but I keep considering it for a project to analyze Arabic historical data. In that project, we ran into the problem of extracting Arabic text from PDFs. None of these libraries extracts Arabic correctly. However, we were able to make some changes to PDFMiner to address these issues. The issues are as follows:
- Since PDF parsing proceeds left to right, converting the output to RTL afterwards is not enough: everything gets fixed except the Arabic ligatures. We were able to fix this by creating subclasses in PDFMiner to handle them. I think this issue exists in all of the libraries mentioned, but I'm not sure.
- The other problem, which cannot be resolved automatically, is the CMAP for Arabic text. I found that a lot of PDF files have missing or messy CMAPs. The CMAP should link each glyph to its Unicode character. This has to be fixed manually for each affected file, and therefore before the ingestion process.
- Right now, we are working on fixing the order of text blocks, but that should not be an issue if only searching is required, since PDFMiner groups text lines correctly.
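To illustrate the ligature problem from the first point, here is a small sketch in plain Python. It is not the code from our PDFMiner subclasses; the function names and the sample word are made up for illustration. It assumes each glyph has already been mapped to its Unicode text and that a lam-alef ligature glyph maps to two characters. Reversing the line character by character breaks the ligature, while reversing glyph by glyph keeps it intact.

```python
# A lam-alef ligature glyph maps to two characters in logical order.
LAM_ALEF = "\u0644\u0627"

# Glyph texts in the left-to-right order they are drawn on the page
# for the word seen, lam, alef, meem (logical order U+0633 U+0644 U+0627 U+0645).
visual_glyphs = ["\u0645", LAM_ALEF, "\u0633"]


def reverse_characters(glyph_texts):
    """Naive fix: join everything, then reverse character by character."""
    return "".join(glyph_texts)[::-1]


def reverse_glyphs(glyph_texts):
    """Reverse the glyph order but keep each glyph's own text untouched."""
    return "".join(reversed(glyph_texts))


if __name__ == "__main__":
    expected = "\u0633\u0644\u0627\u0645"
    print(reverse_characters(visual_glyphs) == expected)  # alef ends up before lam
    print(reverse_glyphs(visual_glyphs) == expected)
```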
The problem with PDFMiner is that it's slow, since it's pure Python. As for its use with Aleph, it would be great to make the PDF extraction step extensible, like the other ingest steps.
I will finish the PDFMiner fixes and share the repository.
@mkhashoggi Wow, that is incredible context for you to document. Thank you SO MUCH for taking the time to write it up for us. I would love to see the repository of PDFMiner overrides that you mention. I also am a committer on that project and could maybe help to upstream some of it.
Thanks @pudo, I will share the repo before the end of next week. I appreciate your support.
Hello,
Here is the repo with the branch that supports RTL languages. It integrates with python-bidi to rearrange characters at the text-line level in order to support bi-directional languages. https://github.com/mkhashoggi/pdfminer.six/tree/supports_rtl
I'm adding all the features we have already worked on, such as detecting sub/superscripts, merging combining accents with their base characters, RTL top-to-bottom order, and finally detecting tables (I'm using the Camelot algorithm, but implemented purely in Python; the accuracy was identical to Camelot's and it ran more than 10x faster. Maybe this is useful for Aleph to convert PDF tables to structured entities). I will start discussing this on the PDFMiner page.
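As a rough idea of what merging combining accents with their base characters means, here is a minimal standard-library sketch. It is not the implementation in the branch: it works on an already extracted string and ignores glyph positions, and the function name is hypothetical.

```python
import unicodedata


def merge_combining_marks(text):
    """Group each combining mark with the base character before it.

    Returns a list of clusters. Where Unicode defines a precomposed form,
    NFC normalization collapses the cluster into a single character.
    """
    clusters = []
    for ch in text:
        if unicodedata.combining(ch) and clusters:
            # Attach the mark to the previous cluster and compose if possible.
            clusters[-1] = unicodedata.normalize("NFC", clusters[-1] + ch)
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    # "e" + combining acute accent, followed by "a"
    print(merge_combining_marks("e\u0301a"))
    # Arabic beh + fatha: no precomposed form, so the pair stays as one cluster
    print(merge_combining_marks("\u0628\u064e"))
```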
@vinayak-mehta Do you know how well camelot performs with RTL languages? Have you run any comparison with pdfminer.six? Thanks!
I just checked the Arabic PDF test that's in Camelot right now, and it looks like it's broken, probably because of the messy CMAP problem that @mkhashoggi mentioned, and because it cannot tell that the language is RTL.
I need to check out https://github.com/mkhashoggi/pdfminer.six/tree/supports_rtl but based on @mkhashoggi's messages above, it looks like it would be a good candidate for addition into pdfminer.six!
@mkhashoggi I'm curious about the 10x table detection speedup that you mention! Where is the code for that? Would you like to contribute it to Camelot? :)