
Convert Document AI Object to Preserve Layout Text?

Open raad-altaie opened this issue 2 years ago • 11 comments

Is your feature request related to a problem? Please describe.

I've been using Google Document AI for text extraction from scanned documents, and it's been working well in terms of extracting text. However, I'm facing an issue when it comes to preserving the layout of the text.

In AWS Textract, there's a tool called "pretty print" that helps maintain the layout of extracted text. Tesseract, on the other hand, can preserve interword spaces with the config='-c preserve_interword_spaces=1' option, which does roughly the same thing. I really wish "python-documentai-toolbox" could support such output.

Describe the solution you'd like

documentai object => preserved layout text

Describe alternatives you've considered

Extracting text using the pdftotext library seemed like a viable option, but surprisingly, "python-documentai-toolbox" doesn't offer support for PDF output, which is rather baffling.

raad-altaie avatar Aug 30 '23 23:08 raad-altaie

Can you provide more information on what you mean by "preserving the layout of the text"?

Do you want all of the text to be printed to the screen or a TXT file in the same general locations as the source document?

An example of an input document and the output text would be useful.

This will likely be difficult to implement since the layout information extracted from Document AI is using Bounding Boxes with X, Y coordinates (which doesn't apply cleanly to TXT files.)

Document AI by design doesn't fill in the Document.text field with extra spaces/tabs to signify where the text sits on the page.

It could be possible to use the Document.Page.Block field to identify blocks of text and place them generally in the same order, but again this isn't going to be very exact since Coordinates don't have a 1-1 relationship in text files.
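A minimal sketch of that block-placement idea, assuming each block has already been reduced to its text plus the top-left corner of its bounding box in normalized (0 to 1) coordinates. The `Block` class and `render_layout` function are hypothetical names for illustration and are not part of python-documentai-toolbox; in a real Document the values would come from each block's layout.bounding_poly.normalized_vertices and layout.text_anchor. As noted above, the result is approximate: the grid size is arbitrary and overlapping blocks overwrite each other.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for one text block: its text plus the
    top-left corner of its bounding box in normalized (0..1) coordinates."""
    text: str
    x: float
    y: float


def render_layout(blocks, cols=80, rows=40):
    """Place each block on a cols x rows character grid and return the text."""
    grid = [[" "] * cols for _ in range(rows)]
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        # Scale normalized coordinates to a grid cell, clamped to the grid.
        row = min(int(block.y * rows), rows - 1)
        col = min(int(block.x * cols), cols - 1)
        for offset, char in enumerate(block.text):
            if col + offset >= cols:
                break  # truncate text that would run past the right edge
            grid[row][col + offset] = char
    lines = ["".join(line).rstrip() for line in grid]
    # Drop trailing empty lines so the output ends at the last text row.
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)
```

Choosing `cols` and `rows` close to the page's real character density (for example 80 columns for a typical letter page) gives the least distortion.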

holtskinner avatar Sep 08 '23 19:09 holtskinner

@holtskinner thank you for your response! What I am looking for is something like the example below.

image:

input

and the output I am getting is as follows:

Someto the left
Someto the left

Some in the middle
Some in the middle

Some with some tab
Some with some tab

Some with some space between them
Some with some space between them

Sometext here
Sometext here

this much
this much

How do I get an output string with the same structure as the image?

i.e. as follows:

 										         Some text here
 										         Some text here

Some to the left
Some to the left

 					Some in the middle
 					Some in the middle

 		Some with some tab
 		Some with some tab

Some with some space between them						this much
Some with some space between them						this much
  • Also, do you have an example of how I can use Document.Page.Block to restructure the document? (I'll give it a try.)

raad-altaie avatar Sep 08 '23 21:09 raad-altaie

we want to do the same thing here!

think-diff avatar Sep 23 '23 02:09 think-diff

At the very least, ensuring there are spaces between words in the text output from Document AI would be of great assistance. Sometimes, when words are in different entities but next to each other, the Document AI text blob shows them as "twowords" as opposed to "two words". Having a helper function that ensures the spaces are there would reduce custom post-processing for us.
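One way such a helper could look, as a rough sketch only. It assumes the caller has already collected (start, end) index pairs into the document text, for example from each token's or entity's text_anchor.text_segments. `join_segments` is a hypothetical name, not an existing toolbox function, and this version flattens line breaks into spaces.

```python
def join_segments(text, segments):
    """Join slices of `text` with exactly one space between neighbours.

    `segments` is an iterable of (start, end) index pairs into `text`.
    Whitespace inside each slice's edges is stripped, so adjacent
    segments never run together and never get doubled spaces.
    """
    pieces = []
    for start, end in sorted(segments):
        piece = text[start:end].strip()
        if piece:  # skip segments that contain only whitespace
            pieces.append(piece)
    return " ".join(pieces)
```

For instance, `join_segments("twowords", [(0, 3), (3, 8)])` yields the two slices separated by a single space.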

ThreeHAN avatar Dec 05 '23 16:12 ThreeHAN

+1, I want the same thing. Currently I'm using the PyMuPDF CLI to achieve this: python -m fitz gettext (see https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order).

Wish the same thing for the document generic OCR (I think the underlying mechanism should be similar, basically reconstructing the layout from the bounding box information https://github.com/pymupdf/PyMuPDF/blob/c0ae13746155e9bb5c11ab7e9a42c2e73758422e/src/main.py#L802)

nonlocalStream avatar Apr 15 '24 20:04 nonlocalStream

Hey all, I was able to get this mostly working! Here's a rough overview of the process for Python:
- For each page in a document, create a reportlab Canvas object
- Create a text layer on the Canvas object and write the text onto it, using the bounding box data
- Save the PDF and use poppler or pypdf to extract the text layer into a layout-preserved .txt file
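One detail in the second step is the coordinate conversion. Document AI's normalized vertices run from 0 to 1 with the origin at the top-left of the page, while a reportlab Canvas measures in points from the bottom-left. A small sketch of that mapping follows; `to_pdf_points` is a hypothetical helper name, it is not taken from the code described above, and it ignores any page transforms.

```python
def to_pdf_points(norm_x, norm_y, page_width, page_height):
    """Map a normalized top-left-origin point to PDF points.

    norm_x, norm_y: coordinates in the 0..1 range, origin at top-left.
    page_width, page_height: page size in points (e.g. 612 x 792 for US Letter).
    Returns (x, y) with the origin at bottom-left, as a PDF canvas expects.
    """
    x = norm_x * page_width
    y = (1.0 - norm_y) * page_height  # flip the vertical axis
    return x, y
```

The resulting pair could then be passed to a call such as reportlab's `canvas.drawString(x, y, text)`, with the font size chosen from the bounding box height.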

The one issue I'm still stuck on is handling documents when GCP performs preprocessing on them; see my issue here.

If someone is able to help me use the transforms field, I'm happy to invest some time tidying up my code and making a PR with the feature!

Attached is an example input and output. Input-SampleDocumentAITextLayout.pdf Output-SampleDocumentAITextLayout.txt

zkalson avatar Apr 18 '24 22:04 zkalson