Convert Document AI Object to Preserve Layout Text?
Is your feature request related to a problem? Please describe.
I've been using Google Document AI for text extraction from scanned documents, and it's been working well in terms of extracting text. However, I'm facing an issue when it comes to preserving the layout of the text.
In AWS Textract, there's a tool called "pretty print" that helps maintain the layout of extracted text. Tesseract, on the other hand, allows for preserving interword spaces using the config='-c preserve_interword_spaces=1' option which is kind of does the same thing.
I really wish if "python-documentai-toolbox" could support such output.
Describe the solution you'd like
documentai object => preserved layout text
Describe alternatives you've considered
Extracting text using the pdftotext library seemed like a viable option, but surprisingly, "python-documentai-toolbox" doesn't offer support for PDF output, which is rather baffling.
Can you provide more information on what you mean by "preserving the layout of the text"?
Do you want all of the text to be printed to the screen or a TXT file in the same general locations as the source document?
An example of an input document and the output text would be useful.
This will likely be difficult to implement since the layout information extracted from Document AI is using Bounding Boxes with X, Y coordinates (which doesn't apply cleanly to TXT files.)
Document AI by design doesn't fill in the Document.text field with extra spaces/tabs to signify where the text sits on the page.
It could be possible to use the Document.Page.Block field to identify blocks of text and place them generally in the same order, but again this isn't going to be very exact since Coordinates don't have a 1-1 relationship in text files.
@holtskinner thank you for your response! what i am looking for something like the example below.
image:

and the output I am getting is as follows:
Someto the left
Someto the left
Some in the middle
Some in the middle
Some with some tab
Some with some tab
Some with some space between them
Some with some space between them
Sometext here
Sometext here
this much
this much
How do I get the desired output string as of the same structure in image?
i.e. as follows:
Some text here
Some text here
Some to the left
Some to the left
Some in the middle
Some in the middle
Some with some tab
Some with some tab
Some with some space between them this much
Some with some space between them this much
-
also do you have an example how i can use
Document.Page.Blockto restructure the document ( ill give it a try)?
we want to do the same thing here!
At there very least, ensuring there are spaces between words in the text output from document AI would be of great assistance. Sometimes, when words are in different entities but next to each other, the Document AI text blob shows them as twowords as opposed to two words. Having a helper function ensure spaces are there would reduce custom post processing for us.
+1 I want the same thing. Currently I'm using PyMuPdf cli to achieve this python -m fitz gettext https://pymupdf.readthedocs.io/en/latest/recipes-text.html#how-to-extract-text-in-natural-reading-order
Wish the same thing for the document generic OCR (I think the underlying mechanism should be similar, basically reconstructing the layout from the bounding box information https://github.com/pymupdf/PyMuPDF/blob/c0ae13746155e9bb5c11ab7e9a42c2e73758422e/src/main.py#L802)
Hey all, I was able to get this mostly working! Here's a rough overview of the process for Python: -For each page in a document, create a reportlab Canvas object -Create a text layer on the Canvas object and write the text onto it, using the bounding box data -Save the PDF and use poppler or pypdf to extract the text layer into a layout-preserved .txt file
The one issue I'm still stuck on is handling documents when GCP performs preprocessing on them see my issue here
If someone is able to help me use the transforms field, I'm happy to invest some time tidying up my code and making a PR with the feature!
Attached is an example input and output. Input-SampleDocumentAITextLayout.pdf Output-SampleDocumentAITextLayout.txt