
Layout Linearization Duplicates text and Relegates Tables to the End

Open kostabasis opened this issue 1 year ago • 8 comments

If you extract both LAYOUT and TABLES, the tables are for some reason printed at the end of the output rather than linearized in place. Related issue: https://github.com/aws-samples/amazon-textract-textractor/issues/274

My code:

    from textractor import Textractor
    from textractor.data.constants import TextractFeatures
    from textractor.data.text_linearization_config import TextLinearizationConfig
    # Assumption: get_text_from_layout_json comes from amazon-textract-prettyprinter.
    from textractprettyprinter.t_pretty_print import get_text_from_layout_json

    extractor = Textractor(profile_name="default")

    # png_path is the path to the input image, defined elsewhere.
    document = extractor.analyze_document(
        file_source=png_path,
        features=[
            TextractFeatures.LAYOUT,
            TextractFeatures.TABLES,
            TextractFeatures.SIGNATURES,
        ],
        save_image=True,
    )

    config = TextLinearizationConfig(
        title_prefix="# ",
        section_header_prefix="## ",
        add_prefixes_and_suffixes_in_text=True,
        table_tabulate_format="fancy_grid".lower(),
        table_remove_column_headers=True,
    )

    # Linearized text from textractor
    extracted_text = document.get_text(config=config)
    # Page 1 of the markdown produced from the raw response, for comparison
    print(get_text_from_layout_json(textract_json=document.response, generate_markdown=True)[1])

kostabasis, Jan 18 '24 21:01

Can you share an asset that exhibits this behavior? It should be addressed by https://github.com/aws-samples/amazon-textract-textractor/pull/298/ but I would like to make sure before deploying it.

Belval, Jan 18 '24 22:01

Unfortunately I'm working with sensitive documents, and my boss told me we cannot share them. We re-tested and it seems that it now works for the majority of documents; only some are buggy, and we can't seem to reproduce the problem on anything we can share. However, I can give some context. I suspect it's because the buggy documents look almost like two-column documents: they are composed of a list of labels, with the corresponding information to the right of each label. Sometimes this information is a table (for example, a table of rents over time), like so (_ represents a space, for formatting):

  1. LEASE
     a. Individual leasing_______________Harry Potter
     b. Date__________________________01/01/01
     c. Rents ___________________________
        ______________________________Months _______________Amount
        _______________07-09/1900$25
        _______________09-12/1900$25
        _______________01-12/1902$35
        _______________01-12/1903$40
        _______________01-12/1904$50
     d. Landlord_________________Tom Riddle
     e. Terms. ______________________________
        - ... ...

In this example, the months table is parsed correctly, but in the linearized layout it is relegated to the end of the document. This document is also kind of like a form, but enabling the FORMS feature messes up the output as well, and it doesn't detect that the table is associated with the "Rents" key. Besides, it's more of a list, and there is really no need for key-value pairs here. Any suggestions are welcome if this is more of a usage error than a library shortcoming.

kostabasis, Jan 19 '24 17:01

That's fine, to test locally you can do:

  1. git clone git@github.com:aws-samples/amazon-textract-textractor.git
  2. cd amazon-textract-textractor
  3. git checkout origin/version-1.7.0
  4. pip install -e .
  5. Test

When testing, you may want to visualize the results with `document.pages[0].layouts.visualize().save("out.png")`. What often happens when tables are relegated to the end is that they overlap with a larger layout element such as a LAYOUT_KEY_VALUE; this "pushes" the table lower in the linearized text.
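To check for that kind of overlap numerically rather than by eye, something like the following rough sketch could help. The `Box` type and `overlap_ratio` helper are hypothetical illustrations and are not part of textractor; the coordinates are made up and only mimic Textract's normalized (0 to 1) bounding boxes.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical normalized bounding box (not a textractor class)."""
    x: float
    y: float
    width: float
    height: float


def overlap_ratio(inner: Box, outer: Box) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    left = max(inner.x, outer.x)
    top = max(inner.y, outer.y)
    right = min(inner.x + inner.width, outer.x + outer.width)
    bottom = min(inner.y + inner.height, outer.y + outer.height)
    if right <= left or bottom <= top:
        return 0.0  # the boxes do not intersect
    inner_area = inner.width * inner.height
    if inner_area == 0:
        return 0.0
    return ((right - left) * (bottom - top)) / inner_area


# Made-up coordinates: a table sitting entirely inside a key-value layout,
# and another element that lies below it.
key_values = Box(x=0.05, y=0.10, width=0.90, height=0.60)
table = Box(x=0.40, y=0.25, width=0.40, height=0.20)
separate = Box(x=0.05, y=0.80, width=0.90, height=0.10)

table_overlap = overlap_ratio(table, key_values)        # table is contained
separate_overlap = overlap_ratio(separate, key_values)  # no intersection
print(f"table: {table_overlap:.2f}, separate: {separate_overlap:.2f}")
```

A table whose ratio against a larger layout element is close to 1 is a likely candidate for being pushed down in the linearized text.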

If you find an asset that you can share, feel free to send it directly at belvae [AT] amazon.com.

Hope that helps!

Belval, Jan 19 '24 17:01

Awesome, thanks for all the help!

kostabasis, Jan 19 '24 17:01

I'm also seeing an issue like this using amazon-textract-textractor 1.7.2.

I noticed that `closest_reading_order_distance` in `Page#get_text_and_words` isn't being set anywhere, so the if statement checking `closest_reading_order_distance is None` is always true. This results in the unsorted layouts being inserted after the same layout on each iteration (because the unsorted layouts being inserted have a reading order of -1), so the last element is always the same one. https://github.com/aws-samples/amazon-textract-textractor/blob/5ea39f8e1621836d0d357666d651aa88630dbbcb/textractor/entities/page.py#L159-L179

I tried setting `closest_reading_order_distance = dist` inside the if statement after L173 (I'm assuming that was the intention), but that breaks the reading order pretty much everywhere AFAICT.
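The always-true condition can be illustrated with a minimal standalone sketch. This is not the code from page.py: the function names and positions are hypothetical, and it only shows why a "closest so far" tracker that is never assigned makes the last candidate win every time.

```python
from typing import List, Optional


def closest_index_buggy(positions: List[float], value: float) -> Optional[int]:
    """Pick the 'closest' position, but never record the best distance."""
    closest_distance = None  # stays None for the whole loop
    closest_index = None
    for index, position in enumerate(positions):
        dist = abs(position - value)
        if closest_distance is None or dist < closest_distance:
            # Always taken, because closest_distance is never assigned.
            closest_index = index
    return closest_index


def closest_index_tracked(positions: List[float], value: float) -> Optional[int]:
    """Same loop, but the best distance seen so far is recorded."""
    closest_distance = None
    closest_index = None
    for index, position in enumerate(positions):
        dist = abs(position - value)
        if closest_distance is None or dist < closest_distance:
            closest_distance = dist
            closest_index = index
    return closest_index


# Made-up vertical positions of three layouts that already have an order.
positions = [0.10, 0.50, 0.90]
buggy = closest_index_buggy(positions, 0.15)      # last index, whatever the value
tracked = closest_index_tracked(positions, 0.15)  # index of the nearest position
print(buggy, tracked)
```

As noted above, making the equivalent change in the library broke the reading order elsewhere, so the sketch shows the symptom rather than a verified fix.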

Running get_layout_csv_from_trp2 with the same textract response results in the correct reading order (but in CSV format).

stevehodgkiss, Feb 18 '24 09:02

Hi @stevehodgkiss, what Textract features are used to get your Textract response? The code you shared does seem to have the bug you described, but I am not convinced that this is the cause of the behaviour that you are describing.

Could you possibly provide the response itself and the original asset(s)? Reproducing the issue helps with troubleshooting.

Belval, Feb 19 '24 14:02

Hi @Belval, I'm running start_document_analysis with "QUERIES", "SIGNATURES", "LAYOUT", "TABLES", "FORMS". I'll send you the specific page and response that cause the issue by email. I believe it could also be related to the way Textract itself is generating the response (there's a layout element within a list that appears larger than it should be AFAICT), but it's interesting that get_layout_csv_from_trp2 can correctly linearize it.

stevehodgkiss, Feb 19 '24 18:02

I had the same experience. get_layout_csv_from_trp2 on my document gave a correctly linearized CSV.

kostabasis, Mar 20 '24 20:03