amazon-textract-textractor

issue with ordering in extractions, markdown and gettext methods

Open red-sky17 opened this issue 1 year ago • 14 comments

The attached input document contains text, then a table, followed by more text. We want the extracted text file to follow the same order as the input PDF file.

input_page

I tried extraction using different methods:

For 1.) and 2.), this is the code I am using:

    textract_json = extractor.start_document_analysis(
        file_source="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
        save_image=False,
    )
    response_textract_async = extractor.get_result(job_id=textract_json.job_id, api=Textract_API.ANALYZE)
    markdown_text = response_textract_async.to_markdown()

1.) .to_markdown() method

using_markdown_method

The issue here is that the two tables end up at the bottom.

2.) .get_text() method

using_gettext_method

In this case as well, the two tables end up at the bottom and, as expected, without the config parameter we do not get markdown output.

Now the third one is interesting. The code used for it is:

    from textractcaller.t_call import call_textract, Textract_Features
    from textractprettyprinter.t_pretty_print import get_text_from_layout_json

    textract_json = call_textract(
        input_document="s3://s3sagemakerbucket/textract_analysis/12382593_bnp_credit_facility_20m.pdf",
        features=[Textract_Features.LAYOUT, Textract_Features.TABLES],
    )

3.) get_text_from_layout_json(textract_json=textract_json)

I also tried get_text_from_layout_json(textract_json=textract_json, generate_markdown=True). In both cases I get the same output.

using_gettextfromlayout_1 using_gettextfromlayout_2

The issue with this method, as you can see, is that the data is repeated twice, and there is no markdown formatting.

@Belval or anyone, can you please suggest whether there is anything we can do to prevent this and get the text in the correct order, as it appears in the PDF file?

Thanks.

red-sky17 avatar Aug 17 '24 12:08 red-sky17

Please also look at the output for the attached PDF. The same issue is observed there: on the 1st page the tables are printed at the bottom, and the same happens on the second page. Egypt_EG01_Credit Agricole.pdf

this is for 2nd page: second_pdf_usingmarkdown

complete text file: Egypt_EG01_Credit Agricole_using_markdown.txt

red-sky17 avatar Aug 17 '24 13:08 red-sky17

Whereas the ordering is correct in this text file when it is extracted using get_text_from_layout_json(textract_json=textract_json). The remaining issue is the same as the one discussed in the first post (3.).

text file for reference:

Egypt_EG01_Credit Agricole_using_gettextfromlayout_json.txt

I am wondering whether this is a bug in the .to_markdown() and get_text() methods, because with get_text_from_layout_json() we get the output in the correct order.

Ultimately, the goal is to get the same extraction as with get_text_from_layout_json(), but with markdown table borders and no duplication.

So I believe it would be better if we could get the extraction right using only the .to_markdown() method, because it already produces markdown table borders and its only issue is ordering. That could probably be debugged by comparing how get_text_from_layout_json and to_markdown traverse the JSON dict.

red-sky17 avatar Aug 17 '24 14:08 red-sky17

I will test it first but this looks like a known issue that happens when the LAYOUT predictions do not match the TABLE predictions, causing the reading order to be wrong.

Belval avatar Aug 20 '24 18:08 Belval

What version of amazon-textract-textractor are you using? With 1.8.2 I get:

Page 2 of 10


Schneider Electric South East Asia (HQ) Pte. Ltd. Schneider Electric Overseas Asia Pte Ltd Schneider Electric Singapore Pte. Ltd. Schneider Electric IT Singapore Pte. Ltd. (formerly known as MGE Asia Pte Ltd) Schneider Electric IT Logistics Asia Pacific Pte. Ltd. Schneider Electric Logistics Asia Pte Ltd Schneider Electric Systems Singapore Pte. Ltd. (formerly known as Invensys Process Systems (S) Pte. Ltd.) 1 March 2017 

Previous Facility Letters. In the event that this Facility Letter is not accepted or lapses and is not extended by the Bank, the terms and conditions in the Previous Facility Letters shall continue to apply, save for any revision or amendments to the Interest Rate and any reduction in the amount of the Lines of Credit as stated herein. 

## A. LINE(S) OF CREDIT 



| AMOUNT          | TYPE                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|-----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| SGD20,000,000/- | Multi-currency Banker's Guarantee [including but not limited to Performance Guarantee or Payment Guarantee (for up to 60 months or such other tenor as may be agreed by the Bank from time to time) or to finance any other transactions acceptable to the Bank on a case-by-case subject to such conditions as may be determined by the Bank in its sole and absolute discretion] and/or Sight & Usance Letters of Credit (for up to 12 months) (with/ without control of goods) and/or Shipping Guarantee & Acceptance Under Usance Letters of Credit. |



## 1. PURPOSE 

The Facilities shall be used solely to finance the Borrower's working capital requirements. However, without prejudice to the Borrower's obligations, the Bank shall not be obliged to check that the Borrower does so or that the Facilities or any part thereof is utilized in such a manner. 

## 2. INTEREST RATE/COMMISSION/FEE 

(a) Commission on Banker's Guarantee shall be calculated on the face amount of the Banker's Guarantee for the period from the date of issuance upto the expiry date of the Banker's Guarantee, payable upfront as follows :- 
 
(b) Non-refundable Commission / Interest on the Trade Facilities shall be payable at the following rates and in the following manner:- 
(i) Letters of Credit 0.125% per month, minimum 2 months 



| Tenor                    | Commission    |
|--------------------------|---------------|
| Less than 3 years,       | 0.2%pa        |
| 3 years and upto 5 years | 0.25%pa       |

Which does not match what you are reporting.

Belval avatar Aug 20 '24 18:08 Belval

@Belval, I am attaching the input PDF. When tested on the single page I attached in the first post (the one you tested), it gives the same output you got, but when tested on the whole PDF I run into the issue.

I am using amazon-textract-textractor version 1.8.2

this_pdf.pdf

red-sky17 avatar Aug 21 '24 03:08 red-sky17

Thank you for clarifying and sharing the file, I will attempt to reproduce the issue.

Belval avatar Aug 21 '24 16:08 Belval

Hello @Belval, were you able to reproduce this issue?

red-sky17 avatar Sep 12 '24 09:09 red-sky17

I have noticed this a few times myself.

If order is important, I would usually get the bbox of the entity and sort by x or y axis.

Combining page ordering with entity bboxes guarantees that order is maintained in the output.

Of course, you will need to know the format of your input PDF beforehand to do this.
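
The bbox-based ordering described above can be sketched roughly as follows. This is only an illustration for a single-column page: `Block` and `reading_order` are hypothetical names, not part of textractor, and in practice the page number and coordinates would have to be read from each extracted entity's bounding box (an assumption about how you map your data onto this structure).

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for an extracted entity (text block or table)."""
    page: int    # 1-based page number
    x: float     # left edge of the bounding box, normalized to 0..1
    y: float     # top edge of the bounding box, normalized to 0..1
    text: str


def reading_order(blocks, y_tolerance=0.01):
    """Sort by page, then top to bottom, then left to right.

    The y coordinate is bucketed into rows of height y_tolerance so that
    blocks with nearly equal top edges are ordered by x instead of by
    tiny differences in y.
    """
    def key(block):
        row = round(block.y / y_tolerance)
        return (block.page, row, block.x)

    return sorted(blocks, key=key)


blocks = [
    Block(page=1, x=0.1, y=0.50, text="table"),
    Block(page=2, x=0.1, y=0.05, text="next page"),
    Block(page=1, x=0.1, y=0.80, text="outro"),
    Block(page=1, x=0.1, y=0.10, text="intro"),
]
for block in reading_order(blocks):
    print(block.page, block.text)
```

Multi-column layouts would need an extra step (for example grouping blocks by column first), which is why knowing the document format in advance matters.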

Chuukwudi avatar Nov 17 '24 23:11 Chuukwudi

We have a fix for this issue that will be included in version 1.8.6. It should be available by March 7th.

Belval avatar Feb 18 '25 21:02 Belval

Should be fixed in 1.9.0, let me know if that addresses your issue. The tables are now inserted correctly in the output.

Note that this will only fix the insertion in cases where the split can be done unambiguously (no overlap) otherwise it will default to the previous case.
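
To illustrate what "unambiguously (no overlap)" can mean, here is a minimal sketch. It is not the library's actual implementation; `table_insertion_index` and the `(top, bottom)` span representation are hypothetical. The idea is that a table can be placed between two text lines only when no line overlaps it vertically.

```python
def table_insertion_index(line_spans, table_span):
    """Return the index at which a table fits between text lines.

    line_spans: list of (top, bottom) tuples for text lines, sorted top to bottom.
    table_span: (top, bottom) tuple for the table.
    Returns None when some line overlaps the table vertically, meaning the
    split is ambiguous and a caller would have to fall back to another strategy.
    """
    table_top, table_bottom = table_span
    index = 0
    for top, bottom in line_spans:
        if bottom <= table_top:
            index += 1       # line lies entirely above the table
        elif top >= table_bottom:
            break            # this line and all later ones lie below the table
        else:
            return None      # line overlaps the table: ambiguous
    return index


lines = [(0.10, 0.20), (0.20, 0.30), (0.60, 0.70)]
print(table_insertion_index(lines, (0.35, 0.55)))  # table fits after the second line
print(table_insertion_index(lines, (0.25, 0.55)))  # table overlaps the second line
```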

Belval avatar Mar 07 '25 21:03 Belval

I will leave the issue open until you can confirm that this is fixed.

Belval avatar Mar 07 '25 21:03 Belval

@Belval can we get 1.9.0 pushed to the pypi repository please? https://pypi.org/project/amazon-textract-textractor/#history

Latest version is showing as 1.8.5 atm


gertct avatar Mar 11 '25 15:03 gertct

Thanks for the heads up. 1.9.0 should be in PyPI now. Note that it can take 1-2 hours for their cache to refresh.

See: https://github.com/aws-samples/amazon-textract-textractor/actions/runs/13792246430
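
For anyone following along, a quick way to check whether the local environment already has a release with the fix is sketched below. It uses only the standard library; the helper names `installed_version` and `at_least` are hypothetical, and the comparison assumes plain dotted numeric versions such as 1.8.5 or 1.9.0 (no pre-release suffixes).

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def at_least(installed, minimum):
    """Compare dotted numeric versions, e.g. '1.8.5' against '1.9.0'."""
    def parse(value):
        return tuple(int(part) for part in value.split("."))

    return parse(installed) >= parse(minimum)


found = installed_version("amazon-textract-textractor")
if found is None:
    print("amazon-textract-textractor is not installed")
elif at_least(found, "1.9.0"):
    print(f"{found} is 1.9.0 or newer")
else:
    print(f"{found} predates 1.9.0; upgrade with: pip install -U amazon-textract-textractor")
```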

Belval avatar Mar 11 '25 15:03 Belval

> I will leave the issue open until you can confirm that this is fixed.

@Belval, I tested it on my side and the results are much better now. The ordering issue is resolved. I noticed some discrepancies, but they are minor and, I believe, expected given the complex structure of the document.

I confirm the fix, we can close this. Thanks.

red-sky17 avatar Apr 02 '25 19:04 red-sky17