amazon-textract-textractor

Intermittent failure of to_markdown() in lambda

Open samiam376 opened this issue 8 months ago • 5 comments

On some pages I get the error 'KeyValue' object has no attribute 'reading_order' when exporting to markdown in Lambda.

My setup: I start the Textract job with boto3, and the completion notification flows through an SNS topic -> SQS -> Lambda, where the textractor library exports the markdown and tables to S3.
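For readers reproducing this pipeline, here is a minimal sketch of how the Lambda side might pull the Textract job ID out of the incoming SQS event. It assumes the standard SNS-to-SQS envelope (the Textract notification is a JSON string under "Message") and also tolerates raw message delivery. The function name `extract_textract_jobs` is hypothetical and is not taken from the original poster's code.

```python
import json


def extract_textract_jobs(event):
    """Return (job_id, status, api) tuples from an SQS event carrying Textract notifications."""
    jobs = []
    for record in event.get("Records", []):
        envelope = json.loads(record["body"])
        # Without raw message delivery, SNS wraps the payload as a JSON string under "Message".
        message = envelope.get("Message")
        # With raw message delivery, the body is already the Textract notification itself.
        payload = json.loads(message) if isinstance(message, str) else envelope
        jobs.append((payload["JobId"], payload.get("Status"), payload.get("API")))
    return jobs
```

Logging the job ID and status at this point makes it easier to fetch the same job locally when a page fails in Lambda.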

Some of the pages fail with that error on page.to_markdown(). When I retrieve the same job locally with the Textractor client, it works.

I'm using the most recent lambda layer for pdfium as well.

	
2025-03-21T15:51:59.099-07:00
[ERROR]	2025-03-21T22:51:59.099Z	15546c43-451d-47eb-8d9c-c78a3834b88f	Traceback (most recent call last):
  File "/var/task/functions/complete_extraction.py", line 62, in process_md
    md = self.page.to_markdown()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/linearizable.py", line 59, in to_markdown
    return self.get_text(config)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/linearizable.py", line 24, in get_text
    text, _ = self.get_text_and_words(config=config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/page.py", line 169, in get_text_and_words
    page_texts_and_words = [l.get_text_and_words(config) for l in sorted_layouts]
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/layout.py", line 142, in get_text_and_words
    sorted(self.children, key=lambda x: x.reading_order)
  File "/var/task/textractor/entities/layout.py", line 142, in <lambda>
    sorted(self.children, key=lambda x: x.reading_order)
                                        ^^^^^^^^^^^^^^^
AttributeError: 'KeyValue' object has no attribute 'reading_order'
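The last frames of the traceback show a sort keyed on `reading_order`, which fails as soon as one child lacks that attribute. The sketch below reproduces that failure mode with hypothetical stand-in classes (`LayoutChild` and `KeyValueLike` are illustrations, not textractor classes) and shows a tolerant sort using `getattr` with a default. It only illustrates the mechanism; it is not the library's code and not necessarily how any release fixes it.

```python
class LayoutChild:
    """Hypothetical stand-in for a child entity that carries a reading_order."""

    def __init__(self, name, reading_order):
        self.name = name
        self.reading_order = reading_order


class KeyValueLike:
    """Hypothetical stand-in for an entity created without a reading_order."""

    def __init__(self, name):
        self.name = name


def sort_children_strict(children):
    # Same pattern as in the traceback: raises AttributeError if any child lacks the attribute.
    return sorted(children, key=lambda x: x.reading_order)


def sort_children_tolerant(children):
    # Children without reading_order sort last; sorted() is stable, so they keep their relative order.
    return sorted(children, key=lambda x: getattr(x, "reading_order", float("inf")))
```

With a mixed list, `sort_children_strict` raises the same kind of `AttributeError` as in the log, while `sort_children_tolerant` returns the ordered children followed by the ones that have no reading order.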

samiam376, Mar 21 '25 22:03

So the issue is transient and you do not have a reproducible test case.

What version of Textractor are you using?

Belval, Mar 26 '25 17:03

@Belval we also have the same issue:

ERROR OCCURRED: AttributeError: 'KeyValue' object has no attribute 'reading_order'

It happens when we specify TextractFeatures.LAYOUT on the attached image:

Image

Here is the code:

from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor(region_name='us-east-1')

# file_source refers to the attached image; its definition is not shown here
document = extractor.analyze_document(
    file_source=file_source,
    features=[TextractFeatures.LAYOUT, TextractFeatures.FORMS, TextractFeatures.TABLES],
    save_image=False
)

Every time I run this image, accessing the document.text property raises this error. Moreover, calling document.get_text() also throws an exception.

We are using:

  • amazon-textract-textractor==1.9.0
  • amazon-textract-caller==0.2.4
  • amazon-textract-response-parser==1.0.3

Falstafff, Mar 26 '25 18:03

I'm also having the same issue. It started happening with version 1.9.x and above; version 1.8.x works fine.

marbonestu, Apr 10 '25 10:04

> I'm also having the same issue. It started happening with version 1.9.x and above; version 1.8.x works fine.

I don't think the previous versions work either.

aditinayak01, Apr 10 '25 19:04

This seems similar to #424 which should be fixed in 1.9.2. Can you test with the latest version?
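To confirm which version is actually running (for example inside the Lambda deployment package), something like the following sketch could be logged at startup. It uses the standard library's `importlib.metadata`; the helper names `installed_version` and `has_fix` are hypothetical, and the version parsing assumes plain dotted numeric versions such as 1.9.2.

```python
from importlib import metadata


def parse_version(version):
    """Turn '1.9.2' into (1, 9, 2); assumes plain dotted numeric versions."""
    return tuple(int(part) for part in version.split(".")[:3])


def installed_version(package="amazon-textract-textractor"):
    """Return the installed version string, or None if no package metadata is found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_fix(version, minimum="1.9.2"):
    """True if the given version string is at least the release mentioned above."""
    return version is not None and parse_version(version) >= parse_version(minimum)


if __name__ == "__main__":
    current = installed_version()
    print(f"amazon-textract-textractor: {current}, at least 1.9.2: {has_fix(current)}")
```

Note that `installed_version` returns None when the library was copied into the package without its dist-info metadata, so a None result does not by itself mean the library is missing.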

Belval, Apr 25 '25 16:04