amazon-textract-textractor
Intermittent failure of to_markdown() in lambda
On some pages I get the error 'KeyValue' object has no attribute 'reading_order' when trying to export to markdown in lambda.
I'm running a setup where I trigger textract with boto3 textract -> sns topic -> sqs -> lambda where the textractor library exports the markdown and tables to s3.
Some of the pages fail with that error on page.to_markdown(). I tried retrieving the job locally with the Textractor client and it works.
I'm using the most recent lambda layer for pdfium as well.
2025-03-21T15:51:59.099-07:00
[ERROR] 2025-03-21T22:51:59.099Z 15546c43-451d-47eb-8d9c-c78a3834b88f Traceback (most recent call last):
  File "/var/task/functions/complete_extraction.py", line 62, in process_md
    md = self.page.to_markdown()
         ^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/linearizable.py", line 59, in to_markdown
    return self.get_text(config)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/linearizable.py", line 24, in get_text
    text, _ = self.get_text_and_words(config=config)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/page.py", line 169, in get_text_and_words
    page_texts_and_words = [l.get_text_and_words(config) for l in sorted_layouts]
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/task/textractor/entities/layout.py", line 142, in get_text_and_words
    sorted(self.children, key=lambda x: x.reading_order)
  File "/var/task/textractor/entities/layout.py", line 142, in <lambda>
    sorted(self.children, key=lambda x: x.reading_order)
                                        ^^^^^^^^^^^^^^^
AttributeError: 'KeyValue' object has no attribute 'reading_order'
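The last two frames show the failure mechanism: `sorted` evaluates the key lambda on every child, so a single child without `reading_order` aborts the whole sort. The sketch below reproduces that with stand-in classes. `LayoutChild`, `KeyValueStub`, and both helper functions are hypothetical names for illustration, not Textractor classes, and the tolerant variant is only one possible local workaround, not the library's actual fix.

```python
from dataclasses import dataclass


@dataclass
class LayoutChild:
    """Hypothetical stand-in for a child entity that has a reading_order."""
    text: str
    reading_order: int


@dataclass
class KeyValueStub:
    """Hypothetical stand-in for an entity created without reading_order."""
    text: str


def sort_children_strict(children):
    # Same shape as the failing call in the traceback.
    return sorted(children, key=lambda x: x.reading_order)


def sort_children_tolerant(children):
    # Assumption: children without reading_order may simply sort last.
    return sorted(children, key=lambda x: getattr(x, "reading_order", float("inf")))


children = [LayoutChild("second", 2), KeyValueStub("form field"), LayoutChild("first", 1)]

try:
    sort_children_strict(children)
except AttributeError as exc:
    strict_error = str(exc)
else:
    strict_error = None

tolerant_order = [child.text for child in sort_children_tolerant(children)]
```

With this input the strict sort raises an `AttributeError` that mentions `reading_order`, while the tolerant sort places the attribute-less child after the ordered ones.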
So the issue is transient and you do not have a reproducible test case?
What version of Textractor are you using?
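To turn a transient failure into a reproducible case, one option is to render each page separately and record which ones raise. This is a minimal sketch: `find_failing_pages` and `StubPage` are hypothetical names, and the only thing assumed about a page is that it exposes the `to_markdown()` method already used in the traceback.

```python
def find_failing_pages(pages, render=lambda page: page.to_markdown()):
    """Render pages one by one; return (rendered, failures) keyed by page index."""
    rendered, failures = {}, {}
    for index, page in enumerate(pages):
        try:
            rendered[index] = render(page)
        except AttributeError as exc:
            # Keep the message so the failing page can be reported or re-run.
            failures[index] = str(exc)
    return rendered, failures


class StubPage:
    """Hypothetical page used only to demonstrate the helper."""

    def __init__(self, text, broken=False):
        self.text = text
        self.broken = broken

    def to_markdown(self):
        if self.broken:
            raise AttributeError("'KeyValue' object has no attribute 'reading_order'")
        return self.text


rendered, failures = find_failing_pages(
    [StubPage("page one"), StubPage("page two", broken=True), StubPage("page three")]
)
```

In a real run, the indexes in `failures` identify the pages whose Textract response is worth saving as a test case.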
@Belval we also have the same issue:
ERROR OCCURRED: AttributeError: 'KeyValue' object has no attribute 'reading_order'
It happens when we specify TextractFeatures.LAYOUT on the attached image:
Here is the code:
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# file_source is defined elsewhere (the attached image).
extractor = Textractor(region_name='us-east-1')
document = extractor.analyze_document(
    file_source=file_source,
    features=[TextractFeatures.LAYOUT, TextractFeatures.FORMS, TextractFeatures.TABLES],
    save_image=False
)
Every time I run this image, accessing the Document.text property raises this error. Moreover, calling document.get_text() also throws an exception.
We are using:
- amazon-textract-textractor==1.9.0
- amazon-textract-caller==0.2.4
- amazon-textract-response-parser==1.0.3
I'm also having the same issue. It started happening with version 1.9.x and above; version 1.8.x works fine.
I don't think the previous versions work either.
This seems similar to #424 which should be fixed in 1.9.2. Can you test with the latest version?
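Since the traceback shows the library loaded from /var/task, the version that matters is the one bundled with the Lambda package, which may differ from the local install. The following sketch checks it against the 1.9.2 release named above. `parse_version`, `installed_version`, and `has_fix` are hypothetical helpers, and the lookup assumes the package's metadata was shipped alongside the code; if it was not, `installed_version` returns `None`.

```python
import re
from importlib import metadata

FIXED_IN = (1, 9, 2)  # release named in the thread as containing the fix


def parse_version(text):
    """Turn '1.9.2' (or '1.9.2rc1') into a comparable tuple of three ints."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def installed_version(package="amazon-textract-textractor"):
    """Return the installed version string, or None if no metadata is found."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def has_fix(version_text):
    return parse_version(version_text) >= FIXED_IN


current = installed_version()
if current is None:
    print("amazon-textract-textractor metadata not found")
else:
    print(current, "has fix" if has_fix(current) else "predates 1.9.2")
```

Logging this once at cold start would show whether the deployed function is actually running the version under test.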