ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT
I used the command below to extract text from a PDF using Textractor:
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)
I took the output file and added a ".json" extension (an optional step). Then I tried to run the CSV layout extraction example from the page below.
https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter
import csv
import io
import json

from trp.trp2 import TDocument, TDocumentSchema
from textractprettyprinter import get_layout_csv_from_trp2

with open(<some_test_file>) as input_fp:
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
layout_csv = get_layout_csv_from_trp2(trp2_doc)
csv_output = io.StringIO()
csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
for page in layout_csv:
    csv_writer.writerows(page)
# getvalue() returns the accumulated CSV text; printing the StringIO object only shows its repr
print(csv_output.getvalue())
json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) raises a ValidationError:
Cell In[13], line 4
1 with open("1.json") as input_fp:
2 TDocumentSchema().load(json.load(input_fp))
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
691 def load(
692 self,
693 data: (
(...)
700 unknown: str | None = None,
701 ):
702 """Deserialize a data structure to an object defined by this schema's fields.
703
704 :param data: The data to deserialize.
(...)
720 if invalid data are passed.
721 """
722 return self._do_load(
723 data, many=many, partial=partial, unknown=unknown, postprocess=True
724 )
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
907 exc = ValidationError(errors, data=data, valid_data=result)
908 self.handle_error(exc, data, many=many, partial=partial)
909 raise exc
911 return result
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'].........
I tried with both multi-page and single-page PDFs, and I always get this error.
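The messages above ("Field may not be null." on Confidence, Text and ColumnIndex) suggest that the JSON file contains those keys with explicit null values, which the schema rejects. One possible workaround, sketched below under that assumption, is to drop null-valued keys before loading. The helper name strip_nulls and the sample data are hypothetical and are not part of any Textract library.

```python
from typing import Any


def strip_nulls(value: Any) -> Any:
    """Recursively drop dict entries whose value is None."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    return value


# Hypothetical, heavily trimmed stand-in for the JSON read from the output file.
raw = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "abc", "Confidence": None, "Text": None, "ColumnIndex": None},
        {"BlockType": "LINE", "Id": "def", "Confidence": 99.5, "Text": "Hello", "ColumnIndex": None},
    ]
}

cleaned = strip_nulls(raw)
print(cleaned["Blocks"][0])  # {'BlockType': 'PAGE', 'Id': 'abc'}
```

The cleaned dict could then be passed to TDocumentSchema().load(cleaned) in place of the raw json.load(input_fp) result. Whether that is enough to make the load succeed has not been verified here.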
I uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json' and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using this path in production; I only tested it.
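Since the console response loaded without a validation error, it may be worth comparing against the result fetched through the API rather than the file written to S3. The sketch below pages through get_document_analysis using NextToken and merges the Blocks. The function name fetch_analysis and the stub client are hypothetical; the stub only stands in for a boto3 Textract client so the example runs without AWS access, and the job is assumed to have already succeeded.

```python
from typing import Any, Dict, List


def fetch_analysis(client: Any, job_id: str) -> Dict[str, Any]:
    """Collect every result page of a finished job into one response-shaped dict."""
    blocks: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"JobId": job_id}
    while True:
        page = client.get_document_analysis(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return {"DocumentMetadata": page.get("DocumentMetadata"), "Blocks": blocks}


class _StubClient:
    """Hypothetical stand-in that returns two canned pages, for a dry run only."""

    def get_document_analysis(self, **kwargs: Any) -> Dict[str, Any]:
        if "NextToken" not in kwargs:
            return {
                "JobStatus": "SUCCEEDED",
                "DocumentMetadata": {"Pages": 1},
                "Blocks": [{"BlockType": "PAGE", "Id": "p1"}],
                "NextToken": "t1",
            }
        return {
            "JobStatus": "SUCCEEDED",
            "DocumentMetadata": {"Pages": 1},
            "Blocks": [{"BlockType": "LINE", "Id": "l1", "Text": "Hello"}],
        }


result = fetch_analysis(_StubClient(), "job-123")
print(len(result["Blocks"]))  # 2
```

With a real client this would be called as fetch_analysis(boto3.client("textract"), response["JobId"]), where response is the return value of the start_document_analysis call above.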
Environment details:
Operating System: Windows 11 Pro
Python Version: 3.10.12
amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3
Any help in resolving this error is highly appreciated.
If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.
Thanks
Actually, the data is confidential, so unfortunately I will not be able to share it. The PDF had tables, hyperlinks, links and lists.
I have a similar issue when trying to convert a Document object to its trp2 representation. Tested with many documents, among them the fixture used in the Textractor test suite: textractor-singlepage-doc.pdf
document = parse_document_api_response(json.loads(response_list[0])["response"])
trp2_document = document.to_trp2()
The Document object from the Textractor parser:
This document holds the following data:
Pages - 1
Words - 53
Lines - 25
Key-values - 0
Checkboxes - 0
Tables - 1
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 0
However, calling to_trp2() raises the following validation error:
Error:
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], ....
I don't think this is an isolated issue: I have tested different types of documents and consistently hit the same error.
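Because the error dict is truncated in both reports, it can help to see which fields fail and how often. marshmallow's ValidationError exposes the nested dict as its messages attribute; the sketch below tallies the field names found in such a dict. The function name count_failing_fields is hypothetical, and the sample dict is an illustrative stand-in shaped like the errors above, not real output.

```python
from collections import Counter
from typing import Any, Optional


def count_failing_fields(messages: Any, counter: Optional[Counter] = None) -> Counter:
    """Count how often each field name carries an error list in a nested error dict."""
    if counter is None:
        counter = Counter()
    if isinstance(messages, dict):
        for key, value in messages.items():
            if isinstance(value, list):
                # Leaf: `key` is a field name and `value` its list of messages.
                counter[str(key)] += 1
            else:
                # Intermediate level such as 'Blocks' or a block index.
                count_failing_fields(value, counter)
    return counter


# Illustrative stand-in for ValidationError.messages (block 1 is made up).
sample = {
    "Blocks": {
        0: {
            "Confidence": ["Field may not be null."],
            "Text": ["Field may not be null."],
            "ColumnIndex": ["Field may not be null."],
        },
        1: {"ColumnIndex": ["Field may not be null."]},
    }
}

summary = count_failing_fields(sample)
for field, count in summary.most_common():
    print(field, count)
```

In practice this would be called inside an except ValidationError as err: block as count_failing_fields(err.messages), which gives a compact list of the fields to report in the issue without sharing the confidential document.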