amazon-textract-response-parser

ValidationError in TDocumentSchema().load for json from start_document_analysis LAYOUT

Open Risho92 opened this issue 2 years ago • 3 comments

I used the call below to extract text from a PDF with Textract:

response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': Bucket,
            'Name': Name
        }
    },
    FeatureTypes=['LAYOUT', 'FORMS'],
    OutputConfig={
        'S3Bucket': S3Bucket,
        'S3Prefix': S3Prefix
    },
    KMSKeyId=KMSKeyId
)

I took the output file and added a ".json" extension (an optional step). Then I tried to run the CSV layout extraction example from the page below.

https://github.com/aws-samples/amazon-textract-textractor/tree/master/prettyprinter

import csv
import io
import json

from textractprettyprinter import get_layout_csv_from_trp2
from trp.trp2 import TDocument, TDocumentSchema

with open(<some_test_file>) as input_fp:
    trp2_doc: TDocument = TDocumentSchema().load(json.load(input_fp))
    layout_csv = get_layout_csv_from_trp2(trp2_doc)
    csv_output = io.StringIO()
    csv_writer = csv.writer(csv_output, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for page in layout_csv:
        csv_writer.writerows(page)
    # Print the accumulated CSV text rather than the StringIO object itself
    print(csv_output.getvalue())

json.load(input_fp) works fine, but TDocumentSchema().load(json.load(input_fp)) raises a ValidationError:
Cell In[13], line 4
	1 with open("1.json") as input_fp:
	2 	TDocumentSchema().load(json.load(input_fp))

File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:722, in Schema.load(self, data, many, partial, unknown)
	691 def load(
	692 	self,
	693 	data: (
	(...)
	700 unknown: str | None = None,
	701 ):
	702 		"""Deserialize a data structure to an object defined by this schema's fields.
	703
	704 		:param data: The data to deserialize.
	(...)
	720 			if invalid data are passed.
	721			"""
	722 	return self._do_load(
	723 		data, many=many, partial=partial, unknown=unknown, postprocess=True
	724 	)
	
File ~\.conda\envs\python310\lib\site-packages\marshmallow\schema.py:909, in Schema._do_load(self, data, many, partial, unknown, postprocess)
	907 	exc = ValidationError(errors, data=data, valid_data=result)
	908 	self.handle_error(exc, data, many=many, partial=partial)
	909 	raise exc
	911 return result
	
ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], 'ColumnIndex': ['Field may not be null.'].........
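
Because the error dictionary is long and truncated above, it can help to tally which field names fail validation across all blocks. The following is a minimal sketch, assuming the error payload has the nested shape shown above ('Blocks' mapped to block index, mapped to field name, mapped to a list of messages). The function name tally_null_errors and the sample dictionary are hypothetical and only illustrate the idea.

```python
from collections import Counter


def tally_null_errors(messages):
    """Count how many blocks report an error for each top-level field name."""
    counts = Counter()
    for block_errors in messages.get("Blocks", {}).values():
        for field_name in block_errors:
            counts[field_name] += 1
    return counts


# Hypothetical error payload, shaped like the ValidationError shown above
sample = {
    "Blocks": {
        0: {
            "Confidence": ["Field may not be null."],
            "Text": ["Field may not be null."],
        },
        1: {"Text": ["Field may not be null."]},
    }
}

for field_name, count in tally_null_errors(sample).most_common():
    print(field_name, count)
```

With marshmallow, the dictionary to pass in is available as the messages attribute of the caught ValidationError.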

I tried with both multi-page and single-page PDFs, but I always get this error.

I also uploaded a document through the console (Amazon Textract --> Bulk Document Uploader --> Upload Documents --> AnalyzeDocument - Layout), took the output 'analyzeDocResponse.json', and followed the same procedure. TDocumentSchema().load(json.load(input_fp)) worked fine on that file, but get_layout_csv_from_trp2(trp2_doc) failed with the error "AttributeError: 'NoneType' object has no attribute 'ids'". I am not looking for a fix for this second error, as I will not be using this path in production; I only tested it.
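
Since one file loads and the other does not, one way to compare them is to count the fields that are explicitly null in each. The sketch below assumes, without confirmation anywhere in this thread, that the file written through OutputConfig stores absent fields as explicit nulls while the console output omits them. The helper names count_null_keys and strip_nulls and the miniature response are hypothetical.

```python
import json


def count_null_keys(response):
    """Return {field name: number of blocks where that field is explicitly null}."""
    counts = {}
    for block in response.get("Blocks", []):
        for key, value in block.items():
            if value is None:
                counts[key] = counts.get(key, 0) + 1
    return counts


def strip_nulls(value):
    """Recursively drop dict keys whose value is None; list items are left in place."""
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(item) for item in value]
    return value


# Hypothetical miniature response; a real file would be read with json.load
raw = json.loads(
    '{"Blocks": ['
    '{"BlockType": "PAGE", "Id": "a", "Text": null, "Confidence": null},'
    '{"BlockType": "LINE", "Id": "b", "Text": "hello", "Confidence": 99.5,'
    ' "ColumnIndex": null}]}'
)

print(count_null_keys(raw))
cleaned = strip_nulls(raw)
print(count_null_keys(cleaned))
```

If the failing file reports many null fields and the working one reports none, passing the cleaned dictionary to TDocumentSchema().load would be a natural next experiment. Whether that resolves the error has not been verified here.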


Environment details:

Operating System: Windows 11 Pro
Python Version: 3.10.12

amazon-textract-caller==0.2.1
amazon-textract-pipeline-pagedimensions==0.0.9
amazon-textract-prettyprinter==0.1.8
amazon-textract-textractor==1.4.5
amazon-textract-response-parser==1.0.2
marshmallow==3.20.1
textract-trp==0.1.3

Any help resolving this error would be highly appreciated.

Risho92, Dec 01 '23 21:12

If the asset is not confidential, please attach the .json file to the issue; it helps a lot when debugging. If you do not feel comfortable sharing the JSON on GitHub, you can also send it directly to belvae[at]amazon.com and I'll take a look.

Thanks

Belval, Dec 01 '23 21:12

Actually, the data is confidential, so unfortunately I will not be able to share it. The PDF had tables, hyperlinks, links and lists.

Risho92, Dec 01 '23 23:12

I have a similar issue when trying to convert a Document object to its trp2 representation. I tested with many documents, among them the fixture used in the Textractor test suite, textractor-singlepage-doc.pdf:

document = parse_document_api_response(json.loads(response_list[0])["response"])
trp2_document = document.to_trp2()

The Document object from the Textractor parser:

This document holds the following data:
Pages - 1
Words - 53
Lines - 25
Key-values - 0
Checkboxes - 0
Tables - 1
Queries - 0
Signatures - 0
Identity Documents - 0
Expense Documents - 0

However, calling to_trp2() raises the following validation error:

ValidationError: {'Blocks': {0: {'Confidence': ['Field may not be null.'], 'Text': ['Field may not be null.'], ....

I don't think it's an isolated issue, as I have tested with different types of documents and have consistently hit the same error.

sarahboufelja, Jan 06 '25 13:01