document-understanding-solution Bug described in AWS DUS (Case 9497917291)

Describe the bug Uploaded 3 PDF files (reasonably large). 2 files ended with status "Failed"; 1 file has status "Ready", but when it is clicked, no data appears.

To Reproduce Upload the attached files to a DUS deployment (Kendra enabled).

Expected behavior

Please complete the following information about the solution:

[ ] Version: [e.g. v1.0.0] --- latest on Jan 17th
[ ] Region: [e.g. us-east-1] --- yes
[ ] Was the solution modified from the version published on this repository? --- no
[ ] If the answer to the previous question was yes, are the changes available on GitHub?
[ ] Have you checked your service quotas for the services this solution uses? --- yes
[ ] Were there any errors in the CloudWatch Logs? --- AWS service has reported that the issue is with Lambda timeouts.

2021_Review paper on NPs from Mitchell and Langer lab.pdf
US20200085974A1_Rna formulation for immunotherapy.pdf
US20210259980A1_Compositions and methods for organ specific delivery of nucleic acids.pdf

Screenshots If applicable, add screenshots to help explain your problem (please DO NOT include sensitive information).

Additional context Add any other context about the problem here.

Jan 25 '22 22:01 samirpadomega

Hi @samirpadomega, I believe I talked to Aimad about this case and was able to solve the issues you are seeing:

For this error:

botocore.errorfactory.TextSizeLimitExceededException: An error occurred (TextSizeLimitExceededException) when calling the BatchDetectEntities operation: Input text size exceeds limit. Max length of request text allowed is 5000 bytes while in this request the text size is 5002 bytes

you need to make the following change in source/lambda/helper/python/comprehendHelper.py, line 81, so that the projected size is measured in UTF-8 bytes:

projectedSize = (
    len(rawPages[pageResultIndex].encode('utf-8'))
    + len(block['Text'].encode('utf-8'))
    + 2
)
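To illustrate why the byte count matters, here is a minimal, self-contained sketch of packing text blocks into pages that stay under the 5000-byte limit quoted in the error. It is not the code from comprehendHelper.py: the names `pack_blocks`, `utf8_len`, and `MAX_BYTES` are hypothetical, the single-space separator is an assumption, and the `+ 2` simply mirrors the margin used in the change above.

```python
MAX_BYTES = 5000  # per-text limit quoted in the TextSizeLimitExceededException above


def utf8_len(text):
    """Return the size of text in UTF-8 bytes (not characters)."""
    return len(text.encode("utf-8"))


def pack_blocks(blocks, max_bytes=MAX_BYTES, separator=" "):
    """Greedily group text blocks into pages whose UTF-8 size stays within max_bytes.

    Hypothetical helper for illustration only. A single block that is larger
    than max_bytes on its own is NOT split here and would still be rejected.
    """
    pages = []
    current = ""
    for text in blocks:
        # Project the page size in bytes; the +2 mirrors the margin in the fix above.
        projected = utf8_len(current) + utf8_len(text) + 2
        if current and projected > max_bytes:
            # Adding this block would overflow, so close the page and start a new one.
            pages.append(current)
            current = text
        else:
            current = current + separator + text if current else text
    if current:
        pages.append(current)
    return pages


if __name__ == "__main__":
    # Each block is 1500 characters but 3000 bytes, because "é" takes 2 bytes in UTF-8.
    blocks = ["é" * 1500, "é" * 1500]
    for page in pack_blocks(blocks):
        print(len(page), "characters,", utf8_len(page), "bytes")
```

With a character-based check, the two blocks in the example would look like 3000 characters and be merged into one page of roughly 6000 bytes; measuring encoded bytes keeps them on separate pages.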

For the ReadTimeoutError raised when calling pdfgeneratorlambda, you need to make the following changes in generatePdf() in jobresultprocessorlambda:

import botocore (add this at the top of the file)

Lines 39-40:

config = botocore.config.Config(read_timeout=900, connect_timeout=900)
client = boto3.client('lambda', config=config)

These 2 changes should solve the issues you are seeing for the documents attached.

Feb 09 '22 19:02 ShivaniMehendarge

@samirpadomega in version v1.0.10, we also changed the default DPI in the pdfgenerator from 300 DPI to 100 DPI. This can also be controlled with the environment variable IMAGE_DPI set on the pdfgenerator lambda. I am closing this issue for now, but if something is still not working, please feel free to reopen the ticket or create a new one.
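For readers unfamiliar with this kind of override, the following is a rough sketch of how an environment variable with a fallback default is typically resolved. It is purely illustrative and is not the pdfgenerator's actual code: the function name `resolve_image_dpi` and the fallback behavior for invalid values are assumptions; only the variable name IMAGE_DPI and the default of 100 come from the comment above.

```python
import os

DEFAULT_IMAGE_DPI = 100  # default since v1.0.10, per the comment above


def resolve_image_dpi(environ=os.environ):
    """Return the DPI from IMAGE_DPI, falling back to the default.

    Hypothetical helper: falling back on missing, non-numeric, or
    non-positive values is an assumption made for this sketch.
    """
    raw = environ.get("IMAGE_DPI")
    if raw is None or not raw.strip():
        return DEFAULT_IMAGE_DPI
    try:
        dpi = int(raw)
    except ValueError:
        return DEFAULT_IMAGE_DPI
    return dpi if dpi > 0 else DEFAULT_IMAGE_DPI


if __name__ == "__main__":
    print(resolve_image_dpi({}))                    # no override set
    print(resolve_image_dpi({"IMAGE_DPI": "300"}))  # explicit override
```

A lower DPI produces smaller page images, which is presumably why the default was reduced for large documents like the ones in this issue.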

Mar 07 '23 18:03 knihit
