amazon-textract-textractor
Analyze documents with Amazon Textract and generate output in multiple formats.
I noticed that even when testing extreme values of heuristic_line_break_threshold, heuristic_overlap_ratio, and heuristic_h_tolerance there was no change in the output. This led me to examine their use in the library,...
The library is unable to fetch the text under "Manufacturer/Model:"; however, we are able to see it in the AWS Textract console.
The imports for [t_pretty_print_layout.py](https://github.com/aws-samples/amazon-textract-textractor/blob/master/prettyprinter/textractprettyprinter/t_pretty_print_layout.py):
```
import os
import warnings
import logging
from trp.trp2 import TDocument
from typing import List
```
Then [line 262](https://github.com/aws-samples/amazon-textract-textractor/blob/780307f6db12250160d5809ec1524449bcbd22d3/prettyprinter/textractprettyprinter/t_pretty_print_layout.py#L262) expects `t2`:
```
relationships: t2.TRelationship = page.get_relationships_for_type()...
```
There's an issue when I get the text in Markdown format. For some reason, all the lists duplicate the text. First as "plaintext" and then with the proper Markdown format....
When passing `TextLinearizationConfig(linearize_table=False)` into `document.get_text` with a document that includes a table, the linearized output still includes the table. Example code:
```
document = extractor.analyze_document(
    file_source=image,
    features=[
        TextractFeatures.TABLES,
        TextractFeatures.FORMS,...
```
Is there any functionality that allows us to filter by feature type when extracting text? The text of tables and other elements is also being extracted.
Hello, I was trying to run examples from this repo. I do not have prior experience with AWS. I am using conda env on Ubuntu 22.04. As per instructions I...
I am parsing an existing JSON response from the asynchronous call - **textract.start_document_analysis()** but it fails to parse it. I have a multipage pdf. I get an AssertionError - ```...
Hi, if I have a several hundred page PDF, is it possible to select a subset of pages to parse?
Hello. What should we do when we have a PDF with multiple pages? Thanks, Alex