amazon-textract-textractor issues

heuristic_line_break_threshold, along with other heuristic constants not doing anything

4

I noticed that even when testing extreme values of heuristic_line_break_threshold, heuristic_overlap_ratio, and heuristic_h_tolerance there was no change in the output. This led me to examine their use in the library,...

kostabasis

Not able to fetch Handwritten Text from Document

1

The library is unable to fetch the text under Manufacturer/Model. However, we are able to see it via the AWS Textract console.

naconcirrus

need repro

two line import fix for get_layout_csv_from_trp2

1

The imports for [t_pretty_print_layout.py](https://github.com/aws-samples/amazon-textract-textractor/blob/master/prettyprinter/textractprettyprinter/t_pretty_print_layout.py):

```
import os
import warnings
import logging
from trp.trp2 import TDocument
from typing import List
```

Then [line 262](https://github.com/aws-samples/amazon-textract-textractor/blob/780307f6db12250160d5809ec1524449bcbd22d3/prettyprinter/textractprettyprinter/t_pretty_print_layout.py#L262) expects `t2`:

```
relationships: t2.TRelationship = page.get_relationships_for_type()...
```

scott-norm

Issue with Markdown output (textractprettyprinter)

1

There's an issue when I get the text in Markdown format. For some reason, the text of every list is duplicated: first as "plaintext" and then with the proper Markdown format....

jpbalarini

linearize_table False doesn't exclude table

5

When passing `TextLinearizationConfig(linearize_table=False)` into `document.get_text` with a document that includes a table, the linearized output still includes the table. Example code:

```
document = extractor.analyze_document(
    file_source=image,
    features=[
        TextractFeatures.TABLES,
        TextractFeatures.FORMS,...
```

eilam-stream

Get Text from Document object but filter out Tables and Forms Text

1

Is there any functionality that allows us to filter out different types of features when extracting text? The text of tables and other features is still being extracted.

prashants975
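
For the question above, one possible workaround is to post-filter the raw Textract response rather than rely on a library option. The sketch below is not part of textractor; `lines_outside_tables_and_forms` and `child_ids` are hypothetical helper names. It assumes the standard Textract block schema, in which `LINE`, `CELL`, and `KEY_VALUE_SET` blocks reference their `WORD` blocks through `CHILD` relationships, and it drops any line that shares at least one word with a table cell or a form field.

```python
from typing import Any, Dict, List, Set

Block = Dict[str, Any]


def child_ids(block: Block) -> Set[str]:
    """Return the IDs listed in a block's CHILD relationships."""
    ids: Set[str] = set()
    for rel in block.get("Relationships") or []:
        if rel.get("Type") == "CHILD":
            ids.update(rel.get("Ids", []))
    return ids


def lines_outside_tables_and_forms(blocks: List[Block]) -> List[str]:
    """Hypothetical helper: keep LINE text not owned by a cell or key/value set."""
    # Collect every WORD id that belongs to a table cell or a form key/value.
    excluded: Set[str] = set()
    for block in blocks:
        if block.get("BlockType") in ("CELL", "KEY_VALUE_SET"):
            excluded |= child_ids(block)

    kept: List[str] = []
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        # A line is dropped entirely if any of its words is excluded.
        if child_ids(block) & excluded:
            continue
        kept.append(block.get("Text", ""))
    return kept


# Tiny hand-written sample, not real Textract output.
sample_blocks = [
    {"Id": "l1", "BlockType": "LINE", "Text": "Invoice summary",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "l2", "BlockType": "LINE", "Text": "Total 42",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c1", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c2", "BlockType": "CELL",
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
]

print(lines_outside_tables_and_forms(sample_blocks))
```

Dropping a whole line when only some of its words sit in a table is a deliberate simplification; a stricter variant could rebuild the line from the remaining words.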

I am missing something... ~/.aws/config and ~/.aws/credentials

3

Hello, I was trying to run examples from this repo. I do not have prior experience with AWS. I am using conda env on Ubuntu 22.04. As per instructions I...

chandailrc
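
As a first diagnostic for setup questions like the one above, it can help to confirm that the two files exist and contain the sections one expects. The sketch below uses only the Python standard library; `describe_aws_files` is a hypothetical helper name. It assumes the usual INI layout, in which `credentials` holds sections such as `[default]` and `config` holds `[default]` or `[profile name]` sections.

```python
import configparser
from pathlib import Path
from typing import Dict, List


def describe_aws_files(aws_dir: Path) -> Dict[str, List[str]]:
    """Hypothetical helper: map 'config' and 'credentials' to their section names."""
    found: Dict[str, List[str]] = {}
    for name in ("config", "credentials"):
        parser = configparser.ConfigParser()
        # read() silently skips a missing file, so the list is simply empty.
        parser.read(aws_dir / name)
        found[name] = parser.sections()
    return found


if __name__ == "__main__":
    # Prints section names only, never the secret values.
    print(describe_aws_files(Path.home() / ".aws"))
```

An empty list for either file suggests that the file is missing or has no profile defined, which is worth ruling out before debugging the library itself.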

parse an existing JSON - from textract.start_document_analysis() throws AssertionError

9

I am parsing an existing JSON response from the asynchronous call **textract.start_document_analysis()**, but parsing fails. The document is a multipage PDF. I get an AssertionError: ...

sankalp-wns

need repro
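
The snippet above does not show the cause of the error, so the following is only one thing to rule out. Results of an asynchronous analysis can come back in several parts linked by `NextToken`, and a parser may need all `Blocks` in a single response-shaped dictionary. The sketch below merges such parts; `merge_analysis_pages` is a hypothetical helper name, not part of textractor or boto3, and the sample data is hand-written.

```python
from typing import Any, Dict, List


def merge_analysis_pages(responses: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: combine paginated results into one response dict."""
    if not responses:
        raise ValueError("expected at least one response")
    # Keep top-level fields (e.g. DocumentMetadata) from the first part,
    # but drop NextToken since the merged result is complete.
    merged = {k: v for k, v in responses[0].items() if k != "NextToken"}
    # Concatenate the blocks of every part in the order they were fetched.
    merged["Blocks"] = [b for r in responses for b in r.get("Blocks", [])]
    return merged


# Hand-written stand-ins for two consecutive result parts.
part_1 = {
    "DocumentMetadata": {"Pages": 2},
    "JobStatus": "SUCCEEDED",
    "NextToken": "abc",
    "Blocks": [{"Id": "p1", "BlockType": "PAGE", "Page": 1}],
}
part_2 = {
    "DocumentMetadata": {"Pages": 2},
    "JobStatus": "SUCCEEDED",
    "Blocks": [{"Id": "p2", "BlockType": "PAGE", "Page": 2}],
}

merged = merge_analysis_pages([part_1, part_2])
print(len(merged["Blocks"]), "NextToken" in merged)
```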

Parse specific pages only

4

Hi, if I have a several hundred page PDF, is it possible to select a subset of pages to parse?

austinmw

enhancement
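
Pending a library feature, one workaround for the request above is to filter an already obtained response by page. The sketch below relies on the `Page` field that Textract attaches to each block; `blocks_for_pages` is a hypothetical helper name, and treating a missing `Page` as page 1 is an assumption. Note that this filters after analysis, so it does not reduce what Textract processes; avoiding that would require splitting the PDF beforehand.

```python
from typing import Any, Dict, Iterable, List


def blocks_for_pages(blocks: List[Dict[str, Any]],
                     pages: Iterable[int]) -> List[Dict[str, Any]]:
    """Hypothetical helper: keep only blocks detected on the requested pages."""
    wanted = set(pages)
    # Assumption: a block without a Page field belongs to page 1.
    return [b for b in blocks if b.get("Page", 1) in wanted]


# Hand-written sample blocks, not real Textract output.
sample = [
    {"Id": "a", "BlockType": "LINE", "Text": "first page", "Page": 1},
    {"Id": "b", "BlockType": "LINE", "Text": "second page", "Page": 2},
    {"Id": "c", "BlockType": "LINE", "Text": "third page", "Page": 3},
]

selected = blocks_for_pages(sample, pages=[1, 3])
print([b["Text"] for b in selected])
```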

Analyze a document with multiple pages

2

Hello. How should we proceed when we have a PDF with multiple pages? Thanks, Alex

alexandruvesa
