amazon-textract-textractor issues

Textract missing most of the text in documents.

1

I am processing some fairly simple PDFs from S3 using Textract document text detection. For most of these documents, the returned JSON contains very little text. For example, using the PDF...

geoffwalmsley

Text is extracted but not grouped into forms and tables correctly

1

We're starting an invoice processing project and really like this library, but we're having one interesting issue: The text is all parsed correctly, but then it is not always grouped...

sichen1234

Currency symbols not identifying properly

1

Currency symbols are not identified properly. The pound symbol (£) is recognised as E.

pstibu-gmail

Textractor refactoring

This PR introduces a new way to use Textract and process its output in Python. It provides redesigned APIs for Text, Tables, Forms, Expense and AnalyzeID to improve developer productivity,...

Belval

call_textract call_mode force_sync not calling sync for PDF/TIFF

schadem

JPEG conversion in `analyze_document` significantly impacts table predictions

1

When obtaining predictions through `analyze_document`, the image is converted to JPEG (https://github.com/aws-samples/amazon-textract-textractor/blob/master/textractor/textractor.py#L845). The compression is enough to degrade the table predictions. We should check and keep the format, assuming that...

Belval

bug
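The "check and keep the format" idea from this issue can be sketched as a byte-signature check that decides whether re-encoding is needed at all. This is a minimal illustration, not the library's code: `sniff_format` and `needs_reencoding` are hypothetical helper names, and the format list assumes the inputs Textract accepts (PNG, JPEG, PDF, TIFF).

```python
from typing import Optional

# Leading "magic bytes" of the formats assumed to be accepted as-is.
_SIGNATURES = {
    "PNG": (b"\x89PNG\r\n\x1a\n",),
    "JPEG": (b"\xff\xd8\xff",),
    "PDF": (b"%PDF",),
    "TIFF": (b"II*\x00", b"MM\x00*"),  # little-endian and big-endian TIFF
}


def sniff_format(data: bytes) -> Optional[str]:
    """Return the detected format name, or None if the header is unrecognised."""
    for name, prefixes in _SIGNATURES.items():
        if any(data.startswith(prefix) for prefix in prefixes):
            return name
    return None


def needs_reencoding(data: bytes) -> bool:
    """True only when the bytes are not already in one of the known formats.

    Hypothetical policy: a PNG stays a PNG, so no lossy JPEG step is applied.
    """
    return sniff_format(data) is None
```

With a check like this, a caller could pass PNG bytes through untouched and only convert formats that are not recognised.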

Layout Linearization Duplicates text and Relegates Tables to the End

8

If you extract both LAYOUT and TABLES, the tables are for some reason printed at the end of the output rather than linearized in place. Related issue: https://github.com/aws-samples/amazon-textract-textractor/issues/274 My code: `from...

kostabasis

Proper way of getting cell content?

5

The codebase has this line: https://github.com/aws-samples/amazon-textract-textractor/blob/28d6110b08a3584edc4c87022a41d12961b88688/textractor/entities/table.py#L640 to retrieve the cell content. But there's already `cell.text`. I tried using `cell.text` but noticed it's inaccurate. *Sometimes* it gives an empty string when...

ttruong-gilead

Large PDF response processing is slow

When processing large PDFs, processing the response after Textract has generated it can be noticeably slow. We should profile the response parser to identify the bottlenecks. This seems to be...

Belval

enhancement

latency
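One way to start on the profiling suggested in this issue is the standard-library `cProfile` module. The sketch below is only an illustration: `profile_call` is a hypothetical helper, and `parse_blocks` is a stand-in for whatever response parser is actually under investigation, not a function from this library.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def parse_blocks(response):
    """Hypothetical stand-in parser: collect the Id of every block."""
    return [block["Id"] for block in response["Blocks"]]


if __name__ == "__main__":
    fake_response = {"Blocks": [{"Id": str(i)} for i in range(100_000)]}
    ids, report = profile_call(parse_blocks, fake_response)
    print(report)
```

Sorting by cumulative time tends to surface the high-level functions that dominate a slow parse before drilling into individual calls.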

Error in get_layout_text_from_json in textractprettyprinter

Encountered this error in several documents because SELECTION_ELEMENT blocks (selection elements inside a table) do not contain the key 'Text'. Noticed that there is an issue with the same problem...

gwynethguo
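The failure described here comes from indexing `block['Text']` on a block type that has no such key. A defensive lookup could look like the sketch below. It is not the fix applied in textractprettyprinter: `block_text` is a hypothetical helper, and the `[X]` / `[ ]` markers are an arbitrary choice for rendering the `SelectionStatus` field that SELECTION_ELEMENT blocks carry.

```python
def block_text(block: dict) -> str:
    """Return printable text for a Textract block without raising KeyError."""
    if block.get("BlockType") == "SELECTION_ELEMENT":
        # Selection elements have a SelectionStatus instead of a Text key.
        return "[X]" if block.get("SelectionStatus") == "SELECTED" else "[ ]"
    # Fall back to an empty string for any other block that lacks Text.
    return block.get("Text", "")
```

For example, `block_text({"BlockType": "WORD", "Text": "Total"})` gives `"Total"`, while a selected checkbox block gives `"[X]"`.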
