Martin Schade comments

Results: 20 comments by Martin Schade

Currency symbols not identifying properly

Amazon Textract is continuously improved, and customer feedback like yours helps with that task. Since your post, the service has been updated, and recognition of some currency-related symbols in particular has improved....

add confidence to list

That is an API-breaking change. I should have added more initially; it was a quick hack to get it out.... Maybe use trp2.convert_queries_to_list_trp2 (https://github.com/aws-samples/amazon-textract-textractor/blob/d324b360dec724fc40bf46fe9f2441e8e403903f/prettyprinter/textractprettyprinter/t_pretty_print.py#L147), or we can add another method....

Not able to extract Textract merge cell text properly

We should add an option to pass in a function that can be used instead of the fixed logic.
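
A minimal sketch of what such an option could look like, assuming a callback that receives the texts of the cells being merged and returns the text for the merged cell. All names here (`MergeTextFn`, `default_merge_text`, `merged_cell_text`) are hypothetical and not part of the library, and the default shown is only a stand-in for the current fixed logic:

```python
from typing import Callable, List, Optional

# Hypothetical callback type: takes the texts of the cells that make up
# a merged cell and returns the text to report for the merged cell.
MergeTextFn = Callable[[List[str]], str]


def default_merge_text(cell_texts: List[str]) -> str:
    """Stand-in for the fixed logic: join non-empty cell texts with a space."""
    return " ".join(t.strip() for t in cell_texts if t.strip())


def merged_cell_text(cell_texts: List[str],
                     merge_fn: Optional[MergeTextFn] = None) -> str:
    """Return the merged cell text, using the caller's function if one is given."""
    fn = merge_fn if merge_fn is not None else default_merge_text
    return fn(cell_texts)
```

A caller who needs different behavior could then pass, for example, `merge_fn=lambda texts: "|".join(texts)` without any change to the default path.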

To add fields to ID Document schema

Interesting. The ID schema is specific to ID documents; the generic Textract APIs AnalyzeDocument and DetectDocumentText (and their async Start* and Get* variants) allow for very flexible definition of documents, which is covered...

Added textType property to Word class

There have been some changes: a BaseBlock was introduced, and TextType is a general property not limited to WORD. I could add that, but would close this PR or you...

Ruby version

Ruby support would be cool. Can you add some tests and a README?

Need an option to save output in UTF-8 encoding to avoid saving as Windows-1252 encoding

Makes sense. Thx! We'll add a separate output option.
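
As a rough illustration of the underlying issue (not the tool's actual output code), Python's `open` falls back to a platform-dependent default encoding in text mode, which is often Windows-1252 on Windows. Passing `encoding` explicitly avoids that. The helper names `save_text` and `load_text` are hypothetical:

```python
def save_text(path: str, text: str, encoding: str = "utf-8") -> None:
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def load_text(path: str, encoding: str = "utf-8") -> str:
    """Read text back using the same explicit encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()
```

Characters such as `₹` have no Windows-1252 representation, so writing them with that encoding would fail, while UTF-8 handles them.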

Text is extracted but not grouped into forms and tables correctly

Sorry for the late response. Could you post a sample image to test?

Textract missing most of the text in documents.

I just ran the document you linked, an 80-page 'Infrastructure Funding and Financing Bill', through Textract and got 31856 words identified, which seems to cover the...

page number is overwritten in function find_phrase_in_lines

Blast from the past... ```find_phrase_in_lines``` (https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L841) was my first implementation for finding a phrase and has essentially been replaced by ```find_phrase_on_page``` (https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L769). I see find_intersect_value still uses the "lines" one...