amazon-textract-textractor issues

page number is overwritten in function find_phrase_in_lines

8

The page number is overwritten inside the for loop if you pass it to the function. In addition, the page number is not used as a search criterion. [Source Code Snipped from...

tb102122

Need an option to save output in UTF-8 encoding to avoid saving as Windows-1252 encoding

1

It looks like the only way to capture the output of amazon-textract is to redirect it into a file, for example: `amazon-textract --input-document "s3://somebucket/2022-04-16-0010.jpg" --pretty-print LINES > 2022-04-16-0010.txt` Unfortunately, this...

lihuib
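For the encoding request above, a minimal sketch of writing extracted lines to disk with an explicit UTF-8 encoding instead of relying on the platform default. The helper name `save_lines_utf8` and its parameters are assumptions made for illustration; they are not part of amazon-textract.

```python
from pathlib import Path
from typing import Iterable


def save_lines_utf8(lines: Iterable[str], destination: str) -> Path:
    """Write text lines to a file, forcing UTF-8 regardless of the OS default.

    Hypothetical helper: the name and signature are illustrative only.
    """
    path = Path(destination)
    # newline="\n" keeps line endings identical across platforms.
    with path.open("w", encoding="utf-8", newline="\n") as handle:
        for line in lines:
            handle.write(line + "\n")
    return path
```

When output has to be redirected from a Python command-line tool, setting the `PYTHONIOENCODING=utf-8` environment variable changes the encoding of Python's standard streams; whether that is enough here depends on how the shell itself handles the redirect.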

How can I order the results as shown in the pdf?

2

PDF: ![Screenshot from 2022-05-05 15-38-43](https://user-images.githubusercontent.com/8030118/166937006-b560ace3-071d-4b15-81e7-555e32aba8ce.png) Example: `python3 textractor.py --documents s3://mybucket/mydoc.pdf --forms` Result: `62692bb61ab53-pdf-page-1-forms.csv` ![Screenshot from 2022-05-05 16-00-35](https://user-images.githubusercontent.com/8030118/166939976-90322dc4-d90a-479d-bb52-d5e19f82c8da.png) How can I order the results this way? ![Captura...

robertdac
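One possible approach to the ordering question above, sketched under assumptions: Textract response blocks carry a `Geometry.BoundingBox` with `Top` and `Left` values relative to the page, so blocks can be sorted top to bottom and then left to right. The function name and the `line_tolerance` parameter are hypothetical, and this is not how textractor.py itself orders its CSV output.

```python
from typing import Any, Dict, List


def sort_blocks_reading_order(
    blocks: List[Dict[str, Any]], line_tolerance: float = 0.01
) -> List[Dict[str, Any]]:
    """Return blocks sorted by page, then top-to-bottom, then left-to-right.

    line_tolerance (an assumed tuning knob) groups blocks whose Top values
    fall into the same bucket, so that items on one visual line stay together.
    """

    def sort_key(block: Dict[str, Any]):
        box = block["Geometry"]["BoundingBox"]
        row_bucket = round(box["Top"] / line_tolerance)
        return (block.get("Page", 1), row_bucket, box["Left"])

    return sorted(blocks, key=sort_key)
```

Bucketing by a fixed tolerance is a simplification: two blocks on the same visual line can still land in adjacent buckets, so the tolerance may need adjusting per document.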

amazon-textract helper does not work on Windows 10

1

After running `python -m pip install amazon-textract-helper`, a file named "amazon-textract" is created in `%LOCALAPPDATA%\Programs\Python\Python38\Scripts`. Note that it is named "amazon-textract", not "amazon-textract.py", so Windows 10 does not know how to execute it...

CGarces
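A possible workaround for the Windows issue above, sketched under the assumption that the extensionless script sits in the interpreter's standard scripts directory: pass the script to the Python interpreter explicitly, so Windows does not need a file association. The helper name `build_command` is hypothetical.

```python
import sys
import sysconfig
from pathlib import Path
from typing import List


def build_command(script_name: str, args: List[str]) -> List[str]:
    """Build a command that runs an installed script through the current interpreter.

    Assumes the script was installed into sysconfig's "scripts" directory;
    per-user installs may place it elsewhere.
    """
    scripts_dir = Path(sysconfig.get_path("scripts"))
    return [sys.executable, str(scripts_dir / script_name), *args]


# Example (not executed here):
# import subprocess
# subprocess.run(build_command("amazon-textract", ["--pretty-print", "LINES"]), check=True)
```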

Repeated data in medical-insights-entities.csv and medical-insights-phi.json

1

I am planning to use Comprehend Medical in production in a new biomedical research product we are working on. I used Textractor to process an 1143 page pdf of a...

crashlurks

[Feature request] Output destination option

1

Thanks for the nice utility! However, my working directory is now an absolute mess 😂 It would be really helpful if something like an `--output` CLI option was available where...

athewsey
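To make the request above concrete, a minimal sketch of how an `--output` option could be parsed and used to place result files. This is an illustration only, not the tool's actual argument handling; `parse_args` and `output_path` are hypothetical names, and only `--documents` is taken from the usage shown in the issues above.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse a reduced, hypothetical command line with an --output directory."""
    parser = argparse.ArgumentParser(description="Sketch of an --output option")
    parser.add_argument("--documents", required=True, help="Input file or S3 URI")
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("."),
        help="Directory for generated files (defaults to the working directory)",
    )
    return parser.parse_args(argv)


def output_path(output_dir: Path, file_name: str) -> Path:
    """Return the destination path for file_name, creating the directory if needed."""
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir / file_name
```

Defaulting to the current directory would keep existing behavior unchanged for anyone who does not pass the new option.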

S3 folder limited to one 'page' of objects

[In textractor.py](https://github.com/aws-samples/amazon-textract-textractor/blob/ea5019475bb71b2adb1ad880f8d48b0f2b4e932f/src/textractor.py#L65), we currently seem to hard-code a limit of one page of S3 objects when calling `S3Helper.getFileNames()` to list the objects in an S3 folder input - even...

athewsey
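For context on the limit described above, a sketch of listing every object under a prefix by following continuation tokens. It assumes a boto3-style S3 client exposing `list_objects_v2`; the function name `iter_object_keys` is hypothetical, and this is not the code in `S3Helper.getFileNames()`.

```python
from typing import Any, Dict, Iterator


def iter_object_keys(s3_client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield every object key under prefix, requesting pages until none remain.

    s3_client is assumed to behave like boto3.client("s3").
    """
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        for item in response.get("Contents", []):
            yield item["Key"]
        # IsTruncated signals that another page of results is available.
        if not response.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
```

With boto3 installed, `list(iter_object_keys(boto3.client("s3"), "mybucket", "folder/"))` would collect all keys; boto3's built-in `get_paginator("list_objects_v2")` is an alternative to the manual loop.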

[Feature request] Preserve folder structure

When applying textractor to a local folder or S3 prefix with an inner folder structure, it would be really useful if output files were also mapped to the same folder...

athewsey
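The mapping requested above could look roughly like the following sketch, which mirrors an input's path relative to the input root under an output root. The function name, its parameters, and the example suffix are assumptions for illustration; textractor.py does not necessarily name its outputs this way.

```python
from pathlib import PurePosixPath


def mirrored_output_key(
    input_key: str, input_root: str, output_root: str, suffix: str
) -> str:
    """Map an input path to an output path that keeps the same inner folders.

    Uses POSIX-style paths so the same logic applies to S3 keys.
    Raises ValueError if input_key is not located under input_root.
    """
    relative = PurePosixPath(input_key).relative_to(PurePosixPath(input_root))
    output_name = relative.stem + suffix
    return str(PurePosixPath(output_root) / relative.parent / output_name)
```

For example, `mirrored_output_key("docs/2022/april/scan.pdf", "docs", "out", "-response.json")` yields `out/2022/april/scan-response.json`.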

Translation JSON response

When we use --translate, we get the translation for each page, but the consolidated JSON response (-response.json) is not translated. How can the translation be generated in the final JSON as...

rgrajan

enhancement

File name with space

If a file name contains a space, the file is not processed. The loop continues with "IN PROGRESS" indefinitely. I have tested this in two environments, with the same behavior.

rgrajan
