amazon-textract-textractor issues

Caller: allow early return when job incomplete

1

I'm using Textract in a web application. I'm enqueuing jobs using `extractor.start_document_analysis` and storing the returned job ID in a database. Later, I call `extractor.get_result(job_id)` to get the response for...

symroe

enhancement
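The early return requested above can be approximated outside Textractor by checking the job status before fetching results. This is a minimal sketch, assuming a boto3-style Textract client whose `get_document_analysis(JobId=...)` response carries a `JobStatus` field. `fetch_if_ready` is a hypothetical helper and not part of Textractor.

```python
from typing import Any, Optional


def fetch_if_ready(client: Any, job_id: str) -> Optional[dict]:
    """Return the first response page if the job has finished, else None.

    `client` is assumed to behave like boto3.client("textract").
    """
    response = client.get_document_analysis(JobId=job_id)
    status = response.get("JobStatus")
    if status == "IN_PROGRESS":
        # Job not finished yet: return early instead of blocking.
        return None
    if status == "FAILED":
        raise RuntimeError(response.get("StatusMessage", "Textract job failed"))
    # SUCCEEDED or PARTIAL_SUCCESS: hand the response back to the caller.
    return response
```

A web request handler could call this with the stored job ID and respond with "still processing" when it gets `None`. Only the first page of results is returned here; later pages would have to be fetched with the `NextToken` from the response.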

[Doc] Documentation of Linearizable and their methods, e.g., get_text(config)

1

The `Document` class has a `get_text(config: TextLinearizationConfig)` method, as shown in cell 19 of the example [Using Layout Analysis for Text Linearization](https://aws-samples.github.io/amazon-textract-textractor/notebooks/layout_analysis_for_text_linearization.html). ``` from textractor.data.text_linearization_config import TextLinearizationConfig config = TextLinearizationConfig( hide_figure_layout=True, title_prefix="# ", section_header_prefix="##...

oonisim

enhancement

documentation

[Doc] BoundingBox coordinate unit and scale are unclear

2

[class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None)](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html?highlight=tabletitle#textractor.entities.bbox.BoundingBox) says: > Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By...

oonisim
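To make the unit question concrete: the raw Textract API reports bounding boxes as ratios of the page width and height, so values fall between 0 and 1. The sketch below assumes that normalized convention and converts a box to pixel units. `NormalizedBox` and `to_pixels` are hypothetical names for illustration, not Textractor's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormalizedBox:
    """Box whose fields are fractions of the page size (assumed 0..1)."""
    x: float
    y: float
    width: float
    height: float


def to_pixels(box: NormalizedBox, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Scale a normalized box to (x, y, width, height) in pixels."""
    return (
        round(box.x * page_width),
        round(box.y * page_height),
        round(box.width * page_width),
        round(box.height * page_height),
    )
```

For example, a box of `(0.25, 0.5, 0.5, 0.25)` on a 1000 by 800 pixel page starts 250 pixels from the left and 400 pixels from the top.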

The key property of the KeyValue class does not return Line instance

3

The documentation for the KeyValue class [key](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html#textractor.entities.key_value.KeyValue.key) property says that it returns `Line`. > Return type. [Line](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html#textractor.entities.line.Line) However, it returns a Python List\[Word\], and there is no [text](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html#textractor.entities.line.Line.text) property available which the Line class to...

oonisim
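If the property does return a list of words, as the report above states, the key text can still be assembled by hand. This is a minimal sketch using a stand-in `Word` dataclass. Both `Word` and `words_to_text` are hypothetical here and only encode the assumption that each word exposes a `text` attribute.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Word:
    """Stand-in for a word entity; only the `text` attribute is assumed."""
    text: str


def words_to_text(words: Iterable[Word]) -> str:
    """Join the text of each word with single spaces, in list order."""
    return " ".join(word.text for word in words)
```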

Mistake a text field above a table as table title

2

# Problem A text field is mistaken for the table title. ## Environment ``` import textractor textractor.__version__ ----- '1.7.4' from platform import python_version print(python_version()) ----- 3.10.10 ``` ## Reproduction ``` import os...

oonisim

Overlayer broken with DocumentDimension not subscriptable

1

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[45], line 12
      9 document_dimension:DocumentDimensions = DocumentDimensions(doc_width=image.size[0], doc_height=image.size[1])
     10 overlay=[Textract_Types.WORD, Textract_Types.CELL]
---> 12 bounding_box_list = get_bounding_boxes(textract_json=doc, document_dimensions=document_dimension, overlay_features=overlay)
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/textractoverlayer/t_overlay.py:103, in get_bounding_boxes(textract_json,...

miluna8

Add LazyObject to lazy load pdf to image conversion

*Issue #, if available:* #316 (related) *Description of changes:* This PR would make the PDF-to-image conversion lazy (it only runs when a user accesses the image) to avoid...

Belval
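The lazy-loading idea described in this PR can be illustrated with a small wrapper that defers an expensive call until the value is first requested. This is a sketch of the general pattern only. The `Lazy` class below is hypothetical and is not the `LazyObject` from the PR, whose implementation is not shown in this listing.

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class Lazy(Generic[T]):
    """Run `loader` on first access and cache its result."""

    def __init__(self, loader: Callable[[], T]) -> None:
        self._loader = loader
        self._value: Optional[T] = None
        self._loaded = False

    def get(self) -> T:
        if not self._loaded:
            # The expensive work (e.g. rendering a PDF page) happens here, once.
            self._value = self._loader()
            self._loaded = True
        return self._value  # type: ignore[return-value]
```

Constructing a `Lazy` costs nothing. The loader runs only if `get()` is called, and repeated calls reuse the cached value.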

For textractor.entities.line.Line - visualize() breaks

1

When trying to visualize "Line" objects I am getting: ``` 106 return EntityList(list(set(new_entity_list))).visualize( 107 with_text=with_text, 108 with_words=with_words, 109 with_confidence=with_confidence, 110 font_size_ratio=font_size_ratio, 111 ) 112 elif len(self) > 0 and self[0].bbox.spatial_object.image...

h55nick

Issue with multipage PDFs on s3 without extension

2

Hello, first of all thanks for the awesome package. I am currently having an issue trying to run textractor on my PDFs that are stored in s3. The issue stems...

lvieirajr
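The preview above is cut off before the root cause is stated, so the following is only a general illustration of recognizing a PDF without relying on a file extension. It checks for the `%PDF-` signature that PDF files start with. `looks_like_pdf` is a hypothetical helper, not part of Textractor.

```python
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(head: bytes) -> bool:
    """Return True if the leading bytes carry the PDF signature."""
    return head.startswith(PDF_MAGIC)
```

For an object in S3, the leading bytes could be obtained with a ranged read (for instance, boto3's `get_object` accepts a `Range` argument) so the whole file does not need to be downloaded.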

ci: Use GITHUB_OUTPUT envvar instead of set-output command

`save-state` and `set-output` commands used in GitHub Actions are deprecated and [GitHub recommends using environment files](https://github.blog/changelog/2023-07-24-github-actions-update-on-save-state-and-set-output-commands/). This PR updates the usage of `::set-output` to `"$GITHUB_OUTPUT"` Instructions for envvar usage from...

arunsathiya
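The environment-file mechanism this PR moves to works by appending `name=value` lines to the file whose path is stored in `GITHUB_OUTPUT`. A minimal Python sketch of that mechanism follows. `set_output` is a hypothetical helper name, and the sketch assumes single-line values.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append `name=value` to the file named by the GITHUB_OUTPUT variable."""
    with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

Multi-line values need the delimiter syntax described in the GitHub Actions documentation and are not handled here.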
