amazon-textract-textractor
Analyze documents with Amazon Textract and generate output in multiple formats.
I'm using Textract in a web application. I'm enqueuing jobs using `extractor.start_document_analysis` and storing the returned job ID in a database. Later, I call `extractor.get_result(job_id)` to get the response for...
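The enqueue-then-fetch workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `jobs` table, the helper names, and the `fetch_result` callable are hypothetical, with `fetch_result` standing in for a call such as `extractor.get_result(job_id)`; no AWS call is made here.

```python
import sqlite3
from typing import Any, Callable, Dict


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per enqueued Textract job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )


def record_job(conn: sqlite3.Connection, job_id: str) -> None:
    # Store the job ID returned when the analysis was started.
    conn.execute("INSERT INTO jobs (job_id, status) VALUES (?, 'PENDING')", (job_id,))
    conn.commit()


def collect_results(
    conn: sqlite3.Connection, fetch_result: Callable[[str], Any]
) -> Dict[str, Any]:
    # fetch_result is an assumed stand-in for extractor.get_result(job_id).
    results: Dict[str, Any] = {}
    rows = conn.execute("SELECT job_id FROM jobs WHERE status = 'PENDING'").fetchall()
    for (job_id,) in rows:
        results[job_id] = fetch_result(job_id)
        conn.execute("UPDATE jobs SET status = 'DONE' WHERE job_id = ?", (job_id,))
    conn.commit()
    return results
```

Passing the fetch function in as a parameter keeps the database bookkeeping independent of the Textract client, so it can be exercised without network access.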
The ```Document``` class has a ```get_text(config: TextLinearizationConfig)``` method, as shown in cell 19 of the example [Using Layout Analysis for Text Linearization](https://aws-samples.github.io/amazon-textract-textractor/notebooks/layout_analysis_for_text_linearization.html). ``` from textractor.data.text_linearization_config import TextLinearizationConfig config = TextLinearizationConfig( hide_figure_layout=True, title_prefix="# ", section_header_prefix="##...
[class textractor.entities.bbox.BoundingBox(x: float, y: float, width: float, height: float, spatial_object=None)](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html?highlight=tabletitle#textractor.entities.bbox.BoundingBox) says: > Represents the bounding box of an object in the format of a dataclass with (x, y, width, height). By...
The documentation for the KeyValue class [key](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html#textractor.entities.key_value.KeyValue.key) property says that it returns ```Line```. > Return type. [Line](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html#textractor.entities.line.Line) However, it returns a Python List\[Word\], and there is no [text](https://aws-samples.github.io/amazon-textract-textractor/textractor.entities.html#textractor.entities.line.Line.text) property available which the Line class to...
# Problem Mistaking a text field as table title. ## Environment ``` import textractor textractor.__version__ ----- '1.7.4' from platform import python_version print(python_version()) ----- 3.10.10 ``` ## Reproduction ``` import os...
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) Cell In[45], line 12 9 document_dimension:DocumentDimensions = DocumentDimensions(doc_width=image.size[0], doc_height=image.size[1]) 10 overlay=[Textract_Types.WORD, Textract_Types.CELL] ---> 12 bounding_box_list = get_bounding_boxes(textract_json=doc, document_dimensions=document_dimension, overlay_features=overlay) File ~/anaconda3/envs/python3/lib/python3.10/site-packages/textractoverlayer/t_overlay.py:103, in get_bounding_boxes(textract_json,...
*Issue #, if available:* #316 (related) *Description of changes:* This PR would make the PDF-to-image conversion lazy (it only runs when a user accesses the image) to avoid...
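The lazy-conversion idea can be illustrated with a small sketch. This is not the PR's actual code: `LazyPages`, its `images` property, and `fake_converter` are hypothetical names, and the converter is a stand-in for whatever PDF rasterizer is used. It only shows the general pattern of deferring work until first access.

```python
from functools import cached_property
from typing import Callable, List


class LazyPages:
    """Holds a PDF path and converts it to images only on first access."""

    def __init__(self, pdf_path: str, converter: Callable[[str], List[object]]):
        self.pdf_path = pdf_path
        self._converter = converter  # hypothetical stand-in for a PDF rasterizer

    @cached_property
    def images(self) -> List[object]:
        # Runs the converter once, on first access; the result is cached afterwards.
        return self._converter(self.pdf_path)


calls: List[str] = []


def fake_converter(path: str) -> List[object]:
    # Records each invocation so the laziness is observable.
    calls.append(path)
    return [f"{path}#page1", f"{path}#page2"]


# Constructing the object does not trigger any conversion.
pages = LazyPages("report.pdf", fake_converter)
```

With this pattern, users who never read `images` never pay the conversion cost, and repeated reads reuse the cached result.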
When trying to visualize "Line" objects I am getting: ``` 106 return EntityList(list(set(new_entity_list))).visualize( 107 with_text=with_text, 108 with_words=with_words, 109 with_confidence=with_confidence, 110 font_size_ratio=font_size_ratio, 111 ) 112 elif len(self) > 0 and self[0].bbox.spatial_object.image...
Hello, first of all thanks for the awesome package. I am currently having an issue trying to run textractor on my PDFs that are stored in s3. The issue stems...
`save-state` and `set-output` commands used in GitHub Actions are deprecated and [GitHub recommends using environment files](https://github.blog/changelog/2023-07-24-github-actions-update-on-save-state-and-set-output-commands/). This PR updates the usage of `::set-output` to `"$GITHUB_OUTPUT"` Instructions for envvar usage from...
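For a step that runs Python, the environment-file approach might look like the sketch below. The helper name `set_output` is hypothetical and not taken from the PR; the sketch assumes single-line values, which are appended as `name=value` to the file whose path GitHub Actions provides in the `GITHUB_OUTPUT` environment variable.

```python
import os


def set_output(name: str, value: str) -> None:
    """Append a step output as name=value to the file named by GITHUB_OUTPUT."""
    # Assumes a single-line value; multi-line values need GitHub's delimiter syntax.
    output_path = os.environ["GITHUB_OUTPUT"]
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")
```

Opening the file in append mode matters because earlier commands in the same step may already have written outputs to it.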