heuristic_line_break_threshold, along with other heuristic constants not doing anything

Open kostabasis opened this issue 1 year ago • 4 comments

I noticed that even when testing extreme values of heuristic_line_break_threshold, heuristic_overlap_ratio, and heuristic_h_tolerance there was no change in the output. This led me to examine their use in the library, and it appears heuristic_line_break_threshold is never once utilized outside of the class parameter declaration. The other two are used, but still do nothing. Were these features simply released prematurely, or am I doing something wrong?

This is my basic logic for getting output:

`from textractor import Textractor
from textractor.data.constants import TextractFeatures
from textractor.data.text_linearization_config import TextLinearizationConfig

extractor = Textractor(profile_name="default")

# png_path is the path to the input image, defined elsewhere
document = extractor.analyze_document(
    file_source=png_path,
    features=[TextractFeatures.TABLES, TextractFeatures.SIGNATURES, TextractFeatures.LAYOUT],
    save_image=True,
)

config = TextLinearizationConfig(
    title_prefix="# ",
    section_header_prefix="## ",
    add_prefixes_and_suffixes_in_text=True,
    heuristic_line_break_threshold=0.9,
    heuristic_overlap_ratio=0.5,
    table_linearization_format="markdown",
)
print(document.get_text(config=config))`
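For anyone wanting to reproduce the "no change in the output" observation systematically, here is a minimal sketch that overrides one parameter at a time and reports whether the rendered text differs from a baseline. The helper `params_that_change_output` and the stand-in `fake_render` are hypothetical names invented for this illustration; they are not part of textractor.

```python
from typing import Any, Callable, Dict


def params_that_change_output(
    render: Callable[..., str],
    baseline: Dict[str, Any],
    extremes: Dict[str, Any],
) -> Dict[str, bool]:
    """For each parameter in `extremes`, report whether overriding it
    (one at a time, on top of `baseline`) changes what `render` returns."""
    reference = render(**baseline)
    report = {}
    for name, value in extremes.items():
        overridden = {**baseline, name: value}
        report[name] = render(**overridden) != reference
    return report


# Hypothetical stand-in renderer, for demonstration only:
# it honours `sep` and silently ignores `threshold`.
def fake_render(sep: str = " ", threshold: float = 0.5) -> str:
    return sep.join(["alpha", "beta"])


report = params_that_change_output(
    fake_render,
    baseline={"sep": " ", "threshold": 0.5},
    extremes={"sep": "\n", "threshold": 100.0},
)
print(report)
```

With a real document, one could presumably pass something like `lambda **kw: document.get_text(config=TextLinearizationConfig(**kw))` as `render`. A parameter that reports `False` even at extreme values is a candidate for being unused, though it may also simply not apply to that particular document.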

kostabasis, Jan 17 '24 20:01

Good catch, this should be addressed by #298

Belval, Jan 18 '24 22:01

Hey, is there any chance this is already fixed in the PR? I tried testing the current PR locally with the commands below, and heuristic_line_break_threshold still doesn't seem to be used anywhere.

git clone git@github.com:aws-samples/amazon-textract-textractor.git
cd amazon-textract-textractor
git checkout origin version-1.7.0
pip install -e .
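One way to check whether an identifier is referenced anywhere in a checkout is to scan the Python sources for it. The following is a minimal sketch using only the standard library; `find_usages` is a hypothetical helper written for this illustration, and the temporary `pkg` directory is just a stand-in for a real repository.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def find_usages(root: str, identifier: str) -> List[Tuple[str, int, str]]:
    """Return (file path, 1-based line number, stripped line) for every
    line under `root` in a .py file that contains `identifier`."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if identifier in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


# Demonstration on a throwaway directory: the identifier is declared in
# one file and never referenced in the other.
with tempfile.TemporaryDirectory() as tmp:
    pkg = Path(tmp) / "pkg"
    pkg.mkdir()
    (pkg / "config.py").write_text(
        "heuristic_line_break_threshold: float = 0.9\n", encoding="utf-8"
    )
    (pkg / "layout.py").write_text(
        "def linearize():\n    return ''\n", encoding="utf-8"
    )
    demo_hits = find_usages(tmp, "heuristic_line_break_threshold")
    demo_files = [Path(p).name for p, _, _ in demo_hits]

print(demo_files)
```

Run from the repository root, a call such as `find_usages("textractor", "heuristic_line_break_threshold")` (assuming the package lives in a `textractor` directory) would list every occurrence. If the only hit is the parameter declaration, the value is likely never read on that branch.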

kostabasis, Jan 26 '24 18:01

See https://github.com/aws-samples/amazon-textract-textractor/pull/298/commits/e9da3b0438598b3e2f99f810d1893d1ff65c2125

Belval, Jan 26 '24 18:01

I see, thank you. It was an issue on my end.

kostabasis, Jan 26 '24 18:01