amazon-textract-textractor
page number is overwritten in function find_phrase_in_lines
The page number passed to the function is overwritten inside the for loop. In addition, the page number is not used as a search criterion.
Source code snippet from line 1091 ff.:
def find_phrase_in_lines(
    self, phrase: str, min_textdistance=0.6, page_number: int = 1
) -> List[TWord]:
    """
    phrase = words seperated by space char
    """
    # first check if we already did find this phrase and stored it in the DB
    # TODO: Problem: it will not find Current: when the phrase has current and there are other current values in the document without :
    if not phrase:
        raise ValueError(f"no valid phrase: '{phrase}")
    phrase_words = phrase.split(" ")
    if len(phrase_words) < 1:
        raise ValueError(f"no valid phrase: '{phrase}")
    # TODO: check for page_number impl
    found_phrases: "list[TWord]" = self.ocrdb.select_text(
        textract_doc_uuid=self.textract_doc_uuid,
        text=make_alphanum_and_lower_for_non_numbers(phrase),
    )
    print("after ocrdb.select_text")
    if found_phrases:
        print("phrases found")
        return found_phrases
    alphanum_regex = re.compile(r"[\W_]+")
    # find phrase (words that follow each other) in trp lines
    for page in self.doc.pages:
        page_number = 1
        for line in page.lines:
            ......
@schadem I would suggest extending the function signature to accept an AreaSelection so that it can be passed into the call self.ocrdb.select_text(textract_doc_uuid=self.textract_doc_uuid, text=make_alphanum_and_lower_for_non_numbers(phrase)).
In the for loop I would remove line 1117.
Let me know if this is correct, then I will work on the PR.
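A minimal sketch of the loop change described above, not the actual patch. It uses stand-in Page and Line dataclasses instead of the real trp objects (both hypothetical), takes the page index from enumerate so the page_number argument is never reassigned, and skips pages other than the requested one. Filtering by page inside the loop is an assumption about how the page number could become a search criterion.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Line:
    # hypothetical stand-in for a trp line
    text: str


@dataclass
class Page:
    # hypothetical stand-in for a trp page
    lines: List[Line] = field(default_factory=list)


def find_phrase_in_lines(
    pages: List[Page], phrase: str, page_number: int = 1
) -> List[Tuple[int, str]]:
    """Return (page index, line text) pairs for matching lines on page_number."""
    if not phrase:
        raise ValueError(f"no valid phrase: '{phrase}'")
    matches = []
    # enumerate supplies the 1-based page index, so the page_number
    # argument is never overwritten inside the loop
    for current_page, page in enumerate(pages, start=1):
        if current_page != page_number:
            continue
        for line in page.lines:
            if phrase.lower() in line.text.lower():
                matches.append((current_page, line.text))
    return matches


pages = [Page([Line("Total: 10")]), Page([Line("Total: 20")])]
result = find_phrase_in_lines(pages, "total", page_number=2)
```

With the original reassignment in place, the caller's page_number would be lost as soon as the loop starts; here the second page is the only one searched.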
blast from the past...
The find_phrase_in_lines
https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L841
was my first implementation to find a phrase and has essentially been replaced by find_phrase_on_page
https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L769
I see find_intersect_value still uses the "lines" one like here, but I think that can be replaced with the phrases one
https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L320
Tests use the lines method as well.
Essentially, the "lines" method iterates over the trp object to find a match, whereas find_phrase_on_page uses the in-memory SQLite database. Unless you find a good use for the lines method, I would recommend removing it. The "area" related methods all go back to the DB anyway.
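To illustrate the DB-backed lookup in general terms, here is a small sketch with Python's sqlite3 module. The word table, its columns and the select_text helper are hypothetical and do not reflect the schema or API of the library's ocrdb; the point is only that a page filter becomes one extra WHERE condition.

```python
import sqlite3
from typing import List, Optional, Tuple

# in-memory database with a hypothetical word table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word (doc_uuid TEXT, page_number INTEGER, text TEXT)"
)
conn.executemany(
    "INSERT INTO word VALUES (?, ?, ?)",
    [
        ("doc-1", 1, "total"),
        ("doc-1", 2, "total"),
        ("doc-1", 2, "current"),
    ],
)


def select_text(
    conn: sqlite3.Connection,
    doc_uuid: str,
    text: str,
    page_number: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Return (page_number, text) rows, optionally restricted to one page."""
    query = "SELECT page_number, text FROM word WHERE doc_uuid = ? AND text = ?"
    params: list = [doc_uuid, text]
    if page_number is not None:
        # the page number becomes part of the search criteria
        query += " AND page_number = ?"
        params.append(page_number)
    query += " ORDER BY page_number"
    return conn.execute(query, params).fetchall()


all_pages = select_text(conn, "doc-1", "total")
page_two = select_text(conn, "doc-1", "total", page_number=2)
```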
@tb102122 Thoughts?
@schadem Yes, that sounds like a good approach. I have added a deprecation warning for now so that we don't introduce breaking changes for other users.
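A deprecation warning of this kind is usually emitted with the standard warnings module. The sketch below is an assumption about how it could look, not the code from the PR: both functions are simplified placeholders, and delegating from the old function to the new one is a design choice made here for illustration.

```python
import warnings
from typing import List, Tuple


def find_phrase_on_page(phrase: str, page_number: int = 1) -> List[Tuple[int, str]]:
    # placeholder body; the real method queries the in-memory DB
    return [(page_number, phrase)]


def find_phrase_in_lines(phrase: str, page_number: int = 1) -> List[Tuple[int, str]]:
    # stacklevel=2 attributes the warning to the caller, not to this function
    warnings.warn(
        "find_phrase_in_lines is deprecated; use find_phrase_on_page instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return find_phrase_on_page(phrase, page_number=page_number)
```

Existing callers keep working and only see the warning, which avoids a breaking change until the function is removed.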
@schadem I found one scenario that does not work for the function "find_phrase_on_page". If you look for a phrase like "Seite 1 und 2 der Kalkulation", the result is not returned correctly. As far as I can see, this happens because of the string cleaning in line 786. At least it is not found in the line search; in the search via the words it works.
https://github.com/aws-samples/amazon-textract-textractor/blob/4b1e55426fc7fa623afcf210a2e3f5b51edc614c/tpipelinegeofinder/textractgeofinder/tgeofinder.py#L783-L788
My suggestion would be to add a flag "clean phrase" with default True; when it is False, we pass in the phrase without cleaning, just in lower case. What do you think?
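A rough sketch of what such a flag could look like. The prepare_phrase helper and the clean_phrase parameter name are hypothetical, and the cleaning step is a simple stand-in built on the [\W_]+ regex from the snippet above; it does not reproduce make_alphanum_and_lower_for_non_numbers or the failure described for the sample phrase.

```python
import re

# same pattern as alphanum_regex in the original snippet
ALPHANUM_REGEX = re.compile(r"[\W_]+")


def prepare_phrase(phrase: str, clean_phrase: bool = True) -> str:
    """Normalize a search phrase; skip the cleaning step when clean_phrase is False."""
    if not phrase:
        raise ValueError(f"no valid phrase: '{phrase}'")
    if not clean_phrase:
        # only lower-case, keep punctuation and digits untouched
        return phrase.lower()
    # stand-in cleaning: strip non-alphanumeric characters per word
    words = [ALPHANUM_REGEX.sub("", word).lower() for word in phrase.split(" ")]
    return " ".join(word for word in words if word)


raw = prepare_phrase("Seite 1 und 2 der Kalkulation", clean_phrase=False)
cleaned = prepare_phrase("Current:")
uncleaned = prepare_phrase("Current:", clean_phrase=False)
```

Keeping the default at True preserves the current behaviour for existing callers.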
Interesting. Do you have a sample page or Textract JSON for the "Seite 1 und 2 der Kalkulation"?
I can only share a very stripped-down version, if that works for you, since the original documents contain a lot of PII details. Let me know if that helps.
Thx. Any example I can build a unit test for helps.
@schadem Sorry, it took a bit longer to get the version without PII details. Sample_redacted.pdf