amazon-textract-textractor

Text field above a table is mistaken for the table title

Open oonisim opened this issue 5 months ago • 2 comments

Problem

Textractor mistakes a text field located above a table for the table's title.

image

Environment

import textractor
textractor.__version__
-----
'1.7.4'

from platform import python_version
print(python_version())
-----
3.10.10

Reproduction

import pathlib
from platform import python_version

import textractor
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Report the library and interpreter versions used for this reproduction.
print(textractor.__version__)
print(python_version())


DATA_DIR=pathlib.Path.home().joinpath("home/repository/data/ml/medical_report/pdf")
FILEPATH=DATA_DIR.joinpath("MedicalExaminerReportExample_13.pdf")


extractor = Textractor(profile_name="eml-ap-southeast-2")
document = extractor.analyze_document(
    file_source=str(FILEPATH),
    features=[
        TextractFeatures.LAYOUT,
        TextractFeatures.FORMS,
        TextractFeatures.TABLES,
    ],
    save_image=True,  # Keep page images so document.images and visualize() work.
)

table = document.tables[0]
table.visualize()

print(table.title.text)
-----
'Addendum: On 02/28/2012 at approximately 1230 hours, FI Malphurs received a fax from Sanford Police Department confirming positive identification as: Trayvon Martin, 17yoa B/M, DOB: 02/05/1995. The Identification was made by his father from a crime scene photograph. ECC and MEO staff were notified and an Identification sheet was completed. TSM'
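
Until the detection itself is fixed, one way to guard against this downstream is to treat an implausibly long title as suspect. A minimal sketch, assuming a hypothetical helper `looks_like_table_title` and arbitrary thresholds (neither is part of Textractor, and the limits would need tuning per document type):

```python
def looks_like_table_title(text: str, max_words: int = 15, max_chars: int = 120) -> bool:
    """Heuristic check (hypothetical, not a Textractor API).

    Real table titles tend to be short; a full paragraph such as the
    addendum text above is unlikely to be one.
    """
    stripped = text.strip()
    if not stripped:
        return False  # An empty string is not a usable title.
    # Reject anything longer than the (assumed) word or character limits.
    return len(stripped.split()) <= max_words and len(stripped) <= max_chars


# Example with plain strings; in the reproduction above the input would be
# table.title.text.
print(looks_like_table_title("Toxicology Results"))
print(looks_like_table_title(
    "Addendum: On 02/28/2012 at approximately 1230 hours, FI Malphurs "
    "received a fax from Sanford Police Department confirming positive "
    "identification."
))
```

A length check like this only flags the symptom; it does not tell whether the block was a form field or a layout paragraph, so it is a stopgap rather than a fix.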

PDF

oonisim, Mar 01 '24 07:03