img2table PDF table.box is inaccurate?

Open grahama1970 opened this issue 5 months ago • 2 comments

Hi. I'm trying to align the bounding boxes from img2table's PDF (text extraction) method below with PyMuPDF's bounding boxes. The bounding box from img2table's Image module is reasonably accurate and can be correlated with PyMuPDF's bounding box, but the one from the PDF module is off. Is this a known issue, or is there a workaround?

PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
img2table bounding box (PDF module): (201, 201, 1503, 1328)

Much appreciation in advance

Extra for debugging:

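img2table using the PDF (text extraction) module:

from img2table.document import PDF
from img2table.ocr import TesseractOCR

# Assumed setup (not shown in the original post); pdf_path is the PDF under test
pdf = PDF(src=pdf_path)
tesseract_ocr = TesseractOCR()

# Extract tables
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

extracted_tables

The table extracted with img2table is: bbox = (201, 201, 1503, 1328)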

PyMuPDF:

import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
# Note: starts at index 1, so the first page (index 0) is skipped
for page_num in range(1, len(doc)):
    page = doc[page_num]
    tabs = page.find_tables()  # detect the tables

    # print(page_num, tabs)
    print(page.rect.height)  # page height in PDF points
    for i, tab in enumerate(tabs):  # iterate over all tables
        # outline each header cell in red
        for cell in tab.header.cells:
            page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)
        print(f"  Table bbox: {tab.bbox}")
        # outline the whole table in green
        page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])
        print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

The table extracted with PyMuPDF is: bbox = (72.0375, 72.0625, 540.4875, 561.0)
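
One possible explanation for the mismatch is a difference in units: PyMuPDF reports coordinates in PDF points (72 per inch), while the img2table values look like pixels on a rendered page image. The sketch below rescales the img2table box under the assumption that the page was rendered at 200 DPI. That DPI value is a guess that happens to fit these numbers and is not confirmed anywhere in this thread, and `pixels_to_points` is a hypothetical helper, not part of either library.

```python
def pixels_to_points(bbox, dpi=200.0):
    """Rescale an (x0, y0, x1, y1) box from rendered pixels to PDF points.

    dpi is the resolution the page image was rendered at. 200 is an
    assumption here, not a value confirmed for the PDF module.
    """
    scale = 72.0 / dpi  # PDF points per pixel
    return tuple(round(v * scale, 2) for v in bbox)


img2table_bbox = (201, 201, 1503, 1328)             # from the PDF module
pymupdf_bbox = (72.0375, 72.0625, 540.4875, 561.0)  # from find_tables()

converted = pixels_to_points(img2table_bbox)

# Per-coordinate difference between the rescaled box and PyMuPDF's box
diffs = tuple(round(c - p, 2) for c, p in zip(converted, pymupdf_bbox))

print("img2table bbox in points:", converted)
print("difference vs PyMuPDF:   ", diffs)
```

Under this assumption the left, top, and right edges land within about one point of PyMuPDF's values. The bottom edge still differs by a wide margin (roughly 478 vs 561), which scaling alone does not explain; the two libraries may simply be detecting different table extents.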

Sep 23 '24 13:09 grahama1970
