img2table

PDF table.box is inaccurate?

Open grahama1970 opened this issue 5 months ago • 2 comments

Hi. I'm trying to align the table bounding box returned by the PDF (text extraction) method below with the bounding box PyMuPDF reports for the same table. The bounding box from the Img2TableImage module is reasonably accurate and can be correlated with PyMuPDF's bounding box. The bounding box from the PDF module is off. Is this a known issue, or is there a workaround?

PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
Image2Table bounding box (PDF module): (201, 201, 1503, 1328)

Much appreciation in advance

Extra for debugging:

Image2Table using the PDF (text extraction) module.

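from img2table.document import PDF
from img2table.ocr import TesseractOCR

# Setup assumed here: pdf_path is the path to the PDF file
pdf = PDF(src=pdf_path)
tesseract_ocr = TesseractOCR(lang="eng")

# Extract tables
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)

extracted_tables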

Extracted Image2Table table is: bbox = (201, 201, 1503, 1328)

PyMuPDF:

import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
for page_num in range(1, len(doc)):  # starts at the second page (index 1)
    page = doc[page_num]
    tabs = page.find_tables()  # detect the tables

    # print(page_num, tabs)
    print(page.rect.height)  # page height in points
    for i, tab in enumerate(tabs):  # iterate over all tables
        for cell in tab.header.cells:
            # outline each header cell in red
            page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)
        print(f"  Table bbox: {tab.bbox}")
        # outline the whole table in green
        page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])
        print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")

Extracted PyMuPDF table is: bbox = (72.0375, 72.0625, 540.4875, 561.0)
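
One thing worth checking when comparing the two boxes is the unit. PyMuPDF reports coordinates in PDF points (72 per inch), while the PDF module's values look like pixel coordinates of a rendered page image. The sketch below assumes the page was rasterized at 200 DPI and that both systems use a top-left origin; neither is confirmed in this post, and `px_bbox_to_points` is a hypothetical helper, not part of either library. It only rescales the pixel box into points so the two can be compared side by side.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def px_bbox_to_points(bbox_px: BBox, dpi: float = 200.0) -> BBox:
    """Rescale an (x1, y1, x2, y2) pixel box to PDF points.

    Assumes the page image was rendered at `dpi` dots per inch
    (200 is a guess here) and that the origin is the top-left corner.
    """
    scale = 72.0 / dpi  # 72 PDF points per inch
    x1, y1, x2, y2 = bbox_px
    return (x1 * scale, y1 * scale, x2 * scale, y2 * scale)


# Bounding box reported by the PDF module in this post
pdf_module_bbox = (201, 201, 1503, 1328)
# Bounding box reported by PyMuPDF in this post
pymupdf_bbox = (72.0375, 72.0625, 540.4875, 561.0)

converted = px_bbox_to_points(pdf_module_bbox)
for name, a, b in zip(("x1", "y1", "x2", "y2"), converted, pymupdf_bbox):
    print(f"{name}: converted={a:.2f}  pymupdf={b:.2f}  diff={a - b:+.2f}")
```

Under that assumption the left, top, and right edges come out at roughly 72.36, 72.36, and 541.08 points, which is within about a point of the PyMuPDF values. The bottom edge comes out near 478 points versus 561, so the scale factor alone does not account for that difference; the two detectors appear to disagree on where the table ends.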

grahama1970 · Sep 23 '24 13:09