img2table
PDF table.box is inaccurate?
Hi. I'm trying to align the bounding boxes produced by img2table's PDF (text extraction) method, shown below, with PyMuPDF's bounding boxes. The bounding box from img2table's Image module is reasonably accurate and can be correlated with PyMuPDF's bounding box, but the one from the PDF module is off. Is this a known issue, or is there a workaround?
PyMuPDF bounding box: (72.0375, 72.0625, 540.4875, 561.0)
img2table bounding box (PDF module): (201, 201, 1503, 1328)
Much appreciation in advance
Extra for debugging:
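img2table using the PDF (text extraction) module:
# Extract tables
# (`pdf` is the img2table PDF document and `tesseract_ocr` the OCR instance, both created earlier and not shown)
extracted_tables = pdf.extract_tables(ocr=tesseract_ocr,
                                      implicit_rows=False,
                                      borderless_tables=False,
                                      min_confidence=50)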
extracted_tables
The table extracted by img2table is:
bbox = (201, 201, 1503, 1328)
PyMuPDF:
import fitz  # PyMuPDF

doc = fitz.open(pdf_path)
# Note: the range starts at 1, so the first page (index 0) is skipped
for page_num in range(1, len(doc)):
    page = doc[page_num]
    tabs = page.find_tables()  # detect the tables
    # print(page_num, tabs)
    print(page.rect.height)  # page height in PDF points
    for i, tab in enumerate(tabs):  # iterate over all tables
        # outline each header cell in red
        for cell in tab.header.cells:
            page.draw_rect(cell, color=fitz.pdfcolor["red"], width=0.3)
        print(f" Table bbox: {tab.bbox}")
        # outline the whole table in green
        page.draw_rect(tab.bbox, color=fitz.pdfcolor["green"])
        print(f"Table {i} column names: {tab.header.names}, external: {tab.header.external}")
The table extracted with PyMuPDF is:
bbox = (72.0375, 72.0625, 540.4875, 561.0)
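One thing worth ruling out is a unit mismatch rather than an inaccurate box. PyMuPDF reports coordinates in PDF points (1/72 inch), while the img2table PDF box looks like pixel coordinates of a rendered page image. The sketch below assumes img2table rasterizes PDF pages at 200 DPI; that value is an assumption, not something confirmed in this thread, and `pixels_to_points` is a hypothetical helper, not part of either library.

```python
# Assumption: img2table renders PDF pages at 200 DPI before detecting tables.
# PDF points are defined as 1/72 inch, which is the unit PyMuPDF reports.
ASSUMED_DPI = 200
POINTS_PER_INCH = 72


def pixels_to_points(bbox, dpi=ASSUMED_DPI):
    """Convert an (x0, y0, x1, y1) pixel box to PDF points (hypothetical helper)."""
    scale = POINTS_PER_INCH / dpi
    return tuple(round(value * scale, 2) for value in bbox)


img2table_bbox = (201, 201, 1503, 1328)        # box reported by the PDF module
pymupdf_bbox = (72.0375, 72.0625, 540.4875, 561.0)  # box reported by PyMuPDF

converted = pixels_to_points(img2table_bbox)
print("img2table bbox in points:", converted)
print("PyMuPDF bbox in points:  ", pymupdf_bbox)
```

Under that assumption the left, top, and right edges land within about a point of PyMuPDF's values. The bottom edge still differs noticeably, which would suggest the two detectors disagree on where the table ends rather than on units.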