unstructured bug/some tables in PDF not getting recognized

Open Ritesh1137 opened this issue 2 months ago • 1 comments

Describe the bug: I am processing PDF files of insurance sales brochures to identify tables. With the hi_res strategy and infer_table_structure set to True, I can identify most tables in the document consistently, but two particular tables are not identified.

To Reproduce: Process the file at this link: PDF File with tables - Insurance sales brochure

code to process docs:

! pip install langchain unstructured[all-docs] pydantic lxml langchainhub
! sudo apt-get install poppler-utils tesseract-ocr

from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf
# Get elements
raw_pdf_elements = partition_pdf(
    filename=path + "EndowmentPlan_JeevanLakshya.pdf",
    # Unstructured first finds embedded image blocks
    extract_images_in_pdf=False,
    # Use layout model (YOLOX) to get bounding boxes (for tables) and find titles
    # Titles are any sub-section of the document
    infer_table_structure=True,
    # Post processing to aggregate text once we have the title
    chunking_strategy="by_title",
    # Chunking params to aggregate text blocks
    # for v1, v2 = 3000, 1000
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
    image_output_dir_path=path,
)

Expected behavior: All tables in the document should be processed and available as HTML. In the processed output, every table except the ones shown below is.

Screenshots: These tables are not processed correctly; they come back as text elements rather than table elements.

[Screenshot: tables]

Environment Info: Running on Google Colab, default free tier.

May 09 '24 18:05 Ritesh1137

I recommend using our API and specifying the model with hi_res_model_name="layout_v1.1.0". This model is not supported in the open-source library.

from unstructured.partition.api import partition_via_api

elements = partition_via_api(
    filename=filename,
    api_key="<api_key>",  # replace with your API key
    strategy="hi_res",
    hi_res_model_name="layout_v1.1.0",
    chunking_strategy="by_title",
    max_characters=3500,
    new_after_n_chars=1500,
    combine_text_under_n_chars=250,
)

If you want to stick with open source, I suggest trying to "zero out the background color" as a preprocessing step before passing the document to partition:

First, identify the background color.
Then convert the pixels of that color to white.
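The two steps above could be sketched roughly as follows. This is only an illustration under assumptions, not a tested fix for this PDF. It assumes the background is the single most frequent color on the page, and that a page has already been rendered to a flat list of (R, G, B) pixel tuples. The function name `zero_out_background` and the `tolerance` parameter are hypothetical and not part of unstructured.

```python
from collections import Counter

WHITE = (255, 255, 255)


def zero_out_background(pixels, tolerance=30):
    """Return a copy of `pixels` with the dominant color replaced by white.

    pixels: flat list of (R, G, B) tuples for one rendered page.
    tolerance: hypothetical per-channel threshold; pixels this close to the
        detected background color are also treated as background.
    """
    if not pixels:
        return []

    # Step 1: assume the most frequent color on the page is the background.
    background, _count = Counter(pixels).most_common(1)[0]

    def is_background(pixel):
        # A pixel counts as background if every channel is within tolerance.
        return all(abs(c - b) <= tolerance for c, b in zip(pixel, background))

    # Step 2: turn background pixels white, leave everything else untouched.
    return [WHITE if is_background(p) else p for p in pixels]


# Tiny demo: a light-blue "page" with two black text pixels and one
# slightly different shade of the background.
page = [(200, 220, 255)] * 6 + [(0, 0, 0)] * 2 + [(205, 225, 250)]
cleaned = zero_out_background(page)
```

In practice the pixel data would come from rendering each PDF page to an image (for example with pdf2image and Pillow, since poppler-utils is already installed above), and the cleaned pages would be saved back to images or a PDF before partitioning. A suitable tolerance depends on the document and would need tuning.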

May 10 '24 05:05 christinestraub
