
bug/ the bottom part of the table is truncated when partition_pdf is used

Open: ohnohoya opened this issue 4 months ago • 1 comment

Describe the bug
I am using the partition_pdf function to extract tables from a publicly available PDF file (https://www.highmarkbcbswv.com/PDFFiles/ANSI-reason-codes.pdf). After running the OCR, the elements contain every row of the table (the table spans multiple pages). The extracted tables, however, are missing the last 5-6 rows of each page.

To Reproduce
Download the PDF file from https://www.highmarkbcbswv.com/PDFFiles/ANSI-reason-codes.pdf, save it locally, and assign its full path to INPUT_DOCUMENT:

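# Download the PDF file from https://www.highmarkbcbswv.com/PDFFiles/ANSI-reason-codes.pdf,
# save it locally, and set INPUT_DOCUMENT to its full path.
from unstructured.partition.pdf import partition_pdf

INPUT_DOCUMENT = ""  # fill in the full local path to the downloaded PDF

elements = partition_pdf(
    filename=INPUT_DOCUMENT,
    strategy="hi_res",
    infer_table_structure=True,
)

# Tables as reconstructed by the table-structure model (rows go missing here)
tables = [el for el in elements if el.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)
    print(" ")

# Raw text of the first 100 elements (every row is present here)
for element in elements[:100]:
    print(element.text)
    print(" ")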
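To put a number on the truncation, it may help to count the rows in each table's HTML, page by page. The following is a minimal sketch, not part of the original report: `RowCounter`, `count_html_rows`, and `summarize_tables` are hypothetical helper names, and the sketch assumes each Table element exposes `metadata.text_as_html` and, where available, `metadata.page_number`.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Counts <tr> start tags in an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "tr":
            self.rows += 1


def count_html_rows(html):
    """Return the number of <tr> rows in an HTML string (0 for None or empty)."""
    counter = RowCounter()
    counter.feed(html or "")
    counter.close()
    return counter.rows


def summarize_tables(tables):
    """Return a (page_number, row_count) pair for each Table element.

    Assumes each element has a `metadata` object with `text_as_html`;
    `page_number` falls back to None when it is not set.
    """
    return [
        (
            getattr(t.metadata, "page_number", None),
            count_html_rows(getattr(t.metadata, "text_as_html", None)),
        )
        for t in tables
    ]
```

Running `for page, rows in summarize_tables(tables): print(page, rows)` after the script above gives one row count per extracted table, which can then be compared against the number of rows visible on the corresponding PDF page.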

Environment Info
python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.1.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: macOS 14.2.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: version 3.28.3
Libc version: N/A
Python version: 3.11.7 | packaged by conda-forge | (main, Dec 23 2023, 14:38:07) [Clang 16.0.6 ] (64-bit runtime)
Python platform: macOS-14.2.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU: Apple M1 Pro
Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.3
[pip3] torch==2.1.2
[pip3] torchvision==0.16.2
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.2 pypi_0 pypi
[conda] torchvision 0.16.2 pypi_0 pypi

Feb 16 '24 00:02 ohnohoya

@ohnohoya Did you try to extract tables using the freemium API? The API has better support for table extraction than our open source offerings.

Feb 21 '24 19:02 christinestraub