unstructured
bug/PDF elements out of order
Describe the bug
The element order is wrong when partitioning a PDF. In the screenshots, the blue and red circles that highlight text are swapped in the output image compared to their correct placement in the original PDF.
To Reproduce
Run PDF partition using the Python SDK with the auto, fast, and hi_res strategies.
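A minimal sketch of the reproduction step, assuming "Python SDK" here means the local `unstructured` library and its `partition_pdf` function. The helper names `partition_texts`, `first_divergence`, and `report`, and the idea of comparing strategies against each other, are illustrative and not part of the original report; the PDF from the screenshots is not attached, so the path is left to the caller.

```python
from typing import Dict, List, Optional

STRATEGIES = ("auto", "fast", "hi_res")


def partition_texts(pdf_path: str, strategy: str) -> List[str]:
    """Return the text of each element, in output order, for one strategy."""
    # Imported lazily so the comparison helper below can be used
    # even where unstructured is not installed.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=pdf_path, strategy=strategy)
    return [el.text for el in elements]


def first_divergence(a: List[str], b: List[str]) -> Optional[int]:
    """Index of the first position where two sequences differ, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def report(pdf_path: str) -> Dict[str, List[str]]:
    """Partition with every strategy and print the element order for each."""
    orders = {s: partition_texts(pdf_path, s) for s in STRATEGIES}
    for strategy, texts in orders.items():
        print(f"--- {strategy} ---")
        for i, text in enumerate(texts):
            print(i, repr(text[:60]))
    return orders
```

Calling `report("path/to/file.pdf")` prints the element order per strategy so it can be compared by eye with the PDF; `first_divergence(orders["fast"], orders["hi_res"])` gives the first index where two strategies disagree. Agreement between strategies does not prove the order is correct, since all three may share the same misordering.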
Expected behavior
The element order in the output image should match the placement and color coding (blue and red circles) in the original PDF document.
Screenshots
Environment Info
OS version: macOS-14.2.1-arm64-arm-64bit
Python version: 3.10.12
unstructured version: 0.12.1.dev11
unstructured-inference version: 0.7.18
pytesseract version: 0.3.10
Torch version: 2.1.1
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: ==> libmagic: stable 5.45 (bottled)
LibreOffice version: ==> libreoffice: 7.6.4
Additional context
Similar issue: https://github.com/Unstructured-IO/unstructured/issues/2208