
bug: correctly combine words spanning multiple lines

Open Coniferish opened this issue 2 years ago • 2 comments

The bug: When partitioning PDFs using the auto strategy, some elements contain words that were hyphenated across a line break in the source. Although the line separators are removed in the final element.text, the hyphen remains and the word stays split.

Example: When partitioning example-docs/layout-parser-paper-fast.pdf, the word "distribution" in the numbered list spans two lines and is left broken as "distribu- tion" in the combined list element:

(pdb) elements[22].text
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'

Note: although this ListItem passes through _combine_list_elements, it is unlike the other list items in this document, which were broken and needed to be combined. This suggests the bug occurs somewhere earlier in the call stack.
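Until the root cause is found, the symptom can be patched on the caller's side. The following is a minimal sketch of such a workaround, not a fix inside unstructured: `rejoin_hyphenated` is a hypothetical helper name, and the rule it applies (a hyphen between two lowercase letters, followed by whitespace, is treated as a line-break hyphen) is an assumption based only on the example above.

```python
import re

# Hypothetical helper, not part of unstructured.
# Assumption: a hyphen that directly follows a lowercase letter and is itself
# followed by whitespace and another lowercase letter is a leftover
# line-break hyphen, as in "distribu- tion".
_BROKEN_WORD = re.compile(r"(?<=[a-z])-\s+(?=[a-z])")


def rejoin_hyphenated(text: str) -> str:
    """Remove the hyphen and whitespace left behind by a line-break split."""
    return _BROKEN_WORD.sub("", text)


sample = "for the easy sharing, distribu- tion, and discussion of DIA models"
print(rejoin_hyphenated(sample))
```

It could be applied to each element's text after partitioning, for example `rejoin_hyphenated(elements[22].text)`. The heuristic is lossy: a suspended hyphen such as "pre- and post-processing" would be collapsed as well, so this is only suitable where that risk is acceptable.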

Coniferish • Dec 07 '23 17:12

Closing as inactive. Feel free to reopen if this is still a problem and you can provide a file that reproduces this behavior.

scanny • Dec 16 '24 19:12

Still an issue:

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="example-docs/pdf/layout-parser-paper-fast.pdf")
elements[22].text

Coniferish • Dec 17 '24 02:12