bug: correctly combine words spanning multiple lines
The bug
When partitioning PDFs with the auto strategy, some elements contain words that are hyphenated across a line break. The line separators are removed in the final element.text, but the hyphen remains, followed by a space where the line break used to be.
Example:
When partitioning example-docs/layout-parser-paper-fast.pdf, the word "distribution" in the numbered list is split across two lines and is left broken in the combined list element:
(pdb) elements[22].text
'4. A DL model hub and community platform for the easy sharing, distribu- tion, and discussion of DIA models and pipelines, to promote reusability, reproducibility, and extensibility (Section 4)'
Note: although this ListItem passes through _combine_list_elements, it is not one of the list items in this document that were broken apart and had to be recombined, so the bug occurs somewhere earlier in the call stack.
Closing as inactive. Feel free to reopen if this is still a problem and you can provide a file that reproduces this behavior.
Still an issue:
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="example-docs/pdf/layout-parser-paper-fast.pdf")
elements[22].text
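Until this is fixed upstream, a post-processing pass over element.text may be enough as a stopgap. The sketch below is only an assumption about what such a pass could look like: `join_hyphenated_fragments` is a hypothetical helper, not part of unstructured, and it assumes the broken word always appears as a lowercase letter, a hyphen, a single space, and another lowercase letter.

```python
import re

# Hypothetical helper, not part of unstructured.
# Matches a hyphen plus one space that sits between two lowercase letters,
# which is the residue left when a line-wrapped word is joined with a space.
_BROKEN_WORD = re.compile(r"(?<=[a-z])- (?=[a-z])")


def join_hyphenated_fragments(text: str) -> str:
    """Remove the '- ' left inside words that were hyphenated across lines."""
    return _BROKEN_WORD.sub("", text)


sample = "for the easy sharing, distribu- tion, and discussion of DIA models"
print(join_hyphenated_fragments(sample))
```

This is a heuristic and will also merge legitimate suspended hyphens such as "pre- and post-processing", so it is not a substitute for fixing the join where the lines are combined.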