Emit \n\n for paragraph breaks using a higher vertical-gap threshold
pdfplumber inserts \n between lines based on the y_tolerance parameter.
It would be great to also detect paragraph breaks: when a larger vertical gap is found (above a separate threshold), emit \n\n instead of \n. This would make paragraph boundaries detectable.
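To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not pdfplumber code: the function name join_lines_with_paragraph_breaks and the paragraph_gap parameter are hypothetical, and the input is assumed to be a list of dicts with text, top and bottom keys, already sorted from top to bottom.

```python
def join_lines_with_paragraph_breaks(lines, paragraph_gap=6.0):
    """Join text lines with "\n", or "\n\n" where the vertical gap is large.

    `lines` is assumed to be a list of dicts with "text", "top" and "bottom"
    keys, sorted top to bottom. `paragraph_gap` is a hypothetical threshold
    in the same units as the coordinates; it is not a pdfplumber parameter.
    """
    parts = []
    prev_bottom = None
    for line in lines:
        if prev_bottom is not None:
            # Gap between the bottom of the previous line and the top of this one.
            gap = line["top"] - prev_bottom
            parts.append("\n\n" if gap > paragraph_gap else "\n")
        parts.append(line["text"])
        prev_bottom = line["bottom"]
    return "".join(parts)


# Made-up coordinates: the first two lines are 2 units apart, the third is 13 units lower.
sample_lines = [
    {"text": "First line of a paragraph", "top": 10.0, "bottom": 20.0},
    {"text": "that wraps onto a second line.", "top": 22.0, "bottom": 32.0},
    {"text": "A new paragraph starts here.", "top": 45.0, "bottom": 55.0},
]
result = join_lines_with_paragraph_breaks(sample_lines, paragraph_gap=6.0)
```

The line dicts returned by page.extract_text_lines() should fit this shape, though I have only tried the idea on hand-written data like the sample above, and a sensible threshold probably depends on the font size and line spacing of the document.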
Thank you
If I'm understanding the question correctly, .extract_text(layout=True, ...) may produce the sort of output you're seeking. Or not quite?
I tested using extract_text(layout=True, ...), but I didn’t get the expected behavior.
Specifically, the method emits double line breaks where the document has no paragraph break. In a paragraph where the text simply wraps because it doesn't fit on one line, the extracted result introduces extra line breaks, which makes it look as if the paragraph were split into several.
Additionally, it generates a lot of unnecessary spaces and empty lines in sections where there are images or other non-text elements.