
Emit \n\n for paragraph breaks using a higher vertical-gap threshold

Open fabrii opened this issue 2 months ago • 2 comments

pdfplumber inserts \n between lines based on the y_tolerance parameter.

It would be great to also detect paragraph breaks: when a larger vertical gap is found (above a separate threshold), emit \n\n instead of \n. This would make paragraph boundaries detectable.
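To make the request concrete, here is a minimal sketch of the idea, not pdfplumber's implementation. The function name `join_lines_with_paragraphs` and the `paragraph_gap` parameter are hypothetical, and the sketch assumes each line is a dict with `text`, `top`, and `bottom` keys, with `top` increasing down the page.

```python
from typing import Dict, List, Union

Line = Dict[str, Union[str, float]]


def join_lines_with_paragraphs(lines: List[Line], paragraph_gap: float = 8.0) -> str:
    """Join text lines, using "\n\n" where the vertical gap is large.

    `paragraph_gap` is a hypothetical second threshold (in PDF points),
    separate from y_tolerance: a gap above it counts as a paragraph break.
    """
    # Process lines top to bottom.
    ordered = sorted(lines, key=lambda ln: ln["top"])
    parts: List[str] = []
    prev_bottom = None
    for ln in ordered:
        if prev_bottom is not None:
            # Gap between the previous line's bottom and this line's top.
            gap = ln["top"] - prev_bottom
            parts.append("\n\n" if gap > paragraph_gap else "\n")
        parts.append(str(ln["text"]))
        prev_bottom = ln["bottom"]
    return "".join(parts)


# Hand-made sample data standing in for lines extracted from a page.
sample = [
    {"text": "First line of a paragraph that", "top": 100.0, "bottom": 110.0},
    {"text": "wraps onto a second line.", "top": 112.0, "bottom": 122.0},
    {"text": "A new paragraph.", "top": 140.0, "bottom": 150.0},
]
print(join_lines_with_paragraphs(sample))
```

With a real PDF, the line dicts could likely come from `page.extract_text_lines()`, which I believe returns `text`, `top`, and `bottom` for each line. A sensible value for `paragraph_gap` depends on the document's font size and leading, so it would need tuning per document.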

Thank you

fabrii, Nov 12 '25 15:11

If I'm understanding the question correctly, .extract_text(layout=True, ...) may produce the sort of output you're seeking. Or not quite?

jsvine, Nov 13 '25 04:11

I tested using extract_text(layout=True, ...), but I didn’t get the expected behavior.

Specifically, the method inserts double line breaks where the source has none. When a paragraph simply wraps because its text doesn't fit on one line, the extracted result contains extra line breaks, so a single paragraph looks as if it were split into several.

Additionally, it generates a lot of unnecessary spaces and empty lines in sections where there are images or other non-text elements.
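As a partial workaround for the padding only, the output could be post-processed. This is a rough sketch with a hypothetical helper name, `tidy_layout_text`. It strips trailing spaces and collapses runs of empty lines, but it cannot tell a wrapped line from a real paragraph break, so it does not address the first problem.

```python
import re


def tidy_layout_text(text: str) -> str:
    """Strip trailing whitespace and collapse runs of blank lines."""
    # Remove trailing spaces on every line; whitespace-only lines become empty.
    stripped = "\n".join(line.rstrip() for line in text.split("\n"))
    # Collapse three or more consecutive newlines into exactly two.
    collapsed = re.sub(r"\n{3,}", "\n\n", stripped)
    # Drop leading and trailing empty lines.
    return collapsed.strip("\n")
```

Leading indentation is left alone here, since `layout=True` uses it to preserve horizontal position.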

fabrii, Nov 17 '25 15:11