Incorrect order of text lines (use_text_flow=True)
Describe the bug
On certain PDFs, lines are returned in an unexpected order when use_text_flow is set to True.
Have you tried repairing the PDF?
Yes
Code to reproduce the problem
import pdfplumber
pdf_path = '/path/to/file.pdf'
with pdfplumber.open(pdf_path, repair=True) as pdf:
    for page in pdf.pages:
        lines = page.extract_text_lines(use_text_flow=True)
        for line in lines:
            print(line['text'])
PDF file
how-great-the-wisdom-and-the-love_bi.pdf
Expected behavior
Lines should be returned in this order:
- Stap tingbaot broken bodi blong Kraes, Taem yumi brekem bred. Dring wora long kap blong yumi witnes, Yumi putum Kraes long fored.
- Plan blong Papa God hem i komplit Blong savem yumi long ol sin. Hem i tekem Jastis, Lav mo Mersi Blong mekem plan blong Salvesen.
Actual behavior
Lines are returned in this order:
- Stap tingbaot broken bodi blong Kraes, Taem yumi brekem bred.
- Plan blong Papa God hem i komplit Blong savem yumi long ol sin. Hem i tekem Jastis, Lav mo Mersi Blong mekem plan blong Salvesen. Dring wora long kap blong yumi witnes, Yumi putum Kraes long fored.
Environment
- pdfplumber version: 0.11.6
- Python version: 3.12.8
- OS: macOS 15.4 Sequoia
Here's another PDF with a similar issue – sorry, this is a complicated PDF, because it has music and non-Unicode fonts. I removed the music fonts to reduce complexity when debugging – hopefully that helps!
Code to reproduce the problem
import pdfplumber
pdf_path = '/path/to/file.pdf'
armenian_character_map = {' ': ' ', '(': '(', ')': ')', ',': ',', '-': '-', '.': '․', '/': '/', '0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9', ':': '։', ';': ';', 'A': 'Ա', 'B': 'Բ', 'C': 'Ե', 'D': 'Դ', 'E': 'Է', 'F': 'Ֆ', 'G': 'Յ', 'H': 'Հ', 'I': 'Ի', 'J': 'Ջ', 'K': 'Կ', 'L': 'Լ', 'M': 'Մ', 'N': 'Ն', 'O': 'Օ', 'P': 'Պ', 'Q': '՜', 'R': 'Ր', 'S': 'Ս', 'T': 'Տ', 'U': 'Ւ', 'V': 'Վ', 'W': 'ղ', 'X': 'Խ', 'Y': 'Գ', 'Z': 'Զ', '`': '՚', 'a': 'ա', 'b': 'բ', 'c': 'ե', 'd': 'դ', 'e': 'է', 'f': 'ֆ', 'g': 'գ', 'h': 'հ', 'i': 'ի', 'j': 'ջ', 'k': 'կ', 'l': 'լ', 'm': 'մ', 'n': 'ն', 'o': 'օ', 'p': 'պ', 'q': 'ձ', 'r': 'ր', 's': 'ս', 't': 'տ', 'u': 'ւ', 'v': 'վ', 'w': 'և', 'x': 'խ', 'y': 'յ', 'z': 'զ', '{': 'փ', '}': '՞', '¡': '՝', '£': 'չ', '¥': 'Ճ', '©': '©', 'ª': 'Փ', '¬': 'Ը', '®': 'ո', '°': 'ճ', '´': 'շ', 'µ': 'Չ', 'º': 'Ո', '¿': 'Ց', 'Æ': 'Շ', 'Ø': 'ց', 'ß': 'ք', 'æ': 'ծ', 'ø': 'Թ', 'π': 'Ք', '–': '–', '“': '«', '”': '»', '‹': '՛', '™': 'ռ', 'Ω': 'Ծ', '∂': 'Ձ', '∏': 'Ղ', '√': 'Ժ', '∞': 'թ', '∫': 'Ռ', '≤': 'ը', '≥': 'ժ'}
with pdfplumber.open(pdf_path, repair=True) as pdf:
    lines = pdf.pages[0].extract_text_lines(use_text_flow=True)
    for line in lines:
        # Map each character from the non-Unicode font encoding to Armenian
        decoded_text = ''.join(armenian_character_map.get(character, character) for character in line['text'])
        print(line['text'].ljust(40), '|', decoded_text)
PDF file
Expected behavior
Lines should be returned in this order:
Խոսք՝ Վիլյամ Քլեյթըն 1814–1879 Երաժշտություն՝ Անգլիական ժողովրդական երգ Վարդապետություն և Ուխտեր 61։36–39 Վարդապետություն և Ուխտեր 59։1–4
Actual behavior
Lines are returned in this order:
Վիլյամ Քլեյթըն 1814–1879 Անգլիական ժողովրդական երգ Խոսք՝ Երաժշտություն՝ Վարդապետություն և Ուխտեր 61։36–39 Վարդապետություն և Ուխտեր 59։1–4
Hmm, perhaps I'm misunderstanding the issue raised here, but the purpose of use_text_flow=True is to parse the text by following the text's internal representation order rather than strictly the text's position on the page. So I would, in fact, expect that on some PDFs the line order differs from what you see on the page. If line order is important and use_text_flow=True is needed for another reason, then one could use .extract_words(...) or .extract_text_lines(...), and then use the returned objects' positions to reorder them.
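A minimal sketch of that reordering approach, for illustration only. It assumes each line dict returned by .extract_text_lines(...) carries "top" and "x0" keys alongside "text". The helper names sort_lines_by_position and extract_lines_in_visual_order, and the y_tolerance parameter of the helper, are hypothetical and not part of pdfplumber.

```python
def sort_lines_by_position(lines, y_tolerance=3.0):
    """Return lines ordered top-to-bottom, then left-to-right.

    Lines whose "top" is within y_tolerance of the first line in a row
    are treated as belonging to that row (hypothetical heuristic).
    """
    ordered = sorted(lines, key=lambda line: line["top"])
    rows = []
    for line in ordered:
        if rows and abs(line["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    # Within each row, order by the left edge of the line
    return [line for row in rows for line in sorted(row, key=lambda l: l["x0"])]


def extract_lines_in_visual_order(pdf_path, page_index=0):
    """Extract lines with use_text_flow=True, then reorder them by position."""
    import pdfplumber  # imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        lines = pdf.pages[page_index].extract_text_lines(use_text_flow=True)
    return [line["text"] for line in sort_lines_by_position(lines)]
```

Note that sorting purely by position may interleave multi-column layouts, so this is only suitable when the visual top-to-bottom order is the order you want.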
Thanks! Is there a way to verify the order of the text in the PDF's internal representation? I assumed it was something to do with the word grouping after extraction, similar to https://github.com/jsvine/pdfplumber/issues/1279.
My code handles this. If you're on Windows you can try it, or go through my code line by line ^^
https://github.com/kalle07/parsing
@samuelbradshaw, could you expand on what you mean by "verify the order"? In any case, one possible answer: Perhaps the simplest way would be to examine page.chars, which lists all characters in the order the PDF's commands describe them.
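One rough way to inspect that order, sketched here as an assumption rather than an official recipe: walk page.chars in the order given and start a new run whenever the vertical position jumps. It assumes each char dict has "text" and "top" keys. The helper names stream_order_runs and print_stream_order, and the y_tolerance parameter, are hypothetical.

```python
def stream_order_runs(chars, y_tolerance=1.0):
    """Group consecutive chars into runs, breaking when "top" changes.

    The chars are NOT sorted, so the runs come back in the order the
    characters were listed (i.e. the order found in page.chars).
    """
    runs = []
    prev_top = None
    for char in chars:
        if prev_top is None or abs(char["top"] - prev_top) > y_tolerance:
            runs.append([char["top"], char["text"]])  # start a new run
        else:
            runs[-1][1] += char["text"]  # same baseline: extend current run
        prev_top = char["top"]
    return [(top, text) for top, text in runs]


def print_stream_order(pdf_path, page_index=0):
    """Print each run with its vertical position, in page.chars order."""
    import pdfplumber  # imported here so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        for top, text in stream_order_runs(pdf.pages[page_index].chars):
            print(f"{top:8.1f} | {text}")
```

If the printed "top" values jump back and forth instead of increasing steadily, the PDF's internal order differs from the visual order, which would be consistent with the output seen with use_text_flow=True.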