pdfplumber: Incorrect order of text lines (use_text_flow=True)

Describe the bug

On certain PDFs, lines are returned in an unexpected order when use_text_flow is set to True.

Have you tried repairing the PDF?

Yes

Code to reproduce the problem

import pdfplumber

pdf_path = '/path/to/file.pdf'

with pdfplumber.open(pdf_path, repair=True) as pdf:
  for page in pdf.pages:
    lines = page.extract_text_lines(use_text_flow=True)
    for line in lines:
      print(line['text'])

PDF file

how-great-the-wisdom-and-the-love_bi.pdf

Expected behavior

Lines should be returned in this order:

Stap tingbaot broken bodi blong Kraes,
Taem yumi brekem bred.
Dring wora long kap blong yumi witnes,
Yumi putum Kraes long fored.
Plan blong Papa God hem i komplit
Blong savem yumi long ol sin.
Hem i tekem Jastis, Lav mo Mersi
Blong mekem plan blong Salvesen.

Actual behavior

Lines are returned in this order:

Stap tingbaot broken bodi blong Kraes,
Taem yumi brekem bred.
Plan blong Papa God hem i komplit
Blong savem yumi long ol sin.
Hem i tekem Jastis, Lav mo Mersi
Blong mekem plan blong Salvesen.
Dring wora long kap blong yumi witnes,
Yumi putum Kraes long fored.

Environment

pdfplumber version: 0.11.6
Python version: 3.12.8
OS: macOS 15.4 Sequoia

Apr 06 '25 05:04 samuelbradshaw

Here's another PDF with a similar issue – sorry, this is a complicated PDF, because it has music and non-Unicode fonts. I removed the music fonts to reduce complexity when debugging – hopefully that helps!

Code to reproduce the problem

import pdfplumber

pdf_path = '/path/to/file.pdf'

armenian_character_map = {' ': ' ', '(': '(', ')': ')', ',': ',', '-': '-', '.': '․', '/': '/', '0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9', ':': '։', ';': ';', 'A': 'Ա', 'B': 'Բ', 'C': 'Ե', 'D': 'Դ', 'E': 'Է', 'F': 'Ֆ', 'G': 'Յ', 'H': 'Հ', 'I': 'Ի', 'J': 'Ջ', 'K': 'Կ', 'L': 'Լ', 'M': 'Մ', 'N': 'Ն', 'O': 'Օ', 'P': 'Պ', 'Q': '՜', 'R': 'Ր', 'S': 'Ս', 'T': 'Տ', 'U': 'Ւ', 'V': 'Վ', 'W': 'ղ', 'X': 'Խ', 'Y': 'Գ', 'Z': 'Զ', '`': '՚', 'a': 'ա', 'b': 'բ', 'c': 'ե', 'd': 'դ', 'e': 'է', 'f': 'ֆ', 'g': 'գ', 'h': 'հ', 'i': 'ի', 'j': 'ջ', 'k': 'կ', 'l': 'լ', 'm': 'մ', 'n': 'ն', 'o': 'օ', 'p': 'պ', 'q': 'ձ', 'r': 'ր', 's': 'ս', 't': 'տ', 'u': 'ւ', 'v': 'վ', 'w': 'և', 'x': 'խ', 'y': 'յ', 'z': 'զ', '{': 'փ', '}': '՞', '¡': '՝', '£': 'չ', '¥': 'Ճ', '©': '©', 'ª': 'Փ', '¬': 'Ը', '®': 'ո', '°': 'ճ', '´': 'շ', 'µ': 'Չ', 'º': 'Ո', '¿': 'Ց', 'Æ': 'Շ', 'Ø': 'ց', 'ß': 'ք', 'æ': 'ծ', 'ø': 'Թ', 'π': 'Ք', '–': '–', '“': '«', '”': '»', '‹': '՛', '™': 'ռ', 'Ω': 'Ծ', '∂': 'Ձ', '∏': 'Ղ', '√': 'Ժ', '∞': 'թ', '∫': 'Ռ', '≤': 'ը', '≥': 'ժ'}

with pdfplumber.open(pdf_path, repair=True) as pdf:
  lines = pdf.pages[0].extract_text_lines(use_text_flow=True)
  for line in lines:
    decoded_text = ''.join([armenian_character_map.get(character, character) for character in line['text']])
    print(line['text'].ljust(40), '|', decoded_text)

PDF file

come-come-ye-saints_hy.pdf

Expected behavior

Lines should be returned in this order:

Խոսք՝ Վիլյամ Քլեյթըն 1814–1879 Երաժշտություն՝ Անգլիական ժողովրդական երգ Վարդապետություն և Ուխտեր 61։36–39 Վարդապետություն և Ուխտեր 59։1–4

Actual behavior

Lines are returned in this order:

Վիլյամ Քլեյթըն 1814–1879 Անգլիական ժողովրդական երգ Խոսք՝ Երաժշտություն՝ Վարդապետություն և Ուխտեր 61։36–39 Վարդապետություն և Ուխտեր 59։1–4

Apr 12 '25 01:04 samuelbradshaw

Hmm, perhaps I'm misunderstanding the issue raised here, but the purpose of use_text_flow=True is to parse the text in the order of the PDF's internal representation rather than strictly by the text's position on the page. So I would, in fact, expect the line order on some PDFs to differ from what you see on the page. If line order is important and use_text_flow=True is important for another reason, then one could use .extract_words(...) or .extract_text_lines(...), and then use the returned objects' positions to reorder.
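
For illustration, here is a minimal sketch of that kind of position-based reordering. It assumes the line dicts carry "text", "top", and "x0" keys, as the results of extract_text_lines do; the function name reorder_lines_by_position and the y_tolerance default are hypothetical and not part of pdfplumber.

```python
def reorder_lines_by_position(lines, y_tolerance=3.0):
    """Return the line dicts sorted top-to-bottom, then left-to-right.

    Lines whose "top" lies within y_tolerance points of the first line
    in the current row are treated as one row and ordered by "x0".
    """
    # First pass: rough ordering by vertical, then horizontal position.
    ordered = sorted(lines, key=lambda line: (line["top"], line["x0"]))

    # Second pass: cluster lines with nearly equal "top" into rows.
    rows = []
    for line in ordered:
        if rows and abs(line["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])

    # Within each row, read left to right.
    return [line for row in rows for line in sorted(row, key=lambda l: l["x0"])]


# Usage sketch (needs pdfplumber and a real file, so it is left commented out):
# import pdfplumber
# with pdfplumber.open("/path/to/file.pdf") as pdf:
#     lines = pdf.pages[0].extract_text_lines(use_text_flow=True)
#     for line in reorder_lines_by_position(lines):
#         print(line["text"])
```

This simple sort reads straight across the page, so a multi-column layout would come out interleaved; in that case the lines would need to be split by column first.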

Apr 21 '25 03:04 jsvine

Thanks! Is there a way to verify the order of the text in the PDF's internal representation? I assumed it was something to do with the word grouping after extraction, similar to https://github.com/jsvine/pdfplumber/issues/1279.

Apr 21 '25 16:04 samuelbradshaw

My code gets it ... if you have Windows you can try it ... or go line by line through my code ^^

https://github.com/kalle07/parsing

May 17 '25 14:05 kalle07

> Thanks! Is there a way to verify the order of the text in the PDF's internal representation? I assumed it was something to do with the word grouping after extraction, similar to #1279.

@samuelbradshaw, could you expand on what you mean by "verify the order"? In any case, one possible answer: Perhaps the simplest way would be to examine page.chars, which lists all characters in the order the PDF's commands describe them.
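
As a rough sketch of how page.chars could be inspected, the helper below collapses consecutive characters into runs and starts a new run whenever the vertical position jumps. It assumes each char dict has "text", "top", and "x0" keys, as the entries in page.chars do; the name chars_to_runs and the y_tolerance default are hypothetical.

```python
def chars_to_runs(chars, y_tolerance=1.0):
    """Collapse char dicts into (top, x0, text) runs, keeping their order.

    A new run starts whenever "top" differs from the previous character's
    "top" by more than y_tolerance. The runs are never re-sorted, so they
    appear in the same order as the input characters.
    """
    runs = []
    prev_top = None
    for char in chars:
        if prev_top is None or abs(char["top"] - prev_top) > y_tolerance:
            # Vertical jump: record where the new run begins.
            runs.append([char["top"], char["x0"], char["text"]])
        else:
            runs[-1][2] += char["text"]
        prev_top = char["top"]
    return [tuple(run) for run in runs]


# Usage sketch (needs pdfplumber and a real file, so it is left commented out):
# import pdfplumber
# with pdfplumber.open("/path/to/file.pdf") as pdf:
#     for top, x0, text in chars_to_runs(pdf.pages[0].chars):
#         print(round(top), round(x0), text)
```

If the printed "top" values jump up and down the page, the internal order differs from the visual order, which would be consistent with the output seen under use_text_flow=True.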

Jun 12 '25 02:06 jsvine
