Incorrect order of text lines (use_text_flow=True)

Open samuelbradshaw opened this issue 9 months ago • 5 comments

Describe the bug

On certain PDFs, lines are returned in an unexpected order when use_text_flow is set to True.

Have you tried repairing the PDF?

Yes

Code to reproduce the problem

import pdfplumber

pdf_path = '/path/to/file.pdf'

with pdfplumber.open(pdf_path, repair=True) as pdf:
  for page in pdf.pages:
    lines = page.extract_text_lines(use_text_flow=True)
    for line in lines:
      print(line['text'])

PDF file

how-great-the-wisdom-and-the-love_bi.pdf

Expected behavior

Lines should be returned in this order:

  1. Stap tingbaot broken bodi blong Kraes, Taem yumi brekem bred. Dring wora long kap blong yumi witnes, Yumi putum Kraes long fored.
  2. Plan blong Papa God hem i komplit Blong savem yumi long ol sin. Hem i tekem Jastis, Lav mo Mersi Blong mekem plan blong Salvesen.

Actual behavior

Lines are returned in this order:

  1. Stap tingbaot broken bodi blong Kraes, Taem yumi brekem bred.
  2. Plan blong Papa God hem i komplit Blong savem yumi long ol sin. Hem i tekem Jastis, Lav mo Mersi Blong mekem plan blong Salvesen. Dring wora long kap blong yumi witnes, Yumi putum Kraes long fored.

Screenshots

Image

Environment

pdfplumber version: 0.11.6
Python version: 3.12.8
OS: macOS 15.4 Sequoia

samuelbradshaw avatar Apr 06 '25 05:04 samuelbradshaw

Here's another PDF with a similar issue – sorry, this is a complicated PDF, because it has music and non-Unicode fonts. I removed the music fonts to reduce complexity when debugging – hopefully that helps!

Code to reproduce the problem

import pdfplumber

pdf_path = '/path/to/file.pdf'

armenian_character_map = {' ': ' ', '(': '(', ')': ')', ',': ',', '-': '-', '.': '․', '/': '/', '0': '0', '1': '1', '2': '2', '3': '3', '4': '4', '5': '5', '6': '6', '7': '7', '8': '8', '9': '9', ':': '։', ';': ';', 'A': 'Ա', 'B': 'Բ', 'C': 'Ե', 'D': 'Դ', 'E': 'Է', 'F': 'Ֆ', 'G': 'Յ', 'H': 'Հ', 'I': 'Ի', 'J': 'Ջ', 'K': 'Կ', 'L': 'Լ', 'M': 'Մ', 'N': 'Ն', 'O': 'Օ', 'P': 'Պ', 'Q': '՜', 'R': 'Ր', 'S': 'Ս', 'T': 'Տ', 'U': 'Ւ', 'V': 'Վ', 'W': 'ղ', 'X': 'Խ', 'Y': 'Գ', 'Z': 'Զ', '`': '՚', 'a': 'ա', 'b': 'բ', 'c': 'ե', 'd': 'դ', 'e': 'է', 'f': 'ֆ', 'g': 'գ', 'h': 'հ', 'i': 'ի', 'j': 'ջ', 'k': 'կ', 'l': 'լ', 'm': 'մ', 'n': 'ն', 'o': 'օ', 'p': 'պ', 'q': 'ձ', 'r': 'ր', 's': 'ս', 't': 'տ', 'u': 'ւ', 'v': 'վ', 'w': 'և', 'x': 'խ', 'y': 'յ', 'z': 'զ', '{': 'փ', '}': '՞', '¡': '՝', '£': 'չ', '¥': 'Ճ', '©': '©', 'ª': 'Փ', '¬': 'Ը', '®': 'ո', '°': 'ճ', '´': 'շ', 'µ': 'Չ', 'º': 'Ո', '¿': 'Ց', 'Æ': 'Շ', 'Ø': 'ց', 'ß': 'ք', 'æ': 'ծ', 'ø': 'Թ', 'π': 'Ք', '–': '–', '“': '«', '”': '»', '‹': '՛', '™': 'ռ', 'Ω': 'Ծ', '∂': 'Ձ', '∏': 'Ղ', '√': 'Ժ', '∞': 'թ', '∫': 'Ռ', '≤': 'ը', '≥': 'ժ'}

with pdfplumber.open(pdf_path, repair=True) as pdf:
  lines = pdf.pages[0].extract_text_lines(use_text_flow=True)
  for line in lines:
    decoded_text = ''.join([armenian_character_map.get(character, character) for character in line['text']])
    print(line['text'].ljust(40), '|', decoded_text)

PDF file

come-come-ye-saints_hy.pdf

Expected behavior

Lines should be returned in this order:

Խոսք՝ Վիլյամ Քլեյթըն 1814–1879 Երաժշտություն՝ Անգլիական ժողովրդական երգ Վարդապետություն և Ուխտեր 61։36–39 Վարդապետություն և Ուխտեր 59։1–4

Actual behavior

Lines are returned in this order:

Վիլյամ Քլեյթըն 1814–1879 Անգլիական ժողովրդական երգ Խոսք՝ Երաժշտություն՝ Վարդապետություն և Ուխտեր 61։36–39 Վարդապետություն և Ուխտեր 59։1–4

Screenshots

Image

samuelbradshaw avatar Apr 12 '25 01:04 samuelbradshaw

Hmm, perhaps I'm misunderstanding the issue raised here, but the purpose of use_text_flow=True is to parse the text by following the order of the text's internal representation rather than strictly its position on the page. So I would, in fact, expect that on some PDFs the line order differs from what you see on the page. If line order is important and use_text_flow=True is important for another reason, then one could use .extract_words(...) or .extract_text_lines(...), and then use the returned objects' positions to reorder them.
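A minimal sketch of the position-based reordering suggested above. It assumes each line is a dict with "top" and "x0" keys, as in the objects returned by .extract_text_lines(...). The helper name sort_lines_by_position and its y_tolerance parameter are hypothetical, not part of pdfplumber, and the sample coordinates are made up for illustration.

```python
def sort_lines_by_position(lines, y_tolerance=3.0):
    """Order line dicts top-to-bottom, then left-to-right within a row.

    Hypothetical helper, not a pdfplumber API. Lines whose "top" values
    differ by at most y_tolerance are treated as sitting on the same row.
    """
    ordered = sorted(lines, key=lambda line: (line["top"], line["x0"]))
    rows = []
    for line in ordered:
        # Compare against the first line of the current row.
        if rows and abs(line["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(line)
        else:
            rows.append([line])
    return [
        line
        for row in rows
        for line in sorted(row, key=lambda item: item["x0"])
    ]


# Made-up sample data shaped like extract_text_lines() output.
sample_lines = [
    {"text": "right half of row 1", "top": 100.0, "x0": 300.0},
    {"text": "row 2", "top": 120.0, "x0": 50.0},
    {"text": "left half of row 1", "top": 101.5, "x0": 50.0},
]
sorted_texts = [line["text"] for line in sort_lines_by_position(sample_lines)]
print(sorted_texts)
```

With a real page, the input would be page.extract_text_lines(use_text_flow=True). Note that a plain top-then-left sort interleaves the columns of a multi-column layout, so it may not give the desired order on every PDF.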

jsvine avatar Apr 21 '25 03:04 jsvine

Thanks! Is there a way to verify the order of the text in the PDF's internal representation? I assumed it was something to do with the word grouping after extraction, similar to https://github.com/jsvine/pdfplumber/issues/1279.

samuelbradshaw avatar Apr 21 '25 16:04 samuelbradshaw

My code handles this. If you're on Windows you can try it, or go through my code line by line ^^

https://github.com/kalle07/parsing

kalle07 avatar May 17 '25 14:05 kalle07

> Thanks! Is there a way to verify the order of the text in the PDF's internal representation? I assumed it was something to do with the word grouping after extraction, similar to #1279.

@samuelbradshaw, could you expand on what you mean by "verify the order"? In any case, one possible answer: Perhaps the simplest way would be to examine page.chars, which lists all characters in the order the PDF's commands describe them.
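One rough way to do that inspection is sketched here. It assumes each entry of page.chars is a dict with "text" and "top" keys, where "top" grows as you move down the page. The helper find_upward_jumps and its y_tolerance parameter are hypothetical names, and the sample characters are invented. The idea is to walk the characters in the order given and flag every place where the next character sits noticeably higher on the page than the previous one, since that suggests the internal order departs from the visual top-to-bottom order.

```python
def find_upward_jumps(chars, y_tolerance=3.0):
    """Return (index, previous_text, current_text) for each upward jump.

    Hypothetical helper, not a pdfplumber API. An "upward jump" is a
    character whose "top" is more than y_tolerance smaller than that of
    the character immediately before it in the list.
    """
    jumps = []
    for index in range(1, len(chars)):
        previous, current = chars[index - 1], chars[index]
        if current["top"] < previous["top"] - y_tolerance:
            jumps.append((index, previous["text"], current["text"]))
    return jumps


# Invented sample: "C" is drawn low on the page, then "D" higher up.
sample_chars = [
    {"text": "A", "top": 100.0},
    {"text": "B", "top": 100.0},
    {"text": "C", "top": 140.0},
    {"text": "D", "top": 120.0},
]
stream_text = "".join(char["text"] for char in sample_chars)
jumps = find_upward_jumps(sample_chars)
print(stream_text, jumps)
```

On a real page, passing page.chars would show where in the character sequence such jumps occur, which can then be compared with the order of the extracted lines.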

jsvine avatar Jun 12 '25 02:06 jsvine