Text flow in sheet music PDFs

Open • samuelbradshaw opened this issue 11 months ago • 1 comment

Describe the bug

I'm running into issues with text flow when extracting text from PDF sheet music with Page.extract_words(). PDFs created by music notation software like Finale or MuseScore arrange the lyric syllables in columns, but the natural reading order is by row. To extract lyric syllables by row, I'm passing use_text_flow = False. However, the extracted words still don't come back in row order.

Have you tried repairing the PDF?

Yes (issue still present)

Code to reproduce the problem

import pdfplumber
import pprint

pdf_path = '/path/to/file.pdf'

music_notation_fonts = ['EngraverTextT', 'Maestro', 'DingPI']

with pdfplumber.open(pdf_path) as pdf:
  extracted_words_by_page = []
  for page in pdf.pages:
    filtered_words = []
    words = page.extract_words(use_text_flow = False, extra_attrs = ['fontname', 'size'])
    for word in words:
      # Skip text from music notation fonts
      if any([font in word['fontname'] for font in music_notation_fonts]):
        continue
      filtered_words.append(word)
    extracted_words_by_page.append(filtered_words)

pprint.pp(extracted_words_by_page)

PDF file

https://assets.churchofjesuschrist.org/2e/68/2e686294c65060b706e80164552ffe7fec02abc9/redeemer_of_israel.pdf

Expected behavior

For the above PDF, words should be extracted in this order: Redeemer of Israel 6 Confidently = 84–100 1. Re - deem - er of Is - rael, Our on - ly de - light, On 2. We know he is com - ing To gath - er his sheep And 3. How long we have wan - dered As stran - gers in sin And 4. As chil - dren of Zi - on, Good tid - ings for us. The etc.

Actual behavior

Words were extracted in this order: 1. Re 2. We 3. How 4. As Confidently = 84–100 deem know long chil etc.

Screenshots

N/A

Environment

pdfplumber version: 0.11.5
Python version: 3.11.8
OS: macOS 15.2 Sequoia

Additional context

As a bonus, I would love it if pdfplumber had a setting to automatically detect sheet music and correct the flow of inline lyrics, while still handling the flow of text above and below the sheet music (which may have columns, for example, as in this PDF).
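Until something like that exists, a manual re-ordering pass is one possible workaround. The following is only a minimal sketch, not part of pdfplumber: order_words_by_row and its tolerance parameter are hypothetical names, and it assumes each word dict carries the 'x0' and 'top' keys that extract_words() returns. It groups words whose 'top' values are within the tolerance of each other into one row, then sorts each row left to right. It makes no attempt to handle multi-column text above or below the music.

```python
def order_words_by_row(words, tolerance=3.0):
    """Hypothetical helper: group word dicts into rows by 'top', then sort each row by 'x0'."""
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        # Start a new row when this word sits more than `tolerance` below the previous one
        if rows and abs(word["top"] - rows[-1][-1]["top"]) <= tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Flatten the rows, reading each one left to right
    return [w for row in rows for w in sorted(row, key=lambda w: w["x0"])]


# Synthetic word dicts (made-up coordinates) arranged column by column, as in the report
sample_words = [
    {"text": "1.", "x0": 10, "top": 100.0},
    {"text": "2.", "x0": 10, "top": 112.0},
    {"text": "Re", "x0": 30, "top": 100.4},
    {"text": "We", "x0": 30, "top": 112.2},
    {"text": "deem", "x0": 50, "top": 99.8},
    {"text": "know", "x0": 50, "top": 111.9},
]
ordered_texts = [w["text"] for w in order_words_by_row(sample_words)]
print(ordered_texts)
```

The tolerance of 3.0 is a guess; it would need tuning to the spacing between lyric lines in a given score.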

Jan 19 '25 09:01 samuelbradshaw

Thank you for providing the PDF and reproducible code, @samuelbradshaw! My hunch is that this relates to the grouping on size (of which there are a few variations very close to 9) and fontname, because when I skip that part and instead filter out the musical notation with page.filter(...), I get the ordering I think you'd expect:

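import pdfplumber

pdf_path = '/path/to/file.pdf'  # same placeholder path as in the report

def test(obj):
    # Keep objects set in the text font; objects with no fontname at all also pass
    return "PalatinoldsLat" in obj.get("fontname", "PalatinoldsLat")

with pdfplumber.open(pdf_path) as pdf:
    # Filter the page first, then extract words without extra_attrs
    filtered = pdf.pages[0].filter(test)
    words = filtered.extract_words()

for word in words:
    print(word["text"])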

Redeemer
of
Israel
6
Confidently
=
84–100
1.
Re
-
deem
-
er
of
Is
-
rael,
Our
on
-
ly
de
-
light,
On
2.
We
know
he
is
com
-
...
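
One way to check the hunch about near-identical sizes is to tally the distinct (fontname, size) pairs among the extracted words, with and without rounding. This is only a sketch: count_font_variants is a hypothetical helper, not a pdfplumber function, and the sample dicts below use made-up sizes rather than values read from the PDF. It assumes the words were extracted with extra_attrs = ['fontname', 'size'], so that each dict has those two keys.

```python
from collections import Counter


def count_font_variants(words, ndigits=None):
    """Hypothetical helper: count (fontname, size) pairs, optionally rounding size first."""
    counts = Counter()
    for word in words:
        size = word["size"] if ndigits is None else round(word["size"], ndigits)
        counts[(word["fontname"], size)] += 1
    return counts


# Made-up sizes that differ only slightly around 9
sample_words = [
    {"text": "Re", "fontname": "PalatinoldsLat", "size": 9.0},
    {"text": "deem", "fontname": "PalatinoldsLat", "size": 8.999},
    {"text": "er", "fontname": "PalatinoldsLat", "size": 9.001},
]
raw_variants = count_font_variants(sample_words)
rounded_variants = count_font_variants(sample_words, ndigits=1)
print(len(raw_variants), len(rounded_variants))
```

If the unrounded tally shows several sizes clustered near 9 for the same font, that would be consistent with the explanation above, though it would not prove it.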

Feb 09 '25 22:02 jsvine