Text flow in sheet music PDFs
Describe the bug
I'm running into issues with text flow when extracting text from PDF sheet music with Page.extract_words(). PDFs created by music notation software like Finale or MuseScore arrange the lyric syllables in columns, but the natural reading order is by row. To extract lyric syllables by row, I'm using use_text_flow = False. However, the extracted words aren't ordered correctly.
Have you tried repairing the PDF?
Yes (issue still present)
Code to reproduce the problem
import pprint

import pdfplumber

pdf_path = '/path/to/file.pdf'
music_notation_fonts = ['EngraverTextT', 'Maestro', 'DingPI']

with pdfplumber.open(pdf_path) as pdf:
    extracted_words_by_page = []
    for page in pdf.pages:
        filtered_words = []
        words = page.extract_words(use_text_flow=False, extra_attrs=['fontname', 'size'])
        for word in words:
            # Skip text from music notation fonts
            if any(font in word['fontname'] for font in music_notation_fonts):
                continue
            filtered_words.append(word)
        extracted_words_by_page.append(filtered_words)

pprint.pp(extracted_words_by_page)
PDF file
https://assets.churchofjesuschrist.org/2e/68/2e686294c65060b706e80164552ffe7fec02abc9/redeemer_of_israel.pdf
Expected behavior
For the above PDF, words should be extracted in this order:
Redeemer of Israel 6
Confidently = 84–100
1. Re - deem - er of Is - rael, Our on - ly de - light, On
2. We know he is com - ing To gath - er his sheep And
3. How long we have wan - dered As stran - gers in sin And
4. As chil - dren of Zi - on, Good tid - ings for us. The
etc.
Actual behavior
Words were extracted in this order:
1. Re 2. We 3. How 4. As Confidently = 84–100 deem know long chil
etc.
Screenshots
N/A
Environment
- pdfplumber version: 0.11.5
- Python version: 3.11.8
- OS: macOS 15.2 Sequoia
Additional context
As a bonus, I would love it if pdfplumber had a setting to automatically detect sheet music and correct the flow of inline lyrics, while still handling the flow of text above and below the sheet music (which may have columns, for example, as in this PDF).
Thank you for providing the PDF and reproducible code, @samuelbradshaw! My hunch is that this relates to the grouping on size (of which there are a few variations very close to 9) and fontname, because when I skip that part and instead filter out the musical notation with page.filter(...), I get the ordering I think you'd expect:
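# Assumes `pdf` is an open pdfplumber PDF, as in the snippet above
music_notation_fonts = ['EngraverTextT', 'Maestro', 'DingPI']

def test(obj):
    # Keep objects set in the lyric font; objects without a fontname
    # (lines, rects, etc.) fall back to the default and are kept too
    return "PalatinoldsLat" in obj.get("fontname", "PalatinoldsLat")

filtered = pdf.pages[0].filter(test)
words = filtered.extract_words()
for word in words:
    print(word["text"])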
Redeemer
of
Israel
6
Confidently
=
84–100
1.
Re
-
deem
-
er
of
Is
-
rael,
Our
on
-
ly
de
-
light,
On
2.
We
know
he
is
com
-
...
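If you still need fontname and size on each word, one possible workaround is to keep extra_attrs and re-sort the returned words by position afterwards. Below is a minimal sketch, not pdfplumber's own ordering logic: it assumes each word dict carries the "top" and "x0" keys that extract_words() returns, and the helper name sort_words_by_row and its y_tolerance parameter are hypothetical. The sample coordinates are made up for illustration and do not come from the PDF above.

```python
def sort_words_by_row(words, y_tolerance=3):
    """Reorder word dicts top-to-bottom by row, then left-to-right.

    Hypothetical helper: words whose "top" lies within y_tolerance of the
    first word in the current row are treated as being on the same row.
    """
    rows = []
    for word in sorted(words, key=lambda w: w["top"]):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Within each row, read left to right
    return [w for row in rows for w in sorted(row, key=lambda w: w["x0"])]


# Made-up sample in the column-first order described in the report
sample_words = [
    {"text": "1.", "x0": 10, "top": 100.0},
    {"text": "2.", "x0": 10, "top": 112.0},
    {"text": "Re", "x0": 30, "top": 100.4},
    {"text": "We", "x0": 30, "top": 112.2},
    {"text": "deem", "x0": 60, "top": 99.8},
    {"text": "know", "x0": 60, "top": 111.9},
]

ordered = sort_words_by_row(sample_words)
print(" ".join(w["text"] for w in ordered))
# 1. Re deem 2. We know
```

The tolerance would likely need tuning per score, since hyphens and syllables in the same lyric line can sit at slightly different heights, and text above or below the music may need a different value.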