pdfminer
Wrong case used in text extracted from PDF
Using the test document attached to issue #264, some (but not all) text is extracted with the wrong case. This results in mixed-case words, sometimes with UPPERCASE letters in the middle of words.
My hunch is that this is related to the ToUnicode mapping table embedded in the PDF. However, two other tools do not exhibit this behavior:
- Adobe Acrobat Reader DC version 2009.021.20049
- xpdf pdftotext xpdf-tools-win-4.02.zip from https://www.xpdfreader.com/download.html
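To check the ToUnicode hunch, one option is to dump the mapping tables and look at which character codes map to upper- and lower-case letters. The following is only a rough sketch, not pdfminer's own decoding path: `parse_bfchar` and `iter_tounicode_streams` are hypothetical helper names, only `bfchar` sections are parsed (`bfrange` is ignored), and it assumes every font's ToUnicode entry is a stream.

```python
import re

# Matches one "<src> <dst>" pair inside a beginbfchar ... endbfchar section.
_BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: unicode string} for all bfchar entries.

    Hypothetical helper: bfrange sections are not handled in this sketch.
    """
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in _BFCHAR_PAIR.findall(section):
            if len(dst) % 4:
                # Not a whole number of UTF-16 code units; skipped here.
                continue
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def iter_tounicode_streams(pdf_path):
    """Yield (page number, font resource name, CMap text) per font.

    Hypothetical helper: assumes each ToUnicode entry is a stream.
    """
    # Imported lazily so parse_bfchar works without pdfminer installed.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1, stream_value

    with open(pdf_path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        for number, page in enumerate(PDFPage.create_pages(doc), start=1):
            resources = resolve1(page.resources) or {}
            fonts = resolve1(resources.get("Font")) or {}
            for name, ref in fonts.items():
                spec = resolve1(ref)
                if "ToUnicode" in spec:
                    data = stream_value(spec["ToUnicode"]).get_data()
                    yield number, name, data.decode("latin-1")


# Possible usage against the test document:
# for number, name, text in iter_tounicode_streams("FightingGamePrimer.pdf"):
#     for code, char in sorted(parse_bfchar(text).items()):
#         print(number, name, hex(code), repr(char))
```

If the table itself maps a code to the correct letter while pdf2txt.py emits the other case, that would point at how the table is read rather than at the PDF.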
Here is a sample from the first page (the table of contents also exhibits the problem, but the first page is smaller):
pdftotext.exe -f 1 -l 1 -enc UTF-8 FightingGamePrimer.pdf
py -3 pdf2txt.py -p 1 -o FightingGamePrimer_pdfminer.txt FightingGamePrimer.pdf
Inlining the results below to make it easier to read (FF/FormFeed manually removed):
xpdf pdftotext (FightingGamePrimer.txt):
From Masher to Master: The Educated Video Game Enthusiast’s
Fighting Game Primer
(Super Book Edition)
Presented by Shoryuken.com
By Patrick Miller
pdfminer pdf2txt (FightingGamePrimer_pdfminer.txt):
From masher to master:
the educated Video Game enthusiast’s
FiGhtinG Game Primer
(Super Book edition)
Presented by
shoryuken.com
by Patrick miller
The formatting/newline/whitespace differences are not interesting; it's the case that is significant.
E.g. "From Masher to Master" (expected, matches the PDF) compared with "From masher to master". This is minor.
A worse example is "FiGhtinG", which is actually hard to read.
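To gauge how widespread the problem is, a quick check like the one below could list words in the pdfminer output that have an uppercase letter directly after a lowercase one. This is a heuristic sketch: `find_mixed_case_words` is a hypothetical helper, it only looks at ASCII letters, it will also flag legitimate names such as "McDonald", and it will not catch words like "masher" that are merely lowercased.

```python
import re


def find_mixed_case_words(text):
    """Return words with an uppercase ASCII letter right after a lowercase one."""
    # Runs of letters only (no digits, underscores, or punctuation).
    words = re.findall(r"[^\W\d_]+", text)
    return [word for word in words if re.search(r"[a-z][A-Z]", word)]


# Possible usage against the pdf2txt.py output:
# with open("FightingGamePrimer_pdfminer.txt", encoding="utf-8") as fp:
#     print(find_mixed_case_words(fp.read()))
```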
Any tips on how I can debug/fix this?