
wrong case used in text from pdf

Open clach04 opened this issue 5 years ago • 0 comments

Using the test document attached to issue #264, some (but not all) text is extracted with the wrong case. This results in mixed-case words, sometimes with UPPERCASE letters in the middle of words.

My hunch is that this is related to the ToUnicode mapping table embedded in the PDF. However, two other tools do not exhibit this behavior:

  • Adobe Acrobat Reader DC version 2009.021.20049
  • xpdf pdftotext xpdf-tools-win-4.02.zip from https://www.xpdfreader.com/download.html
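To make the ToUnicode hunch concrete, here is a minimal sketch of how the `bfchar` entries of a ToUnicode CMap could be read into a lookup table, so that character codes mapped to an unexpected case can be spotted by eye. The CMap fragment and the `parse_bfchar` helper are hypothetical and are not taken from FightingGamePrimer.pdf; pulling the real CMap stream out of the PDF is left out, and `bfrange` entries are not handled.

```python
import re

# Hypothetical ToUnicode fragment for illustration only;
# it is NOT extracted from FightingGamePrimer.pdf.
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0046>
<0002> <0067>
endbfchar
"""

BFCHAR_SECTION = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_ENTRY = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} for every bfchar entry."""
    mapping = {}
    for section in BFCHAR_SECTION.findall(cmap_text):
        for src, dst in BFCHAR_ENTRY.findall(section):
            # Destination strings are hex-encoded UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


if __name__ == "__main__":
    for code, text in sorted(parse_bfchar(SAMPLE_CMAP).items()):
        print(f"{code:#06x} -> {text!r}")
```

If the table in the real document maps a glyph that renders as a lowercase letter to an uppercase code point (or the other way round), that would point at the embedded mapping rather than at the extraction logic.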

Here is a sample from the first page (the table of contents also exhibits the problem, but the first page is smaller):

pdftotext.exe -f 1 -l 1 -enc UTF-8 FightingGamePrimer.pdf

py -3 pdf2txt.py -p 1 -o FightingGamePrimer_pdfminer.txt FightingGamePrimer.pdf

Inlining the results below to make it easier to read (FF/FormFeed manually removed):

xpdf pdftotext FightingGamePrimer.txt

From Masher to Master: The Educated Video Game Enthusiast’s
Fighting Game Primer
(Super Book Edition)

Presented by Shoryuken.com

By Patrick Miller

pdfminer pdf2txt FightingGamePrimer_pdfminer.txt

From masher to master: 

the educated Video Game enthusiast’s 

FiGhtinG Game Primer 

(Super Book edition)

Presented by 

shoryuken.com

by Patrick miller

The formatting/newline/whitespace differences are not interesting; it's the case that is significant.

E.g. "From Masher to Master" (expected, matches the PDF) compared with "From masher to master". This is minor.

There is also "FiGhtinG", which is actually hard to read.
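Since only the case matters, the two outputs can be compared word by word while ignoring whitespace. The sketch below is one way to do that; `case_only_differences` is a hypothetical helper name, and it assumes both tools emit the same words in the same order, which holds for the first-page sample above.

```python
def case_only_differences(expected_text, actual_text):
    """Pair words positionally and return those that differ only in case."""
    # split() with no argument collapses newlines and runs of spaces,
    # so layout differences between the two tools are ignored.
    expected_words = expected_text.split()
    actual_words = actual_text.split()
    diffs = []
    for exp, act in zip(expected_words, actual_words):
        if exp != act and exp.casefold() == act.casefold():
            diffs.append((exp, act))
    return diffs


# Title lines copied from the two outputs shown above.
expected = (
    "From Masher to Master: The Educated Video Game Enthusiast’s\n"
    "Fighting Game Primer"
)
actual = (
    "From masher to master: \n\nthe educated Video Game enthusiast’s \n\n"
    "FiGhtinG Game Primer"
)

if __name__ == "__main__":
    for exp, act in case_only_differences(expected, actual):
        print(f"{exp!r} became {act!r}")
```

Running the same comparison over whole files would give a list of affected words, which might help narrow down which letters or fonts are involved.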

Any tips on how I can debug/fix this?

clach04, Nov 09 '19 22:11