Problem copying text from PDFs

Open nickbe opened this issue 9 years ago • 11 comments

Attached is an example PDF from which I cannot copy text properly. The PDF has been tested with Foxit and Adobe, and there seems to be no problem in the PDF itself. But Sumatra copies a space after each character and loses the original spaces along the way. The result looks like this:

l a n g u a g e s c n i e n c c s i s g r a m m a t i c a l o r u n g r a m m a r i r o l . W e a t t e m p t t u t r a i n n e u r a l n e t w o r k , w i t h o u t t h e b i f u r c a - t i o n i n t o l e a m u d v s . i n n a t e c o m p o l r a n a s s u m e d b y C l i o m < l < y , t o p r o d u c c t h e s o m e ) u d g m e n t s u s n a t i v e s p e a k e r s o n s h a r p l y g m m m n r i c a l / u n g r a m m a i i c n i d a r n . O n l y r e c u r r e n t l w u r a l n e t w o r k s a r e i n v e s t i g a t e d

output.pdf

nickbe avatar Jun 01 '16 17:06 nickbe

I am also unable to search for words in the PDF. I am encountering a similar issue with the attached file. Possibly related to #547? example.pdf

edubya avatar Aug 04 '16 19:08 edubya

pdf.js shows similar issues, although text is garbled worse in SumatraPDF.

So far I have only noticed this for PDFs generated by tesseract (#382) on Windows.

edubya avatar Aug 05 '16 19:08 edubya

I think this is related to issue #373 as well.

ebogaard avatar Aug 09 '16 09:08 ebogaard

@ebogaard Did you mean tesseract #373 ?

edubya avatar Aug 09 '16 10:08 edubya

Whoops, my mistake. Yes, I meant the one from tesseract.

ebogaard avatar Aug 09 '16 10:08 ebogaard

A shot in the dark here, but I think it has to do with the "glyphLessFont" used as an "overlay" for the PDF. Other OCR engines use a hidden underlay of regular fonts (Times, Arial, etc.). I have the same "space between characters" problem, which ruins searching. (And I know that font-over-image PDFs work in other viewers.)
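One rough way to check whether a given file involves the font mentioned above is to scan its raw bytes for the name. This is only a sketch under a strong assumption: that the font name is stored uncompressed in the file, which may not hold for every PDF. The function names are hypothetical, and a negative result proves nothing.

```python
from pathlib import Path


def mentions_glyphless_font(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF bytes contain the font name.

    Assumption: the name appears uncompressed; names hidden inside
    compressed object streams will not be found by this scan.
    The comparison is case-insensitive because the exact casing of
    the name is not confirmed in this thread.
    """
    return b"glyphlessfont" in pdf_bytes.lower()


def file_mentions_glyphless_font(path: str) -> bool:
    """Read a file from disk and apply the byte scan above."""
    return mentions_glyphless_font(Path(path).read_bytes())
```

For example, `file_mentions_glyphless_font("output.pdf")` could be run against the attachment from the first post, assuming it has been saved locally under that name.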

J-P- avatar Aug 12 '16 21:08 J-P-

In the tesseract-project I referred to this issue, so I hope they're reading this suggestion. But as far as I know, they're working on a fix already. I'm not sure what that fix entails, though.

ebogaard avatar Aug 16 '16 11:08 ebogaard

Any update yet? Using OCR and Sumatra together is no fun at the moment.

firesoft-de avatar Sep 27 '19 08:09 firesoft-de

This type of file was a problem with 3.1.2. The newer SumatraPDF pre-release has fewer problems with OCR output. It is still not perfect, but it does better when the OCR editor outputs simple text.

Here I simply opened the "example.pdf" above in PDFXedit, told it to redo a basic OCR with tesseract, and copied the result from SumatraPDF to Notepad. There are two very minor space-related discrepancies, but spaces between characters is not one of them: 544 example OCR using tesseract.pdf.

image

  1. A line feed has been introduced between the page number and the header. This appears to come from SumatraPDF, since it is not introduced in a copy from Adobe Acrobat.

  2. There is no paragraph indent, which is normal, as it is the same in a copy from Adobe Acrobat.

GitHubRulesOK avatar Feb 03 '20 02:02 GitHubRulesOK

@kjk I am unsure how much the different problems generated by OCR output are related. One common problem, as described for the first example, is a space between characters, which makes words unsearchable. A second common problem is that a larger than normal space between words (usually when the text is justified) produces what look like extra line feeds or paragraph spaces. Here is an example where copied words each appear on their own line: book1.pdf
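The first problem can be spotted automatically in copied text. Below is a minimal sketch of a heuristic that flags text in which most whitespace-separated tokens are single characters. The function names and the 0.8 threshold are assumptions chosen for illustration, not anything SumatraPDF or tesseract does. It only detects the symptom; because the original word spaces are lost, it cannot restore the words.

```python
def single_char_ratio(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(len(token) == 1 for token in tokens) / len(tokens)


def looks_letter_spaced(text: str, threshold: float = 0.8) -> bool:
    """Heuristic: True when at least `threshold` of the tokens are single characters.

    The default threshold is an arbitrary assumption, not a tuned value.
    """
    return single_char_ratio(text) >= threshold


# A fragment in the style of the copied output from the first post,
# and a normally spaced sentence for comparison.
sample_bad = "l a n g u a g e s c n i e n c c s i s g r a m m a t i c a l"
sample_ok = "Only recurrent neural networks are investigated"

print(looks_letter_spaced(sample_bad))
print(looks_letter_spaced(sample_ok))
```

A check like this could be run over text copied from a batch of OCR'd files to find the ones worth re-running through OCR.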

GitHubRulesOK avatar Mar 15 '20 22:03 GitHubRulesOK

Just a comment that the OCR in the sample for this thread is so poor that the spaces issue matters less than the very poor text output itself (as shown below in a text extraction). The best result would come from removing the bad OCR and replacing it with a better one.

Image

GitHubRulesOK avatar Nov 23 '25 02:11 GitHubRulesOK