Invisible glyph bounds at wrong positions in PDF

Open THausherr opened this issue 4 years ago • 44 comments

Environment

  • Tesseract Version: Tesseract Open Source OCR Engine v5.0.0.20190623 with Leptonica (downloaded from https://digi.bib.uni-mannheim.de/tesseract/ )
  • Platform: W7 64 bit 6.1.7601

Call:

"C:\Program Files\Tesseract-OCR\tesseract" scan.tif scan-ocr pdf

Current Behavior:

text bounds are not identical to visible glyphs in Adobe Reader. Example:

grafik

Expected Behavior:

text bounds should be identical to visible glyphs in Adobe Reader. In the graphic, the blue color should cover the "n".

Suggested Fix:

I suspect that the /W array is missing in the font dictionary: grafik So Adobe will use the /DW 500 entry (screenshot from PDF 32000 specification): grafik

scan-ocr.pdf scan.tif.zip

THausherr avatar Feb 06 '20 12:02 THausherr

Interesting suggestion. If correct, why would it show up as an n - 1 problem in highlighting?

jbreiden avatar Feb 06 '20 13:02 jbreiden

Sorry, I don't understand what you mean. My argument is that the highlight widths don't match. Adobe gets these from the font data, and widths are different in a proportional font. And it isn't just the "n". When trying to highlight the "I" it looks like this: grafik

THausherr avatar Feb 06 '20 13:02 THausherr

The glyphless font deliberately uses equal width for every character. I stretch the word using Tz in the PDF to make it fit. So I expect word highlighting to look correct, but not character highlighting within a word. This design was chosen to maximize compatibility across all the scripts supported by Tesseract while minimizing complexity.
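
A minimal sketch of that stretch computation, assuming every glyph advances half the font size (matching the /DW 500 entry mentioned earlier) and that the word's target width in points is already known. The function name and parameters are hypothetical, not Tesseract's API:

```cpp
#include <cassert>

// Hypothetical helper: horizontal scaling (the Tz operand, in percent)
// needed to make `num_chars` fixed-width glyphs span `word_width_pt`.
// Assumes each glyph advances `glyph_width_em` of the font size
// (0.5 corresponds to a default width of 500 in 1/1000 em units).
double HorizontalStretchPercent(double word_width_pt, double font_size_pt,
                                int num_chars, double glyph_width_em = 0.5) {
  assert(num_chars > 0 && font_size_pt > 0.0);
  const double natural_width_pt = num_chars * glyph_width_em * font_size_pt;
  return 100.0 * word_width_pt / natural_width_pt;
}
```

For example, 12 glyphs at 11 pt have a natural width of 66 pt, so a word that measures 99 pt on the page would need a stretch of 150. Because one stretch applies to the whole word, the word box fits but the individual glyph boxes inside it generally do not.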

jbreiden avatar Feb 06 '20 14:02 jbreiden

I had a look with the glyph contour display of PDFBox and there it matches the word bounds: grafik

So maybe Adobe is to blame, but users will of course see this differently :-(

THausherr avatar Feb 06 '20 14:02 THausherr

I think I found a bit more... "Introduction" has 12 characters but looks like this in the PDF content stream:

1 0 0 1 77.76 738.16 Tm /f-0-0 11 Tf 107.076 Tz [ <0049006E00740072006F00640075006300740069006F006E0020> ] TJ

This is 13 characters. The last one (0020) is a space. This space is positioned over the final "n".
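
The count can be checked mechanically. A small sketch, assuming each character code in the hex string is two bytes (four hex digits); the function name is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Splits the body of a PDF hex string (without the angle brackets) into
// 16-bit codes, four hex digits per code.
std::vector<uint16_t> DecodeHexCodes(const std::string& hex) {
  assert(hex.size() % 4 == 0);
  std::vector<uint16_t> codes;
  for (std::size_t i = 0; i < hex.size(); i += 4) {
    // std::stoul with base 16 parses the four-digit group.
    codes.push_back(
        static_cast<uint16_t>(std::stoul(hex.substr(i, 4), nullptr, 16)));
  }
  return codes;
}
```

Applied to the string from the content stream, this yields 13 codes, the last of which is 0x0020.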

THausherr avatar Feb 06 '20 14:02 THausherr

When removing "3 Tr" so that the "invisible" font gets visible, it looks like this: grafik This is really 13 characters. For some reason, Adobe doesn't want to mark the final space.

THausherr avatar Feb 06 '20 14:02 THausherr

I just see that the PDFBox screenshot shows it too: "ISO" has 4 characters, "32000" has 6 characters.

Maybe the original idea was to put the space there for text extraction? However it isn't needed, good text extractors "imagine" the space from the position differences.

If the space character is needed, then it should be positioned over the actual space.

THausherr avatar Feb 06 '20 14:02 THausherr

https://github.com/tesseract-ocr/tesseract/issues/1900

amitdo avatar Feb 06 '20 18:02 amitdo

Thanks, after reading that one, I think this issue is also somewhat duplicate of https://github.com/jbarlow83/OCRmyPDF/issues/450 .

THausherr avatar Feb 06 '20 19:02 THausherr

You should check the bounding box of the whole word 'Introduction' with the hocr format. Does it also end before the last glyph?

amitdo avatar Feb 06 '20 20:02 amitdo

Tesseract's recognizer just finds words, and doesn't tell us anything about spaces. Which makes sense: how would an OCR program know if there is one space, two spaces, etc? We add the space in during PDF generation to help some viewers with copy-paste; otherwise it is common for words to run together. Apple's viewer is notorious for this. I'm a little reluctant to put a space outside the word bounding box - there is no guarantee there will be room for it, and I don't really want the PDF output module to get into the layout analysis game. One possibility might be to play with the font such that U+0020 gets zero (or near-zero) width, while every other character maintains the same fixed width we've always had. Then adjust the Tz word stretch appropriately.

https://github.com/tesseract-ocr/tesseract/blob/master/src/api/pdfrenderer.cpp#L471

I haven't touched the font in a while, so not sure how easy it is to make a change like this. If you want to play with this yourself, I recommend using the program "ttx" from fonttools to transform the font into an XML file. Edit the file, then transform it back. I have a feeling it won't be trivial but it might be possible. See also the design discussion at the top of pdfrenderer.cpp, which explains how everything works.

jbreiden avatar Feb 07 '20 01:02 jbreiden

Yeah I understand that this feature was implemented to "help" low quality text extractors.

How about making the feature configurable for PDF? IMHO the majority user expectation is whatever Adobe does, that is the gold standard.

Zero width space also sounds like an interesting idea to explore. You probably have to add appropriate /W entries.

(The reason I created this issue: we're using a commercial OCR tool on a project that is growing fast. The OCR is fine, but licensing is a pain, it doesn't use all CPU cores, and the logging is almost non-existent; the whole thing is a black box. So I was thinking about replacing it with tesseract, but before we discuss this with the client I need to be sure that the client would be satisfied, and its clients too.)

THausherr avatar Feb 07 '20 05:02 THausherr

@amitdo The bounding box is correct:

   <div class='ocr_carea' id='block_1_2' title="bbox 324 400 643 442">
    <p class='ocr_par' id='par_1_2' lang='eng' title="bbox 324 400 643 442">
     <span class='ocr_line' id='line_1_2' title="bbox 324 400 643 442; baseline 0 -1; x_size 47.393444; x_descenders 6.3934426; x_ascenders 11">
      <span class='ocrx_word' id='word_1_3' title='bbox 324 400 643 442; x_wconf 95'>Introduction</span>
     </span>
    </p>
   </div>

THausherr avatar Feb 07 '20 09:02 THausherr

Adobe Acrobat is not as popular as it used to be 10 years ago.

Default PDF viewers:

  • Windows 10 - Chromium based Edge - Pdfium
  • macOS - Preview
  • ChromeOS - Pdfium
  • Chromium / Chrome / Edge - Pdfium
  • Firefox - pdf.js
  • Linux - Evince/Okular (Poppler)

So most users will use the built-in PDF viewer of their OS or browser, none of which is Adobe's viewer.

The best solution is to find a method that works in all these viewers, without a special parameter for a specific viewer.

amitdo avatar Feb 07 '20 14:02 amitdo

I tested your pdf file with Chromium (pdfium), Firefox (pdf.js) and Evince (poppler).

The word bounding boxes look very good when the page is viewed with pdfium/pdf.js.

Poppler suffers from the same issue you raised above combined with a 'zebra effect'.

amitdo avatar Feb 07 '20 15:02 amitdo

With PDF.js on Firefox, double click marks the whole word. When I mark the final "n", I get a space.

With Chrome, double click shows the same effect as with Adobe Reader.

With MS Edge, same effect as with PDF.js.

THausherr avatar Feb 08 '20 05:02 THausherr

I took a look at the code. It looks like one can pretty easily remap U+0020 to an alternate glyph in the cidtogmap. It's been five years since the last significant change, and my memory is terrible, but I'm confident we currently map everything down to a single "glyph" in the font. That slightly misleading code at line 549 of pdfrenderer.cpp is just filling out the 2 byte entries one byte at a time.

So then there's the question of adding another glyph to the font. The design notes from Ken say we've got an unused glyph at index 0. Unused because it gives heartburn to the Adobe parser. And then one at index one which is used everywhere. It's not quite trivial, but I don't yet see any reason we can't add another entry at index 2 that is identical or near to the entry in index 1. This means transforming the font to xml using ttx from fonttools, doing some careful copy pasting, transforming it back, and hoping nothing too scary happens.

Next there is the question of assigning the zero width (or near zero width) to just that new entry. As of right now, I'm not sure exactly how to do that. But I think Tilman's suggestion of adding a /W array to the /CIDFont dictionary is the first thing to try. (Currently line 526 in pdfrenderer.cpp). There's probably a spot inside the font as well to specify width, that we'll want to also set, for consistency, compatibility, and minimal confusion.
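
A sketch of the table layout being described, assuming two bytes per CID with the high byte first, so that CID c occupies bytes 2*c and 2*c + 1. The function names are hypothetical and this is not the code in pdfrenderer.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a CIDToGIDMap body in which every CID maps to `default_gid`.
// Each entry is a big-endian 16-bit glyph index.
std::vector<unsigned char> BuildCidToGidMap(std::size_t num_cids,
                                            uint16_t default_gid) {
  std::vector<unsigned char> map(2 * num_cids);
  for (std::size_t cid = 0; cid < num_cids; ++cid) {
    map[2 * cid] = static_cast<unsigned char>(default_gid >> 8);        // high byte
    map[2 * cid + 1] = static_cast<unsigned char>(default_gid & 0xFF);  // low byte
  }
  return map;
}

// Points a single CID at a different glyph index.
void RemapCid(std::vector<unsigned char>& map, std::size_t cid, uint16_t gid) {
  assert(2 * cid + 1 < map.size());
  map[2 * cid] = static_cast<unsigned char>(gid >> 8);
  map[2 * cid + 1] = static_cast<unsigned char>(gid & 0xFF);
}
```

With a default glyph index of 1, the bytes alternate 0, 1, 0, 1, and so on, which is what filling the array one byte at a time with `(i % 2) ? 1 : 0` produces. Remapping the space would then touch only the two bytes at offsets 2 * 0x0020 and 2 * 0x0020 + 1.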

Finally, I already mentioned that the bounding box stretch can be computed without considering the U+0020, which is basically removing line 471 from pdfrenderer.cpp. After that - if it works at all - then just compatibility testing with various renderers.

I really don't know if this will work or not, but there's a chance, and it's my best suggestion for what to try. Might make sense to contact Ken Sharp and see if he has an opinion on the topic. Tilman, I know it's a lot of work but if you want to try this, you will probably get it done significantly faster than me. (Unlike 5 years ago, my day job does not currently intersect with PDF. That doesn't totally stop me, but it does slow things down quite a lot.)

jbreiden avatar Feb 08 '20 06:02 jbreiden

Thanks for the nice comment; my problem is that I haven't done C/C++ for almost 10 years except maintenance of my existing software. I don't even have a dev system up that supports current language standards, so I would have to install / understand / learn that first. However, I'll keep this issue in mind when I have more time at work (because this is a work issue).

THausherr avatar Feb 09 '20 14:02 THausherr

@jbarlow83, maybe you can help us here.

amitdo avatar Feb 09 '20 18:02 amitdo

I'll spend a little time right now and see what I can do.

jbreiden avatar Feb 09 '20 18:02 jbreiden

I tried the simplest thing possible, leaving the font alone and trying to use that glyph at index 0. I expected Adobe Reader to completely choke, and Pdfium/Chrome to work great. Instead, my ancient copy of Adobe Reader 9.5.5 (i.e. the one for Linux) works fine. However, Pdfium/Chrome is highlighting beyond the end of the word. That's what you would expect if Pdfium was ignoring the zero width on index 0.

--- pdfrenderer.cpp.orig	2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp	2020-02-09 11:18:40.578544848 -0800
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 0 [0 500] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -546,6 +548,8 @@
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
     cidtogidmap[i] = (i % 2) ? 1 : 0;
   }
+  const int kSpaceCID = 20;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

jbreiden avatar Feb 09 '20 19:02 jbreiden

Alternative: (tesseract hocr) + (hocr-pdf (https://github.com/ImageProcessing-ElectronicPublications/hocr-tools)).

zvezdochiot avatar Feb 09 '20 19:02 zvezdochiot

Tried modifying the font to add a specific entry for U+0020. Same results, Adobe good, pdfium bad. This is the point where I pause, and people take a look for mistakes. If nobody finds anything, the next step is probably asking for help. That's Ken Sharp about the overall approach & especially the font, and Pdfium folks to help debug why the /W entry does not appear to be honored.

--- pdfrenderer.cpp.orig	2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp	2020-02-09 12:00:57.961541649 -0800
@@ -468,7 +468,6 @@
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch =
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 1 [500 1] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
   const std::unique_ptr<unsigned char[]> cidtogidmap(
       new unsigned char[kCIDToGIDMapSize]);
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 20;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x02;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

debug.pdf font.zip

jbreiden3 avatar Feb 09 '20 20:02 jbreiden3

@amitdo I will look.

I'd consider using a separate Tz for the trailing space rather than modifying the font.

1.0 Tz [ <0049006E00740072006F00640075006300740069006F006E> ] TJ
0.001 Tz [ <0020> ] TJ

Seems like it would be simpler and less reliant on fonts being parsed correctly.

However I do think some artifact of the glyphlessfont is causing trouble, since using a hidden Arial (e.g. the hOCR transform method) does not have these problems for the same content stream.
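
The separate-Tz idea could be emitted along these lines. This is only a sketch with a hypothetical function name; the Tz operand is a percentage of normal width, and whether every viewer handles the narrow space as hoped is untested here:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Emits a word and its trailing space as two TJ operations, each with
// its own horizontal scaling (Tz operand, in percent).
std::string WordThenNarrowSpace(const std::string& word_hex,
                                double word_tz, double space_tz) {
  assert(word_tz > 0.0 && space_tz > 0.0);
  std::ostringstream out;
  out << word_tz << " Tz [ <" << word_hex << "> ] TJ\n"
      << space_tz << " Tz [ <0020> ] TJ\n";
  return out.str();
}
```

The word keeps whatever stretch makes it fit its bounding box, and the space gets a scale small enough that it adds almost no width, without any change to the font.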

jbarlow83 avatar Feb 09 '20 20:02 jbarlow83

The /W entry as it is now grafik means CID 1 has a width of 500, CID 2 has a width of 1. I assume that all others have default width (500). If you wanted to change the width of space, then you should have done something for CID 32.
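
To illustrate that reading of /W, here is a small lookup sketch. It handles only the `first [w0 w1 ...]` form, where widths go to consecutive CIDs starting at `first`, and falls back to /DW otherwise. The type and function names are hypothetical:

```cpp
#include <cassert>
#include <vector>

// One /W entry of the form `first [w0 w1 ...]`.
struct WidthRun {
  int first_cid;
  std::vector<int> widths;
};

// Returns the width of `cid`, or `default_width` (the /DW value) when no
// run covers it.
int WidthForCid(const std::vector<WidthRun>& runs, int default_width, int cid) {
  assert(cid >= 0);
  for (const WidthRun& run : runs) {
    const int offset = cid - run.first_cid;
    if (offset >= 0 && offset < static_cast<int>(run.widths.size())) {
      return run.widths[offset];
    }
  }
  return default_width;
}
```

With `/W [ 1 [500 1] ]` and `/DW 500`, CID 1 gets 500, CID 2 gets 1, and CID 32 (the space, 0x0020) still gets the default 500. An entry such as `32 [0]` would be needed to change the space itself.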

THausherr avatar Feb 09 '20 20:02 THausherr

You are correct. Result works on both Acroread & Pdfium. File attached and ready for compatibility testing. If nobody finds trouble, I'm comfortable submitting. This variant makes no changes to the font, and sets the width of space to zero.

--- pdfrenderer.cpp.orig	2019-07-07 08:23:24.000000000 -0700
+++ pdfrenderer.cpp	2020-02-09 13:26:33.743553816 -0800
@@ -468,7 +468,6 @@
     } while (!res_it->Empty(RIL_BLOCK) && !res_it->IsAtBeginningOf(RIL_WORD));
     if (res_it->IsAtBeginningOf(RIL_WORD)) {
       pdf_word += "0020";
-      pdf_word_len++;
     }
     if (word_length > 0 && pdf_word_len > 0) {
       double h_stretch =
@@ -535,6 +536,7 @@
     "  /Subtype /CIDFontType2\n"
     "  /Type /Font\n"
     "  /DW " << (1000 / kCharWidth) << "\n"
+    "  /W [ 32 [0] ]\n"
     ">>\n"
     "endobj\n";
   AppendPDFObject(stream.str().c_str());
@@ -544,8 +546,11 @@
   const std::unique_ptr<unsigned char[]> cidtogidmap(
       new unsigned char[kCIDToGIDMapSize]);
   for (int i = 0; i < kCIDToGIDMapSize; i++) {
-    cidtogidmap[i] = (i % 2) ? 1 : 0;
+    cidtogidmap[i] = (i % 2) ? 0x01 : 0x00;
   }
+  const int kSpaceCID = 0x0020;
+  cidtogidmap[kSpaceCID * 2] = 0x00;
+  cidtogidmap[kSpaceCID * 2 + 1] = 0x00;
   size_t len;
   unsigned char *comp = zlibCompress(cidtogidmap.get(), kCIDToGIDMapSize, &len);
   stream.str("");

testme1.pdf

jbreiden3 avatar Feb 09 '20 21:02 jbreiden3

@jbarlow83 The problem with hidden Arial is coverage. Tesseract supports the entire basic multilingual plane and beyond. The glyphless font is equally happy with Cherokee and English.

jbreiden3 avatar Feb 09 '20 21:02 jbreiden3

Chromium, Evince - the page looks good. Firefox - no effect, the issue still exists.

amitdo avatar Feb 09 '20 21:02 amitdo

Thank you, I just tested with "testme1.pdf" and double clicking "Introduction" in Firefox works fine. But doing the same for "digital" highlights at the wrong place, and even more so for "enable". This may be a Firefox bug; I think I reported such a bug myself a long time ago. PDFBox shows correct glyph bounds.

Chrome and Edge work fine.

However PDFBox reports a warning "No glyph for code 32 (CID 0020) in font GlyphLessFont", which didn't happen with the original file. But this is just a minor inconvenience.

THausherr avatar Feb 10 '20 04:02 THausherr

Firefox has never worked well with Tesseract PDF. Does this change make it worse?

https://github.com/mozilla/pdf.js/issues/6509

https://github.com/mozilla/pdf.js/issues/6863

I suppose we should also check samples in vertical Japanese, right to left Arabic, and bidirectional to see if there are any regressions. Plus PDF parsers running on non-Linux operating systems.

jbreiden3 avatar Feb 10 '20 05:02 jbreiden3