Investigate complex scripts in PoDoFo PDF export

Open Shreeshrii opened this issue 7 years ago • 59 comments

The pdf output is not correct for Devanagari script when using the 3.2.3 experimental version for tesseract 4.0.0alpha.

Please see attached zip file with input image, text, hocr and pdf output.

If I copy the text from pdf and paste in notepad++, the rendering is correct. However rendering in the pdf file itself is incorrect.

skanda700test.zip

Shreeshrii avatar Jan 11 '18 11:01 Shreeshrii

I fear this is a general issue with PoDoFo and complex scripts, i.e. more work is needed to have PoDoFo handle these correctly.

manisandro avatar Jan 11 '18 11:01 manisandro

Actually, isn't it just a matter of picking the right font? I tried with a test image you sent me a while ago, installed the Lohit Devanagari font, selected that font for PDF export, and the output looks reasonable (from what I can judge), see attachment.

devanagari-text.pdf

manisandro avatar Jan 11 '18 14:01 manisandro

It should work correctly with any Devanagari Unicode font.

The problem is not the font; rather, it is the complex script rendering. In Devanagari, certain combining marks are reordered. Also, multiple consonants together give rise to different glyphs.

PoDoFo exported pdf has letters overlapping each other. The combining mark for i maatraa is not getting reordered to before the consonant - see lines 2 and 3.
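To make the reordering concrete: Unicode stores the vowel sign i (U+093F) after its consonant, but it is drawn before it. Below is a toy sketch in C++, assuming exactly one consonant precedes the sign. The function name `toVisualOrder` is made up for illustration. A real shaping engine such as HarfBuzz moves the sign in front of the whole consonant cluster and also selects conjunct glyphs, which this sketch does not attempt.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Toy illustration only (hypothetical helper, not PoDoFo or tesseract code):
// move DEVANAGARI VOWEL SIGN I (U+093F) in front of the single code point
// that precedes it in logical (storage) order.
std::u32string toVisualOrder(const std::u32string& logical) {
    const char32_t kVowelSignI = 0x093F;
    std::u32string visual = logical;
    for (std::size_t i = 1; i < visual.size(); ++i) {
        if (visual[i] == kVowelSignI) {
            // Swap the sign with the preceding code point.
            std::swap(visual[i - 1], visual[i]);
        }
    }
    return visual;
}
```

A PDF writer that draws the code points in storage order, without a step like this, produces the misplaced i maatraa described above.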

I copied the text from the pdf you posted above into notepad++ and then printed it as pdf (in Lohit Devanagari font) so that it is easy to compare.

Please see attached.

devanagari-text-lohit-notepad.pdf

Shreeshrii avatar Jan 11 '18 14:01 Shreeshrii

Ah I see. Do you have any idea how tesseract handles this?

manisandro avatar Jan 11 '18 14:01 manisandro

I think Cairo, Pango, Harfbuzz etc provide the support.

I had done a search in podofo archives earlier today, the only ref I found related to this is in the thread https://sourceforge.net/p/podofo/mailman/message/32425071/ As of 2014, it seemed that podofo did not support this.

Shreeshrii avatar Jan 11 '18 14:01 Shreeshrii

Yeah I read the same thread - as I read it, PoDoFo isn't capable of handling it for you, but it should be possible to handle it with custom code outside of PoDoFo.

manisandro avatar Jan 11 '18 14:01 manisandro

But looking at the tesseract source, in particular pdfrenderer.cpp, I see no traces of pango or harfbuzz. It would be sufficient to figure out the low-level blocks that tesseract adds to the PDF, I can then just also write low-level blocks via PoDoFo instead of using the DrawText method I suppose.

manisandro avatar Jan 11 '18 15:01 manisandro

https://github.com/phuang/pango https://www.cairographics.org/

Take a look at stringrenderer

https://github.com/tesseract-ocr/tesseract/blob/c773eb5784a9b895008240f23054d2ff916786a5/training/stringrenderer.cpp

Shreeshrii avatar Jan 11 '18 15:01 Shreeshrii

Okay I'll take a look when I find a moment.

manisandro avatar Jan 11 '18 15:01 manisandro

Maybe it will work better with the Qtprinter.

http://doc.qt.io/qt-5/internationalization.html


Shreeshrii avatar Jan 11 '18 15:01 Shreeshrii

@Shreeshrii I've added a QPrinter backend for PDF export, please give it a try.

manisandro avatar Feb 01 '18 00:02 manisandro

@manisandro Thanks for addressing this issue. Do you have a windows binary that I can test? I am on windows 10.

Shreeshrii avatar Feb 01 '18 03:02 Shreeshrii

I tried with a test image you sent me a while ago, installed the Lohit Devanagari font, selected that font for PDF export, and the output looks reasonable (from what I can judge), see attachment.

If it is not possible to provide the windows binary now, please create the test output as you had done before.

Shreeshrii avatar Feb 01 '18 08:02 Shreeshrii

Here you go:

  • 32 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_i686.exe
  • 64 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_x86_64.exe

manisandro avatar Feb 01 '18 09:02 manisandro

Thanks!! It is working great. I tested with Devanagari, san (Sanskrit) and Gurmukhi traineddata files.

I am attaching input files and pdfs from the test.

siddhanta.pdf siddhanta

hin-eng hin-eng.pdf

Shreeshrii avatar Feb 01 '18 11:02 Shreeshrii

Two unrelated items that I noticed:

  1. In HOCR mode, it is not possible to select a section of image for processing. The selection crosshair is displayed but it does not do any selection.

  2. If the podofo printer backend is selected, the pdf is not created, is zero size, or is locked in some manner. If qtprinter is selected after that, the pdf file cannot be opened.

Good to see the export to odt option (is this a new feature?).

Shreeshrii avatar Feb 01 '18 11:02 Shreeshrii

  1. Correct, hOCR is always page based (due to the nature of the hOCR format). While clearly a subset of a document can also be seen as a hOCR page, things get complicated when you have to start merging hOCR documents which represent different portions of the same image.
  2. Need to investigate, might be a regression with the code I introduced last night.

Yes, ODT is indeed new.

Overall testing is very much welcome since I'd like to push out a new release soon.

manisandro avatar Feb 01 '18 11:02 manisandro

  1. Then in HOCR mode the selection crosshair should not be displayed.

  2. I have not tested podofo with an English document, just with these complex script ones. At one point I saw a help text saying that qtprinter should be used for complex scripts, but I am not sure where the cursor was hovering at the time. I couldn't get it to display again.

Would it help to make qtprinter the default choice in the export pdf dialog for complex scripts?

Shreeshrii avatar Feb 01 '18 11:02 Shreeshrii

  1. Valid point
  2. Just click on the hint-icon next to the combobox
  3. I don't know how to reliably detect whether complex scripts are involved.
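On point 3, one rough heuristic (an assumption for illustration, not something gImageReader is known to do) would be to check whether the recognized text contains code points from the Unicode blocks of the scripts discussed in this thread. The function name `looksLikeComplexScript` and the block list are hypothetical, and the list is deliberately incomplete.

```cpp
#include <string>

// Heuristic sketch: returns true if any code point falls into one of the
// listed Unicode blocks. The list only covers scripts mentioned in this
// thread; it says nothing about other scripts that need shaping.
bool looksLikeComplexScript(const std::u32string& text) {
    struct Range { char32_t first; char32_t last; };
    static const Range kBlocks[] = {
        {0x0600, 0x06FF},  // Arabic
        {0x0900, 0x097F},  // Devanagari
        {0x0A00, 0x0A7F},  // Gurmukhi
    };
    for (char32_t cp : text) {
        for (const Range& r : kBlocks) {
            if (cp >= r.first && cp <= r.last) {
                return true;
            }
        }
    }
    return false;
}
```

This only detects the presence of such characters. It cannot tell whether a given PDF backend would render them correctly.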

manisandro avatar Feb 01 '18 11:02 manisandro

When using 'pdf with invisible text overlay', the pdf file size becomes much larger.

E.g. with a 600 dpi image of 512 kb, the resulting pdf is 1175 kb with the default setting of 300 dpi in the export dialog.

hin-eng.pdf hin-eng

Shreeshrii avatar Feb 01 '18 12:02 Shreeshrii

Ah, apropos: I see now that the Windows build is missing some icons, which is why you can't see the hint icon, for example.

manisandro avatar Feb 01 '18 12:02 manisandro

Size: that's the price of QPrinter. Nothing I can do about that. QPrinter internally hard-codes the image compression method to JPEG@94% quality.

manisandro avatar Feb 01 '18 12:02 manisandro

Please see http://doc.qt.io/qt-5/qimagewriter.html

Qt provides the QImageWriter class which supports setting format specific options, such as the gamma level, compression level and quality, prior to storing the image.

Shreeshrii avatar Feb 01 '18 12:02 Shreeshrii

By changing the image options in export 'pdf with invisible text overlay' with qtprinter, the pdf size can be reduced.

I changed the settings from color to monochrome and dpi from 300 to 100.

The resulting pdf size is now 355 kb instead of 1175kb.

hin-eng.pdf

Shreeshrii avatar Feb 01 '18 12:02 Shreeshrii

Please see http://doc.qt.io/qt-5/qimagewriter.html

Sure, but QPrinter does not use QImageWriter

manisandro avatar Feb 01 '18 13:02 manisandro

You could offer an option to change the printer mode as part of the export pdf dialog:

enum PrinterMode { ScreenResolution, PrinterResolution, HighResolution }

Shreeshrii avatar Feb 01 '18 13:02 Shreeshrii

That enum has no effect since it is overridden by the resolution the user chooses.

manisandro avatar Feb 01 '18 13:02 manisandro

I changed the settings from color to monochrome and dpi from 300 to 100. The resulting pdf size is now 355 kb instead of 1175kb.

Changed format to grayscale instead of monochrome and 100 dpi, resulting pdf is 276kb.

Of course, without original image, the pdf size is much smaller, so could be made at 300 dpi.

Shreeshrii avatar Feb 01 '18 13:02 Shreeshrii

For monochrome you really need CCITT/FAX encoding to have a reasonably small file size, but as mentioned, it is not doable with QPrinter.

manisandro avatar Feb 01 '18 13:02 manisandro

Thanks!

Going back to the original issue report and current status:

originally with podofo, for Devanagari script

If I copy the text from pdf and paste in notepad++, the rendering is correct. However rendering in the pdf file itself is incorrect.

currently with podofo, for Devanagari script

pdf is not created/is zero size/locks the pdf file.

currently with qtprinter, for Devanagari script

The rendering in pdf preview and pdf file is correct. Overlapping character problem can be fixed by reducing the font size %. However, when I copy the text from pdf and paste in notepad++, the rendering is incorrect.

Shreeshrii avatar Feb 01 '18 14:02 Shreeshrii

The rendering in pdf preview and pdf file is correct. Overlapping character problem can be fixed by reducing the font size %. However, when I copy the text from pdf and paste in notepad++, the rendering is incorrect.

Well that sucks. I don't think there is anything I can do here... Again, it is QPainter internals.

manisandro avatar Feb 01 '18 14:02 manisandro

Assuming that the regression regarding podofo and Devanagari can be fixed, I think the best option might be to use

  • PoDoFo
  • With invisible text layer pdf
  • With the fax level compression for the image

That way, the visible part of pdf will be correct since it uses the original image.

And, the text layer will be correct (as per earlier test with podofo).

Shreeshrii avatar Feb 01 '18 14:02 Shreeshrii

PoDoFo can definitely be fixed. I'll test it on Windows this evening to see what went wrong, and I'll post a fresh test build as soon as I have fixed things.

It is kinda odd though that the Devanagari script is correctly rendered using QPainter, but is wrong when copying.

manisandro avatar Feb 01 '18 14:02 manisandro

This is a known problem with most pdf writers for complex scripts. The glyphs for combined consonants, reordered combining marks do not get copied correctly from pdfs.

XeTeX, with its support for actual text, renders it correctly, and so does PoDoFo, based on my earlier test.

PDFs created by OpenOffice and LibreOffice have the same problem.

Shreeshrii avatar Feb 01 '18 15:02 Shreeshrii

complex script text can also be copied correctly from pdfs created by tesseract, which use the original image for the visual layer.

Shreeshrii avatar Feb 01 '18 15:02 Shreeshrii

About the PoDoFo locking issue: isn't it just that you have the output PDF open in a PDF viewer or such which is locking the file?

manisandro avatar Feb 01 '18 16:02 manisandro

With PoDoFo

Export to PDF dialog closes but there is no indication whether the export is completed.

When I look in File Manager, it shows a pdf of 0kb.

On refreshing File Manager after a while, pdf file shows up with a size.

When I double click to open it, Adobe Reader gives an error saying file in use or open in another application.

So, it seems to be locked by gimagereader.

Shreeshrii avatar Feb 01 '18 17:02 Shreeshrii

Are you creating a new file when exporting or overwriting an existing one? If the latter, are you sure that file isn't open in another application?

manisandro avatar Feb 01 '18 21:02 manisandro

I've updated the test builds with a couple of fixes, one might be related to the issue you are seeing. Links as usual:

  • 32 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_i686.exe
  • 64 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_x86_64.exe

manisandro avatar Feb 01 '18 22:02 manisandro

Thanks for the prompt test build. It is working fine now, i.e.

  1. HOCR mode, crosshair not being displayed.
  2. Hint-icon is being displayed next to combo-box.
  3. PoDoFo printer is NOT locking the pdf file.

For Devanagari, the export to pdf option that worked well for my test image:

  • PoDoFo for PDF export backend.
  • PDF with invisible text overlay for PDF export output mode.
  • Image Settings:
      • Format: Grayscale
      • DPI: 100
      • Compression: Jpeg
      • Compression quality: 60

The generated pdf is 164 kb. Original image was 512kb at 600 dpi. The Devanagari text can be correctly copied and pasted as text. (If you want to test this for other complex scripts, export to TXT and export to PDF and copy and paste text from that and compare the two files).

Shreeshrii avatar Feb 02 '18 04:02 Shreeshrii

So you are using PoDoFo for "PDF with invisible text overlay" and QPrinter for regular PDF, right?

manisandro avatar Feb 02 '18 08:02 manisandro

Yes, I think both options should be available for users.


Shreeshrii avatar Feb 02 '18 14:02 Shreeshrii

Does the hocr option have a way to display only those words which have low confidence? Might make it easier for correction.

Also, when using different traineddata files in hocr mode, different words get dropped from recognition.

Shreeshrii avatar Feb 02 '18 18:02 Shreeshrii

Does the hocr option have a way to display only those words which have low confidence? Might make it easier for correction.

How would you define "low"?

Also, when using different traineddata files in hocr mode, different words get dropped from recognition.

Not following what you mean.

manisandro avatar Feb 02 '18 18:02 manisandro

  1. Low could be a user defined percentage. I will have to check, but I think for devanagari documents the confidence level was as low as 0 for some words.

  2. Again, this is based on tests with documents in Devanagari script, which can be processed using multiple traineddata files such as devanagari, hin, san, mar and nep. The OCR process drops certain words during recognition. However, different language traineddata files give different results. E.g. a word may be dropped by devanagari but recognised by hin.

This dropping of words might also be related to confidence levels.

A related question: is the hocr demarcation of text blocks etc. a common layout analysis routine, or does it depend on the traineddata?

I will provide an example with samples tomorrow. That will help clarify.

Shreeshrii avatar Feb 02 '18 19:02 Shreeshrii

I'm afraid I can't help much with traineddata issues or with what hocr text tesseract produces. You'll have to take those issues upstream.

manisandro avatar Feb 02 '18 19:02 manisandro

You'll have to take those issues upstream.

That's what I thought. Thanks!

I looked at the word confidence values. They range from 0 to 90+. It would be helpful to have a filter on the conf values, so e.g. a user could choose to look at values below 10%, 20%, 50% - any threshold they choose.

conf_level
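The filter requested above could look roughly like the following sketch. It assumes the word confidences have already been parsed out of the hOCR `x_wconf` values into a simple record. The `OcrWord` struct and the function `wordsBelow` are hypothetical names and are not taken from gImageReader.

```cpp
#include <string>
#include <vector>

// Hypothetical word record; the confidence is the 0-100 value that the hOCR
// output carries as x_wconf.
struct OcrWord {
    std::string text;
    int confidence;
};

// Keep only the words whose confidence is strictly below the threshold the
// user picked (e.g. 10, 20 or 50).
std::vector<OcrWord> wordsBelow(const std::vector<OcrWord>& words, int threshold) {
    std::vector<OcrWord> result;
    for (const OcrWord& w : words) {
        if (w.confidence < threshold) {
            result.push_back(w);
        }
    }
    return result;
}
```

Because the threshold is a plain parameter, the question of how to define "low" is left to the user.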

Shreeshrii avatar Feb 03 '18 02:02 Shreeshrii

I am trying to use PoDoFo for Arabic. Seems letters get reversed in each word.

Hello World becomes olleH dlroW

I suggest adding an option to output RTL text (which reverses the letters in each word).

This matters because QPrinter is not suitable for monochrome documents: an 18 MB original PDF becomes a 1.4 GB PDF with invisible text when using QPrinter.

hocr2pdf is a good alternative and consider also itext library.

bmwmy avatar Mar 05 '18 15:03 bmwmy

hocr2pdf is just a node.js wrapper around tesseract AFAICS, and itext is a proprietary Java/.NET library.

For it to work in gImageReader I'm afraid I currently don't see any other way than actually implementing the missing support for complex scripts in PoDoFo. This though requires thorough knowledge of the PDF spec and time, both of which are currently lacking.

manisandro avatar Mar 05 '18 15:03 manisandro

@bmwmy is the reversal problem also there in the txt and HOCR output of tesseract?

Shreeshrii avatar Mar 05 '18 16:03 Shreeshrii

Related issue - https://github.com/tesseract-ocr/tesseract/issues/238

Arabic language (right to left in writing) stored (left to right) after create PDF Searchable

Shreeshrii avatar Mar 05 '18 17:03 Shreeshrii

Wow, this looks really painful to handle...

manisandro avatar Mar 05 '18 17:03 manisandro

I am trying to reverse every text child in the code. In HOCRPdfExporter.cc, line 729, I am changing

    painter.drawText(wordRect.x() * px2pu, y * px2pu, wordItem->text());

to

    painter.drawText(wordRect.x() * px2pu, y * px2pu, reverseSTR(wordItem->text()));

This should be enough for the Arabic RTL problem; I am not sure about other complex scripts.

But I am having a hard time compiling with Docker!

I'll keep trying.
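For reference, a minimal sketch of what a helper like the `reverseSTR` mentioned above might do. The thread does not show its implementation, so the name `reverseCodePoints` and the use of `std::u32string` are assumptions. Reversing whole code points avoids cutting multi-byte UTF-8 sequences apart, which a byte-wise reversal would do.

```cpp
#include <string>

// Hypothetical stand-in for the reverseSTR helper: returns the code points
// of the input in reverse order. This is a naive reversal, so combining
// marks end up in front of their base letter, and embedded digits or Latin
// words get reversed as well.
std::u32string reverseCodePoints(const std::u32string& text) {
    return std::u32string(text.rbegin(), text.rend());
}
```

Text that mixes directions would need the Unicode Bidirectional Algorithm rather than a plain reversal, and converting to and from the Qt string returned by `wordItem->text()` is left out here.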

bmwmy avatar Mar 07 '18 11:03 bmwmy

What issues are you encountering with docker? Happy to help there.

manisandro avatar Mar 07 '18 11:03 manisandro

Actually, I am using Docker Toolbox on Windows via a VirtualBox VM instance. When I try the first command to build the image, Fedora says a GPG key is missing, or something like that. Can I compile for Windows from Ubuntu using Docker? That was going to be my next attempt.

bmwmy avatar Mar 07 '18 11:03 bmwmy

@Shreeshrii No, the HOCR and plain text outputs are correct.

bmwmy avatar Mar 07 '18 11:03 bmwmy

@bmwmy Looks like some transient issues with the Fedora repos, you can work around it by adding --nogpgcheck to [1], i.e. dnf install -y --nogpgcheck.

Sure, you can use docker on any OS it runs on.

[1] https://github.com/manisandro/gImageReader/blob/master/packaging/win32/Dockerfile#L9

manisandro avatar Mar 07 '18 12:03 manisandro

FYI

Please see attached. It is the output of export to pdf from scribus 1.5.4svn. It seems to have correct Arabic support and loads podofo as one of its components. Not sure if it helps with the hocr2pdf issue.

Related blog post: http://host-oman.blogspot.in/2017/02/first-5-arabic-books-typesetting-in.html


Shreeshrii avatar Mar 07 '18 14:03 Shreeshrii

FYI

It seems that scribus deals with RTL languages differently: as far as I can tell, they store Arabic text in reverse order. There are some ifs in there to detect whether the text is Arabic!

https://github.com/scribusproject/scribus/search?q=arabic&unscoped_q=arabic

bmwmy avatar Jul 01 '18 15:07 bmwmy