gImageReader
Investigate complex scripts in PoDoFo PDF export
The pdf output is not correct for Devanagari script when using the 3.2.3 experimental version for tesseract 4.0.0alpha.
Please see attached zip file with input image, text, hocr and pdf output.
If I copy the text from pdf and paste in notepad++, the rendering is correct. However rendering in the pdf file itself is incorrect.
I fear this is a general issue with PoDoFo and complex scripts, i.e. more work is needed to have PoDoFo handle these correctly.
Actually, isn't it just a matter of picking the right font? I tried with a test image you sent me a while ago, installed the Lohit Devanagari font, selected that font for PDF export, and the output looks reasonable (from what I can judge), see attachment.
It should work correctly with any Devanagari Unicode font.
The problem is not the font, rather it is the complex script rendering. In Devanagari there is reordering of certain combining marks. Also, multiple consonants together give rise to different glyphs.
PoDoFo exported pdf has letters overlapping each other. The combining mark for i maatraa is not getting reordered to before the consonant - see lines 2 and 3.
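To illustrate the reordering described above, here is a minimal Python sketch (not part of gImageReader or PoDoFo; the `describe` helper is a name made up for this illustration). It lists the code points of a Devanagari syllable in stored order. In कि the vowel sign i is stored after the consonant, while a shaping engine draws it before the consonant, so a PDF writer that places glyphs in stored order without shaping will show it on the wrong side.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs in logical (stored) order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# कि: the consonant KA followed by VOWEL SIGN I (drawn to the LEFT of KA).
for code_point, name in describe("\u0915\u093F"):
    print(code_point, name)

# क्ष: KA + VIRAMA + SSA, which a shaping engine renders as one conjunct glyph.
for code_point, name in describe("\u0915\u094D\u0937"):
    print(code_point, name)
```

Copying such text out of a PDF gives the right result only if the file maps the drawn glyphs back to this stored order.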
I copied the text from the pdf you posted above into notepad++ and then printed it as pdf (in Lohit Devanagari font) so that it is easy to compare.
Please see attached.
Ah I see. Do you have any idea how tesseract handles this?
I think Cairo, Pango, Harfbuzz etc provide the support.
I had done a search in podofo archives earlier today, the only ref I found related to this is in the thread https://sourceforge.net/p/podofo/mailman/message/32425071/ As of 2014, it seemed that podofo did not support this.
Yeah I read the same thread - as I read it, PoDoFo isn't capable of handling it for you, but it should be possible to handle it with custom code outside of PoDoFo.
But looking at the tesseract source, in particular pdfrenderer.cpp, I see no traces of pango or harfbuzz. It would be sufficient to figure out the low-level blocks that tesseract adds to the PDF, I can then just also write low-level blocks via PoDoFo instead of using the DrawText method I suppose.
https://github.com/phuang/pango https://www.cairographics.org/
Take a look at stringrenderer
https://github.com/tesseract-ocr/tesseract/blob/c773eb5784a9b895008240f23054d2ff916786a5/training/stringrenderer.cpp
Okay I'll take a look when I find a moment.
Maybe it will work better with the Qtprinter.
http://doc.qt.io/qt-5/internationalization.html
ShreeDevi
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
On Thu, Jan 11, 2018 at 8:45 PM, Sandro Mani [email protected] wrote:
Okay I'll take a look when I find a moment.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/manisandro/gImageReader/issues/291#issuecomment-356963831, or mute the thread https://github.com/notifications/unsubscribe-auth/AE2_oyT8NRRi7iams6ovQ_L5M2E10ZQgks5tJiV6gaJpZM4Rat5f .
@Shreeshrii I've added a QPrinter backend for PDF export, please give it a try.
@manisandro Thanks for addressing this issue. Do you have a windows binary that I can test? I am on windows 10.
> I tried with a test image you sent me a while ago, installed the Lohit Devanagari font, selected that font for PDF export, and the output looks reasonable (from what I can judge), see attachment.
If it is not possible to provide the windows binary now, please create the test output as you had done before.
Here you go:
- 32 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_i686.exe
- 64 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_x86_64.exe
Thanks!! It is working great. I tested with Devanagari, san (Sanskrit) and Gurmukhi traineddata files.
I am attaching input files and pdfs from the test.
Two unrelated items that I noticed:
- In HOCR mode, it is not possible to select a section of the image for processing. The selection crosshair is displayed but it does not make any selection.
- If the podofo printer backend is selected, the pdf is not created, is zero size, or is locked in some manner. If qtprinter is selected after that, the pdf file cannot be opened.
Good to see the export to odt option (is this a new feature?).
- Correct, hOCR is always page based (due to the nature of the hOCR format). While clearly a subset of a document can also be seen as a hOCR page, things get complicated when you have to start merging hOCR documents which represent different portions of the same image.
- Need to investigate, might be a regression with the code I introduced last night.
Yes, ODT is indeed new.
Overall testing is very much welcome since I'd like to push out a new release soon.
- Then in HOCR mode the selection crosshair should not be displayed.
- I have not tested podofo with an English document, just with these complex script ones. At one time I saw a help text saying that qtprinter should be used for complex scripts, but I am not sure where the cursor was hovering at that point. I couldn't get it to display again.
Would it help to make qtprinter as the default choice showing up in export pdf dialogue for complex scripts?
- Valid point
- Just click on the hint-icon next to the combobox
- I don't know how to reliably detect whether complex scripts are involved.
When using 'PDF with invisible text overlay', the pdf file size becomes much larger.
E.g. with a 600 dpi image of 512 kb, the resulting pdf is 1175 kb with the default setting of 300 dpi in the export dialog.
Ah, apropos, I see now that the windows build is missing some icons, which is why you can't see, e.g., the hint icon.
Size: that's the price of QPrinter. Nothing I can do about that. QPrinter internally hard-codes the image compression method to JPEG@94% quality.
Please see http://doc.qt.io/qt-5/qimagewriter.html
Qt provides the QImageWriter class which supports setting format specific options, such as the gamma level, compression level and quality, prior to storing the image.
By changing the image options when exporting 'PDF with invisible text overlay' with qtprinter, the pdf size can be reduced.
I changed the settings from color to monochrome and the dpi from 300 to 100.
The resulting pdf size is now 355 kb instead of 1175kb.
> Please see http://doc.qt.io/qt-5/qimagewriter.html
Sure, but QPrinter does not use QImageWriter
You could offer option to change the printermode as part of export pdf dialog
enum PrinterMode { ScreenResolution, PrinterResolution, HighResolution }
That enum has no effect since it is overridden by the resolution the user chooses.
> I changed the settings from color to monochrome and dpi from 300 to 100. The resulting pdf size is now 355 kb instead of 1175kb.
Changed format to grayscale instead of monochrome and 100 dpi, resulting pdf is 276kb.
Of course, without original image, the pdf size is much smaller, so could be made at 300 dpi.
For monochrome you really need CCITT/FAX encoding to have a reasonably small file size, but as mentioned, it is not doable with QPrinter.
Thanks!
Going back to the original issue report and current status:
- Originally with podofo, for Devanagari script: if I copy the text from the pdf and paste it into notepad++, the rendering is correct. However, the rendering in the pdf file itself is incorrect.
- Currently with podofo, for Devanagari script: the pdf is not created, is zero size, or is locked.
- Currently with qtprinter, for Devanagari script: the rendering in the pdf preview and the pdf file is correct. The overlapping character problem can be fixed by reducing the font size %. However, when I copy the text from the pdf and paste it into notepad++, the rendering is incorrect.
> The rendering in pdf preview and pdf file is correct. Overlapping character problem can be fixed by reducing the font size %. However, when I copy the text from pdf and paste in notepad++, the rendering is incorrect.
Well that sucks. I don't think there is anything I can do here... Again, it is QPainter internals.
Assuming that the regression regarding podofo and Devanagari can be fixed, I think the best option might be to use:
- PoDoFo
- PDF with invisible text layer
- fax-level compression for the image
That way, the visible part of pdf will be correct since it uses the original image.
And, the text layer will be correct (as per earlier test with podofo).
PoDoFo can definitely be fixed. I'll test it on windows this evening and see what went wrong, and I'll post a fresh test build as soon as I have fixed things.
It is kinda odd though that the Devanagari script is correctly rendered using QPainter, but is wrong when copying.
This is a known problem with most pdf writers for complex scripts. The glyphs for combined consonants, reordered combining marks do not get copied correctly from pdfs.
XeTeX, with its support for actual text, handles this correctly, and so does PoDoFo, based on my earlier test.
PDFs created by OpenOffice and LibreOffice have the same problems.
Complex script text can also be copied correctly from PDFs created by tesseract, which use the original image for the visual layer.
About the PoDoFo locking issue: isn't it just that you have the output PDF open in a PDF viewer or such which is locking the file?
With PoDoFo
Export to PDF dialog closes but there is no indication whether the export is completed.
When I look in File Manager, it shows a pdf of 0kb.
On refreshing File Manager after a while, pdf file shows up with a size.
When I double click to open it, Adobe Reader gives an error saying file in use or open in another application.
So, it seems to be locked by gimagereader.
Are you creating a new file when exporting or overwriting an existing one? If the latter, are you sure that file isn't open in another application?
I've updated the test builds with a couple of fixes, one might be related to the issue you are seeing. Links as usual:
- 32 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_i686.exe
- 64 bit: https://smani.fedorapeople.org/tmp/gImageReader_3.2.3_qt5_x86_64.exe
Thanks for the prompt test build. It is working fine now, i.e.
- HOCR mode, crosshair not being displayed.
- Hint-icon is being displayed next to combo-box.
- PoDoFo printer is NOT locking the pdf file.
For Devanagari, the export to pdf option that worked well for my test image:
- PoDoFo for PDF export backend.
- PDF with invisible text overlay for PDF export output mode.
- Image Settings:
  - Format: Grayscale
  - DPI: 100
  - Compression: Jpeg
  - Compression quality: 60
The generated pdf is 164 kb. Original image was 512kb at 600 dpi. The Devanagari text can be correctly copied and pasted as text. (If you want to test this for other complex scripts, export to TXT and export to PDF and copy and paste text from that and compare the two files).
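The comparison suggested above (text from the TXT export against text copied out of the PDF) could be automated along these lines. This is only a sketch: `first_difference` is a hypothetical helper, and it assumes both texts are already available as strings. Whitespace is collapsed because line breaks usually differ between the two sources.

```python
import unicodedata


def _canonical(text):
    """NFC-normalize and collapse all whitespace runs to single spaces."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def first_difference(expected, copied):
    """Return the index of the first differing code point, or None if equal."""
    a, b = _canonical(expected), _canonical(copied)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


txt_export = "\u0915\u093F"     # कि in stored order: KA, VOWEL SIGN I
pdf_copy = "\u093F\u0915"       # same glyphs copied in visual order
print(first_difference(txt_export, pdf_copy))
```

A result of `None` means the copied text matches the TXT export; any index points at the first place where the PDF text layer diverges.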
So you are using PoDoFo "PDF with invisible text overlayer" and QPrinter for regular PDF, right?
Yes, I think both options should be available for users.
Does the hocr option have a way to display only those words which have low confidence? Might make it easier for correction.
Also, using different traineddata files, in hocr mode, different words get dropped from recognition.
> Does the hocr option have a way to display only those words which have low confidence? Might make it easier for correction.
How would you define "low"?
> Also, using different traineddata files, in hocr mode, diff words get dropped from recognition.
Not following what you mean.
- Low could be a user-defined percentage. I will have to check, but I think for Devanagari documents the confidence level was as low as 0 for some words.
- Again, this is based on tests with documents in Devanagari script, which can be processed using multiple traineddata files such as devanagari, hin, san, mar and nep. The OCR process drops certain words in recognition. However, different language traineddata give different results. E.g. a word may be dropped by devanagari but recognised by hin.
This dropping of words might also be related to confidence levels.
A related question: is the hocr demarcation of text blocks etc. a common layout analysis routine, or does it depend on the traineddata?
I will provide an example with samples tomorrow. That will help clarify.
I'm afraid I can't help much with traineddata issues or with what hocr text tesseract produces. You'll have to take those issues upstream.
> You'll have to take those issues upstream.
That's what I thought. Thanks!
I looked at the word confidence values. They range from 0 to 90+. It would be helpful to have a filter on the conf values, so e.g. a user could choose to look at values below 10%, 20%, 50% - any threshold they choose.
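As a rough illustration of such a filter (not how gImageReader works internally), the sketch below pulls the `x_wconf` value out of each `ocrx_word` span in an hOCR string and keeps the words below a chosen threshold. `WordConfParser` and `low_confidence_words` are hypothetical names, and the sample markup is a hand-written assumption modelled on tesseract's hOCR output.

```python
import re
from html.parser import HTMLParser


class WordConfParser(HTMLParser):
    """Collect (text, confidence) for every ocrx_word span that has x_wconf."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._conf = None
        self._chunks = []
        self._depth = 0  # span nesting depth inside the current word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:          # nested span inside a word
            self._depth += 1
            return
        attributes = dict(attrs)
        if attributes.get("class") == "ocrx_word":
            match = re.search(r"x_wconf\s+(\d+)", attributes.get("title") or "")
            if match:
                self._conf = int(match.group(1))
                self._chunks = []
                self._depth = 1

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.words.append(("".join(self._chunks).strip(), self._conf))


def low_confidence_words(hocr, threshold):
    """Return the (text, confidence) pairs whose confidence is below threshold."""
    parser = WordConfParser()
    parser.feed(hocr)
    parser.close()
    return [(text, conf) for text, conf in parser.words if conf < threshold]


sample = (
    "<span class='ocr_line' title='bbox 0 0 100 20'>"
    "<span class='ocrx_word' id='w1' title='bbox 1 2 3 4; x_wconf 93'>good</span> "
    "<span class='ocrx_word' id='w2' title='bbox 5 6 7 8; x_wconf 0'>bad</span>"
    "</span>"
)
print(low_confidence_words(sample, 50))
```

The threshold is left to the caller, which matches the idea of letting the user pick 10%, 20%, 50% or any other cut-off.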
I am trying to use PoDoFo for Arabic. It seems that the letters get reversed in each word.
Hello World becomes olleH dlroW
I suggest adding an option to output RTL text (which reverses the letters in each word),
because QPrinter is not suitable for monochrome documents: an 18 MB original pdf file becomes a 1.4 GB pdf with invisible text using QPrinter.
hocr2pdf is a good alternative; consider also the iText library.
hocr2pdf is just a node.js wrapper around tesseract AFAICS, and itext is a proprietary Java/.NET library.
For it to work in gImageReader I'm afraid I currently don't see any other way than actually implementing the missing support for complex scripts in PoDoFo. This though requires thorough knowledge of the PDF spec and time, both of which are currently lacking.
@bmwmy is the reversal problem also there in the txt and HOCR output of tesseract?
Related issue - https://github.com/tesseract-ocr/tesseract/issues/238
Arabic language (right to left in writing) stored (left to right) after create PDF Searchable
Wow, this looks really painful to handle...
I am trying to reverse the text of every word in the code in HOCRPdfExporter.cc, line 729, changing
painter.drawText(wordRect.x() * px2pu, y * px2pu, wordItem->text());
to
painter.drawText(wordRect.x() * px2pu, y * px2pu, reverseSTR(wordItem->text()));
This should be enough for the Arabic RTL problem; I am not sure about other complex scripts.
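One caveat with a plain character-by-character reversal: Arabic vowel marks are combining characters stored after their base letter, so a naive reversal moves them in front of it. The Python sketch below shows the idea of reversing while keeping each mark attached to its base. `reverse_clusters` is a hypothetical helper, not the `reverseSTR` function mentioned above (whose implementation is not shown in this thread), and it performs neither Arabic shaping nor the Unicode bidirectional algorithm.

```python
import unicodedata


def reverse_clusters(text):
    """Reverse text, keeping combining marks after their base character.

    Only marks with a non-zero canonical combining class are treated as
    attached; this covers Arabic harakat but is not full grapheme clustering.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "\u0628\u064E\u062A"  # BEH, FATHA (combining), TEH
naive = word[::-1]           # TEH, FATHA, BEH: the mark now follows TEH
kept = reverse_clusters(word)  # TEH, BEH, FATHA: the mark stays with BEH
print([f"U+{ord(c):04X}" for c in naive])
print([f"U+{ord(c):04X}" for c in kept])
```

In the Qt code the same grouping would have to be done on the `QString` before calling `drawText`; whether reversal is the right fix at all is still an open question in this thread.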
But I am having a hard time compiling with docker!
I'll keep trying.
What issues are you encountering with docker? Happy to help there.
Actually, I am using docker toolbox on windows via a VirtualBox VM instance. When I try the first command to build the image, the fedora OS says the GPG key is missing or something like that. Can I compile for windows from Ubuntu using docker? That was going to be my next attempt.
@Shreeshrii No, the HOCR and plain text outputs are correct.
@bmwmy Looks like some transient issues with the Fedora repos, you can work around it by adding --nogpgcheck to [1], i.e. dnf install -y --nogpgcheck.
Sure, you can use docker on any OS it runs on.
[1] https://github.com/manisandro/gImageReader/blob/master/packaging/win32/Dockerfile#L9
FYI
Please see attached. It is the output from export to pdf from scribus 1.5.4svn. It seems to have correct Arabic support and loads podofo as one of its components. Not sure if it helps with the hocr2pdf issue.
Related blog post: http://host-oman.blogspot.in/2017/02/first-5-arabic-books-typesetting-in.html
FYI
It seems that scribus deals with RTL languages differently: as far as I can tell, they store Arabic text in reverse order. There are some ifs in there to detect whether the text is Arabic!
https://github.com/scribusproject/scribus/search?q=arabic&unscoped_q=arabic