rmast comments

Results: 184 comments of rmast

pdfcomp: problems with inverted text that is often better in hocr.

If I invert the complete image via [pinetools.com/invert-image-colors](https://pinetools.com/invert-image-colors) and repeat the steps, all text seems correct in Tesseract and sharp in the resulting PDF, despite both inverted and non-inverted text...
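The inversion itself can also be tried locally instead of through the web tool. This is only a minimal sketch, assuming the image is already available as raw 8-bit grayscale samples; the function name `invert_gray8` and the tiny sample are made up for illustration, and a real workflow would more likely rely on an image library.

```python
def invert_gray8(pixels: bytes) -> bytes:
    """Invert 8-bit grayscale samples: 0 (black) becomes 255 (white) and vice versa."""
    return bytes(255 - p for p in pixels)


# Made-up 2x2 image stored row by row (not taken from the issue).
sample = bytes([0, 255, 255, 0])
inverted = invert_gray8(sample)
print(list(inverted))
```

Applying the function twice returns the original samples, so it is easy to check that nothing else in the image changes.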

pdfcomp: problems with inverted text that is often better in hocr.

I found a workaround to get the OCR correct: create a file `tess.cfg` containing `tessedit_do_invert True` and call `ocrmypdf -l nld 175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg --tesseract-config tess.cfg ocrkwaliteit.pdf`. The...
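The workaround above could be scripted roughly as follows. This is a sketch, not a tested pipeline: the helper names `write_tess_config` and `build_ocrmypdf_command` are hypothetical, the file names are the ones quoted in the comment, and the script only prints the command rather than running it.

```python
import shlex
import tempfile
from pathlib import Path

# Config line quoted in the comment above.
TESS_CONFIG_TEXT = "tessedit_do_invert True\n"


def write_tess_config(path: Path) -> Path:
    """Write the Tesseract config file and return its path."""
    path.write_text(TESS_CONFIG_TEXT, encoding="utf-8")
    return path


def build_ocrmypdf_command(image, output, config, language="nld"):
    """Build the ocrmypdf argument list in the same order as the comment."""
    return ["ocrmypdf", "-l", language, str(image),
            "--tesseract-config", str(config), str(output)]


# Demo in a throwaway directory; nothing is executed, the command is only printed.
with tempfile.TemporaryDirectory() as tmp:
    config = write_tess_config(Path(tmp) / "tess.cfg")
    written = config.read_text(encoding="utf-8")
    command = build_ocrmypdf_command(
        "175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg",
        "ocrkwaliteit.pdf",
        config,
    )
    print(shlex.join(command))
```

To actually run the OCR, the list could be passed to `subprocess.run(command, check=True)` while the config file still exists, assuming `ocrmypdf` is installed and on the `PATH`.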

pdfcomp: problems with inverted text that is often better in hocr.

[The new parameter Stefan Weil suggests](https://github.com/tesseract-ocr/tesseract/pull/3141) gives the same error.

pdfcomp: problems with inverted text that is often better in hocr.

When I look at the hOCR extracted from this "array"-containing PDF, the "wis-clear" part at the top right of the image appears in it twice, unfortunately both times with confidence 100. I...

pdfcomp: problems with inverted text that is often better in hocr.

You can already see that these are separately recognized words; for example, the third coordinate of the first 'w' differs from that of the second. But Stefan says this is not by design,...
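A quick way to spot such repeated words is to list every `ocrx_word` in the hOCR together with its bounding box and confidence. The sketch below is only an illustration: it assumes Tesseract-style spans where `class` precedes a `title` of the form `bbox x0 y0 x1 y1; x_wconf N`, the function name `find_repeated_words` is hypothetical, and the sample markup and coordinates are made up, not taken from the PDF in the issue.

```python
import re
from collections import defaultdict

# Matches Tesseract-style word spans; assumes class comes before title.
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"]bbox (\d+) (\d+) (\d+) (\d+); x_wconf (\d+)['\"][^>]*>"
    r"(.*?)</span>",
    re.S,
)


def find_repeated_words(hocr: str) -> dict:
    """Return {text: [(bbox, confidence), ...]} for words that occur more than once."""
    seen = defaultdict(list)
    for x0, y0, x1, y1, conf, inner in WORD_RE.findall(hocr):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested markup
        seen[text].append(((int(x0), int(y0), int(x1), int(y1)), int(conf)))
    return {text: hits for text, hits in seen.items() if len(hits) > 1}


# Made-up sample: the same word twice, differing only in the third coordinate.
SAMPLE = """
<span class='ocrx_word' id='word_1_1' title='bbox 900 40 1010 70; x_wconf 100'>wis-clear</span>
<span class='ocrx_word' id='word_1_2' title='bbox 900 40 1012 70; x_wconf 100'>wis-clear</span>
<span class='ocrx_word' id='word_1_3' title='bbox 100 200 180 230; x_wconf 96'>print</span>
"""

repeated = find_repeated_words(SAMPLE)
for text, hits in repeated.items():
    print(text, hits)
```

Comparing the boxes of the repeated entries shows whether they are genuinely separate detections or an exact duplicate of one detection.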

pdfcomp: problems with inverted text that is often better in hocr.

I couldn't get plain Tesseract to read the print/wis-clear text correctly on its own. Looking around for a solution, I stumbled upon [EasyOCR](https://github.com/JaidedAI/EasyOCR), which doesn't have hOCR output but comes with something similar...

pdfcomp: problems with inverted text that is often better in hocr.

I have been playing around with the new You.com YouChat, which is free to use at the moment. You can ask it questions, which are answered ChatGPT-style but with references and actual results from...

Use (not yet released) pdf->hocr conversion to improve compression for existing PDFs

If you could recognize the font, and it's a freely available font, then you could replace the invisible text with visible text in that font and remove the jb2. One of the...

correct ratio determination for noise estimation

The second commit is for solving this error: https://github.com/internetarchive/archive-pdf-tools/issues/55#issuecomment-1166449630

correct ratio determination for noise estimation

> btw, I think I fixed this in [3c20a46](https://github.com/internetarchive/archive-pdf-tools/commit/3c20a464f53ca0524268e35b998036d18b380b45) - can you confirm?

Without setting everything up again and retesting, I read through the issues to see what we were trying...