pdftools

Add link to tesseract in the README?

Open maelle opened this issue 9 years ago • 1 comment

Maybe some users will come here after getting a pdf that is a scanned image of text and not know what to do?

maelle, Nov 23 '16 14:11

Yes, I agree, this would be really useful. Maybe add some info, or add a function like "is scanned" with some proven criteria.

The workaround I found, I think from looking at some Stack Overflow questions, was to look at the fonts: image PDFs tend to have zero or very few fonts, while text PDFs usually have more.

Anyway, it's not a safe criterion: in my case I got some exceptions that had only a few fonts but were perfectly fine text PDFs.
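For anyone who wants to try the font-count workaround, here is a rough sketch. `pdf_fonts()` and `pdf_text()` are pdftools functions; `pdf_scan_stats()`, `looks_scanned()` and both thresholds are hypothetical names and guesses of mine, not part of pdftools. Combining the font count with the amount of extracted text is my own assumption for reducing false positives, not a proven criterion.

```r
# Collect the two signals for one PDF: number of fonts and the number of
# non-whitespace characters that pdf_text() extracts from each page.
pdf_scan_stats <- function(path) {
  fonts <- pdftools::pdf_fonts(path)   # data frame, one row per font
  text  <- pdftools::pdf_text(path)    # character vector, one element per page
  list(
    n_fonts        = nrow(fonts),
    chars_per_page = nchar(gsub("[[:space:]]", "", text))
  )
}

# Hypothetical rule of thumb (thresholds are guesses): flag the file only
# when it has few fonts AND more than half of its pages yield almost no text.
looks_scanned <- function(n_fonts, chars_per_page,
                          max_fonts = 1, min_chars = 20) {
  few_fonts    <- n_fonts <= max_fonts
  mostly_empty <- mean(chars_per_page < min_chars) > 0.5
  few_fonts && mostly_empty
}

# Usage, assuming "scan.pdf" exists:
# stats <- pdf_scan_stats("scan.pdf")
# looks_scanned(stats$n_fonts, stats$chars_per_page)
```

A text PDF with very few fonts still passes this check as long as its pages contain text. A scanned PDF that already carries an OCR text layer will not be flagged, so treat the result as a hint only.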

And after the function, add a comment referring to the tesseract package; it would save a huge amount of time. ( https://cran.r-project.org/web/packages/tesseract/tesseract.pdf )

Also, some experience with tesseract on other languages: you can configure it and download language packages (like Spanish). It works quite well out of the box, but you must specify the language. (See the tesseract_download function in the PDF.)
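A minimal sketch of that language setup, assuming the tesseract and pdftools packages are installed. `ocr_engine_for()` and `ocr_scanned_pdf()` are hypothetical helper names of mine; the calls inside them (`tesseract_info()`, `tesseract_download()`, `tesseract()`, `ocr()`, `pdf_convert()`) are the packages' own functions. "spa" is the Spanish language code and 300 dpi is just a reasonable guess.

```r
# Build an OCR engine for one language, downloading its trained data
# first if it is not installed yet.
ocr_engine_for <- function(lang = "spa") {
  if (!lang %in% tesseract::tesseract_info()$available) {
    tesseract::tesseract_download(lang)
  }
  tesseract::tesseract(language = lang)
}

# Render each page of a scanned PDF to PNG, then OCR the pages one by one.
ocr_scanned_pdf <- function(path, lang = "spa", dpi = 300) {
  engine <- ocr_engine_for(lang)
  # pdf_convert() writes the PNG files to the working directory
  pages <- pdftools::pdf_convert(path, format = "png", dpi = dpi)
  on.exit(unlink(pages))  # remove the temporary PNG files afterwards
  vapply(pages, function(p) tesseract::ocr(p, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Usage, assuming "scan.pdf" is a scanned Spanish document:
# txt <- ocr_scanned_pdf("scan.pdf", lang = "spa")
# cat(txt[1])
```

The result has one string per page, the same shape that `pdf_text()` returns for a regular text PDF.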

An example where the criterion is "not safe": https://revistas.unlp.edu.ar/raab/article/view/786/2988

Hope it helps! Also, thanks for the amazing pdftools!

jas1, Apr 03 '18 09:04