pdftools
Add link to tesseract in the README?
Maybe some users will come here after getting a pdf that is a scanned image of text and not know what to do?
Yes, I agree, this would be really useful. Maybe add some info, or a function like `is_scanned`, with some proven criteria.
The workaround I found, I think from a Stack Overflow question, was to look at the fonts: image PDFs tend to have zero or very few fonts, while text PDFs usually have more.
It's not a safe criterion, though: in my case there were exceptions that had only a few fonts and were still fine as text PDFs.
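A rough sketch of what such a check could look like, in case it helps the discussion. `looks_scanned` and `is_probably_scanned` are hypothetical names, not part of pdftools, and the thresholds are guesses rather than proven criteria. To avoid the false positives mentioned above (few fonts but real text), this version only flags a file when it has both few fonts and almost no extractable text. `pdf_fonts()` and `pdf_text()` are existing pdftools functions.

```r
# Hypothetical decision rule; max_fonts and min_chars are illustrative
# assumptions, not validated thresholds.
looks_scanned <- function(n_fonts, n_chars, max_fonts = 2, min_chars = 25) {
  # Require BOTH signals so a text PDF with few fonts is not flagged.
  n_fonts <= max_fonts && n_chars < min_chars
}

# Hypothetical wrapper around pdftools (must be installed to call this).
is_probably_scanned <- function(path, ...) {
  fonts <- pdftools::pdf_fonts(path)  # data frame, one row per font
  text <- pdftools::pdf_text(path)    # character vector, one element per page
  # Count non-whitespace characters across all pages.
  n_chars <- sum(nchar(gsub("\\s", "", text)))
  looks_scanned(nrow(fonts), n_chars, ...)
}
```

Usage would be something like `is_probably_scanned("paper.pdf")`, where the path is a placeholder. A scanned PDF that already carries an OCR text layer would not be flagged by this rule.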
And after the function, add a comment referring to the tesseract package; it would save a huge amount of time. ( https://cran.r-project.org/web/packages/tesseract/tesseract.pdf )
Also, some experience with tesseract on other languages: you can configure it and download language packs (Spanish, for example). It works quite well out of the box, but you must specify the language (see the tesseract_download function in the PDF).
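For reference, a minimal sketch of the Spanish case. `ocr_scanned_pdf` is a hypothetical helper name and `"spa"` is the Tesseract language code for Spanish; the calls it wraps (`tesseract_info()`, `tesseract_download()`, and `pdf_ocr_text()` from pdftools) exist, but treat the overall flow as an assumption about how one might combine them.

```r
# Hypothetical helper: OCR a scanned PDF in a given language.
# Requires the pdftools and tesseract packages to be installed.
ocr_scanned_pdf <- function(path, lang = "spa") {
  # Download the language data once if it is not already available.
  if (!lang %in% tesseract::tesseract_info()$available) {
    tesseract::tesseract_download(lang)
  }
  # Renders each page to an image and runs OCR on it;
  # returns a character vector with one element per page.
  pdftools::pdf_ocr_text(path, language = lang)
}
```

Calling `ocr_scanned_pdf("scan.pdf")` (placeholder path) should return the recognized text page by page. Leaving the language at the English default on a Spanish document tends to give noticeably worse results, which is why the language has to be set explicitly.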
An example of a case where the font criterion is "not safe": https://revistas.unlp.edu.ar/raab/article/view/786/2988
Hope it helps! Also, thanks for the amazing pdftools!