pdfminer scanned pdf or native pdf

Open longbowking opened this issue 5 years ago • 2 comments

Given a PDF file, how can I tell whether it is a native PDF or a scanned PDF using pdfminer? Any suggestions?

Dec 02 '19 11:12 longbowking

You could extract all the text and treat the PDF as "scanned" if there is too little. Of course, this would fail for "native" PDFs that contain no text.
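A minimal sketch of that heuristic, assuming pdfminer.six is installed. The function names `looks_scanned` and `page_texts_from_pdf` and the threshold of 20 characters per page are illustrative choices, not part of pdfminer, and the threshold would need tuning on real documents.

```python
from typing import Iterable, List


def looks_scanned(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Guess "scanned" when the average amount of extractable text per page is tiny.

    min_chars_per_page is an arbitrary, illustrative threshold.
    """
    # Count non-whitespace characters on each page.
    counts = [len("".join(text.split())) for text in page_texts]
    if not counts:
        return True  # no pages yielded any text at all
    return sum(counts) / len(counts) < min_chars_per_page


def page_texts_from_pdf(path: str) -> List[str]:
    """Collect the extractable text of each page using pdfminer.six."""
    # Imported here so the heuristic above can be used without pdfminer installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    texts = []
    for page_layout in extract_pages(path):
        # Join the text of every text container found on this page.
        texts.append(
            "".join(
                element.get_text()
                for element in page_layout
                if isinstance(element, LTTextContainer)
            )
        )
    return texts
```

Usage would look like `looks_scanned(page_texts_from_pdf("document.pdf"))`, where `document.pdf` is a placeholder path. As noted above, a native PDF that contains no text would be misclassified as scanned.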

Dec 02 '19 11:12 himanshugarg

Yes, that's another case where this heuristic will fail.

Dec 02 '19 11:12 himanshugarg