pdfminer.six

pdfminer can't extract text from some PDF files but pypdf can?

Open ramtalentrecruit opened this issue 2 years ago • 5 comments

Feature request

Thanks for your suggestion on improving pdfminer.six. To help us discuss and implement this request, please make sure to include the following information:

  • Some types of PDF files contain very detailed information and come in different styles.
  • These PDF files contain images, but their text can be extracted without OCR. That is why pypdf can extract information from them.

ramtalentrecruit avatar Jan 02 '23 09:01 ramtalentrecruit

Could you provide these PDF files here? Also, did those PDFs have only images and no text? If so, how did you conclude that OCR was not used and the text was still extracted?
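
For anyone who wants to reproduce the comparison on their own files, a minimal sketch along these lines may help. It assumes both pdfminer.six and pypdf are installed; the helper names (`extract_with_pdfminer`, `extract_with_pypdf`, `compare_extractors`) and the path `sample.pdf` are made up for illustration and are not part of either library. If one extractor returns text where the other returns none, the PDF has a real text layer and OCR was not involved.

```python
from typing import Callable, Dict


def extract_with_pdfminer(path: str) -> str:
    """Extract text with pdfminer.six's high-level API."""
    from pdfminer.high_level import extract_text  # needs pdfminer.six installed
    return extract_text(path) or ""


def extract_with_pypdf(path: str) -> str:
    """Extract text page by page with pypdf."""
    from pypdf import PdfReader  # needs pypdf installed
    reader = PdfReader(path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def compare_extractors(
    path: str, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, int]:
    """Return, per extractor, the number of non-whitespace characters found.

    An extractor that raises is reported with a count of -1 (an arbitrary
    convention chosen for this sketch).
    """
    counts: Dict[str, int] = {}
    for name, extract in extractors.items():
        try:
            text = extract(path)
        except Exception as exc:  # broad on purpose: we only want a summary
            print(f"{name}: failed with {exc!r}")
            counts[name] = -1
            continue
        counts[name] = sum(1 for ch in text if not ch.isspace())
    return counts


# Hypothetical default set of extractors to compare.
DEFAULT_EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "pdfminer.six": extract_with_pdfminer,
    "pypdf": extract_with_pypdf,
}
```

Calling `compare_extractors("sample.pdf", DEFAULT_EXTRACTORS)` returns a character count per library. A large gap between the two counts is the kind of evidence that would make this report actionable even without sharing the original files.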

vilabho avatar Feb 11 '23 21:02 vilabho

Thanks for your response. As I said, pypdf extracted text from those files; they contain both images and text. The task is to extract the text, not the images. I can't provide those files here, but I will be very happy to share them by email. You can send an email here

mrm202 avatar Feb 14 '23 03:02 mrm202

I have sent an email; kindly share your files there.

vilabho avatar Feb 22 '23 06:02 vilabho

I didn't get your email. Can you please send it again to this address? [email protected]

mrm202 avatar Feb 22 '23 10:02 mrm202

I have sent the reply again to the email address mentioned above. Please check your Spam/Junk folder as well.

vilabho avatar Feb 22 '23 20:02 vilabho