pdf-util: junk characters in text retrieved with the PDF util class method getText

I am using the code below to read the whole text of two PDFs into strings and then compare the two strings:

String str = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads\\" + prereport + ".pdf");
String str1 = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads\\" + postreport + ".pdf");
System.out.println("Check the text from both PDFs : " + str.equalsIgnoreCase(str1));
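As a side note on the snippet above, the file paths are built by string concatenation with hand-written backslashes, which is easy to get wrong. A minimal sketch of an alternative follows, assuming the reports sit in a `Downloads` folder directly under the directory reported by the `user.home` system property (typically `C:\Users\<name>` on Windows, but that is an assumption). `ReportPaths` and `downloadedReport` are hypothetical names, not part of pdf-util.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class ReportPaths {
    // Builds <user.home>/Downloads/<reportName>.pdf using the platform's separator.
    // Assumes the Downloads folder lives directly under the user's home directory.
    static Path downloadedReport(String reportName) {
        return Paths.get(System.getProperty("user.home"), "Downloads", reportName + ".pdf");
    }

    public static void main(String[] args) {
        // "prereport" is only a placeholder report name for this demo.
        System.out.println(downloadedReport("prereport"));
    }
}
```

The resulting path could then be passed on as `downloadedReport(prereport).toString()` wherever a `String` file name is expected.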

When I retrieve the PDF text into a string, I get characters like the following instead of the actual text: jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF

Oct 25 '17 05:10 madhur-dumane

Can you share the PDFs?

Oct 25 '17 12:10 vinsguru

They contain confidential data, so we cannot share them.

Nov 03 '17 09:11 madhur-dumane

[Image: first page of the PDF]

Above is the first page of the PDF; the corresponding retrieved text looks like this:

Length 18332 Text : jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF AJA

The last 3 lines are retrieved correctly, but the "Monthly Trend Report" text comes out as junk characters. I have erased some data from the image. Similarly, text such as "Page3 of 8" on the following pages is retrieved as junk in the same way. Note: I replace the generated time with an empty (null) string, i.e. while retrieving I convert the string "Generated----EST" to an empty string.

Please help us to resolve this issue.
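The note above about blanking out the generated time before comparing can be sketched roughly as follows. This is only an illustration under assumptions: the regular expression (the word "generated", then anything up to the first "EST" on the same line) is a guess at the "Generated----EST" shape, the sample strings are made up, and `ReportTextNormalizer` is a hypothetical helper, not part of pdf-util.

```java
import java.util.regex.Pattern;

class ReportTextNormalizer {
    // Hypothetical pattern: "generated", then as little as possible up to "EST".
    // "." does not match line breaks by default, so a match stays on one line.
    private static final Pattern GENERATED_STAMP =
            Pattern.compile("generated.*?EST", Pattern.CASE_INSENSITIVE);

    // Removes every generated-time stamp from the extracted text.
    static String stripGeneratedStamp(String text) {
        return GENERATED_STAMP.matcher(text).replaceAll("");
    }

    // Compares two extracted texts while ignoring the generated-time stamps.
    static boolean sameIgnoringStamp(String a, String b) {
        return stripGeneratedStamp(a).equalsIgnoreCase(stripGeneratedStamp(b));
    }

    public static void main(String[] args) {
        // Made-up sample texts that differ only in the time stamp.
        String pre = "Report generated on October 13, 2017 at 3:05 am (EST)\nTotal 10";
        String post = "Report generated on October 14, 2017 at 4:10 pm (EST)\nTotal 10";
        System.out.println(sameIgnoringStamp(pre, post));
    }
}
```

Note that this only helps when the stamp is extracted as readable text; if the stamp itself comes out as junk characters, the pattern will not match it.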

Nov 08 '17 09:11 madhur-dumane

Any solution?

Nov 10 '17 10:11 madhur-dumane

Unfortunately I am unable to replicate the issue. What OS do you use?

Nov 10 '17 15:11 vinsguru

Windows 7

Nov 13 '17 06:11 madhur-dumane

Any solution for this?

Nov 20 '17 09:11 madhur-dumane

I tried with different PDFs, but I am finding it very difficult to replicate, which is why I have not been able to fix this. pdf-util internally uses Apache PDFBox, so the problem could be in PDFBox as well.

Nov 20 '17 19:11 vinsguru

I am checking whether I can share the PDF with you. But if the issue turns out to be in PDFBox, can it be fixed or not?

Nov 28 '17 08:11 madhur-dumane

Is there any other way to do PDF comparison?

Jan 08 '18 06:01 madhur-dumane

After more analysis, we found that text is not retrieved correctly when its font in the PDF is [PDType1CFont SUBSET+CZGA00T1U00037]. The PDF document we are trying to read has some custom fonts embedded in it. Can you check what can be done now?
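One observation on the sample quoted earlier in this thread: its ASCII characters appear to sit a constant 29 code points above the expected ones, so "jlkqeiv qobka obmloq" shifts back to "MONTHLY TREND REPORT". That would be consistent with a subset font whose character codes do not map to standard Unicode values, though whether that is the actual cause here is an assumption. The sketch below is a diagnostic only, not a fix: the shift of 29 is taken from that one sample, the accented characters are left untouched because their mapping is not clear from the sample, and `ShiftedGlyphDecoder` is a hypothetical helper, not part of pdf-util or PDFBox.

```java
class ShiftedGlyphDecoder {
    // Offset observed in the sample from this thread; other PDFs or fonts may differ.
    static final int SHIFT = 29;

    // Shifts ASCII characters back by SHIFT when the result is printable ASCII.
    // Everything else (spaces, line breaks, accented characters) is left as-is.
    static String decode(String junk) {
        StringBuilder out = new StringBuilder(junk.length());
        for (char c : junk.toCharArray()) {
            int shifted = c - SHIFT;
            boolean remap = c <= 0x7E && shifted >= 0x20;
            out.append(remap ? (char) shifted : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("jlkqeiv qobka obmloq")); // prints "MONTHLY TREND REPORT"
    }
}
```

Because correctly extracted text would be corrupted by the same shift, a helper like this is only useful for confirming the pattern on strings already known to be junk, not as a blanket post-processing step.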

Jan 17 '18 12:01 madhur-dumane

Can you please check on this?

Jan 19 '18 10:01 madhur-dumane

Hi

Can you please check my latest comment on this issue and let us know if you have any solution.

Thanks in advance,
Madhuri


Jan 19 '18 10:01 madhur-dumane
