Junk characters in text retrieved with the PDFUtil getText method

Open madhur-dumane opened this issue 8 years ago • 13 comments

I am using the code below to read the full text of two PDFs into strings and then compare the two strings.

    String str = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + prereport + ".pdf");
    String str1 = pdfutil.getText("C:\\Users\\" + System.getProperty("user.name") + "\\Downloads" + "\\" + postreport + ".pdf");
    System.out.println("Check the text from both PDFs : " + str.equalsIgnoreCase(str1));

When I retrieve the PDF text into a string, I get characters like the following instead of the actual text: jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF

madhur-dumane avatar Oct 25 '17 05:10 madhur-dumane
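
Hand-concatenated Windows paths like the ones in the post above are easy to get wrong, because every backslash has to be escaped in a Java string literal. A minimal sketch of building the same kind of path with `java.nio.file.Paths` follows. The class and method names are hypothetical, and using the `user.home` system property in place of `"C:\\Users\\" + user.name` is an assumption that may not hold on every machine.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of pdf-util.
class ReportPaths {
    // Resolves <home>/Downloads/<reportName>.pdf using the platform's separator.
    static Path downloadedReport(String home, String reportName) {
        return Paths.get(home, "Downloads", reportName + ".pdf");
    }

    // Assumption: the Downloads folder sits directly under the user's home directory.
    static Path downloadedReport(String reportName) {
        return downloadedReport(System.getProperty("user.home"), reportName);
    }
}
```

The resulting path could then be passed on as a string, for example `pdfutil.getText(ReportPaths.downloadedReport(prereport).toString())`.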

Can you share the PDFs?

vinsguru avatar Oct 25 '17 12:10 vinsguru

They contain confidential data, so we cannot share them.

madhur-dumane avatar Nov 03 '17 09:11 madhur-dumane

[Image: first page of the PDF]

Above is the first page of the PDF, and the corresponding retrieved text is:

Length 18332 Text : jlkqeiv qobka obmloq _v ^``lrkq mêÉé~êÉÇ=Ñçê ^g^o ag _ìííÉêÑáÉäÇ jçåíÜäó=qêÉåÇ=oÉéçêí=ÖÉåÉê~íÉÇ=çå lÅíçÄÉê=NPI=OMNT=~í=PWMR=~ã=EbpqF AJA

The text from the last 3 lines is retrieved correctly, but "Monthly Trend Report" comes out as junk characters. I have erased some data from the image. Similarly, text such as "Page 3 of 8" on the following pages is retrieved as junk in the same way. Note: I replace the generated time with an empty string, i.e. while retrieving I convert the string "Generated----EST" to an empty string.

Please help us to resolve this issue.

madhur-dumane avatar Nov 08 '17 09:11 madhur-dumane
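
The step described in the note above, dropping the generation timestamp before comparing, could be sketched roughly as follows. The class and method names are hypothetical, and the exact wording of the timestamp line ("generated on ... (EST)") is an assumption; the pattern would need adjusting to the real report. It also only helps where the text is extracted correctly, since a garbled timestamp will not match.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdf-util.
class ReportTextNormalizer {
    // Assumed shape of the volatile text: "generated on <date> at <time> (EST)".
    private static final Pattern GENERATED_STAMP =
            Pattern.compile("generated on .*? \\(EST\\)", Pattern.CASE_INSENSITIVE);

    // Removes every match of the assumed timestamp pattern from the text.
    static String stripGeneratedStamp(String text) {
        return GENERATED_STAMP.matcher(text).replaceAll("");
    }

    // Case-insensitive comparison of two texts after the timestamp is removed.
    static boolean sameIgnoringStamp(String a, String b) {
        return stripGeneratedStamp(a).equalsIgnoreCase(stripGeneratedStamp(b));
    }
}
```

With this, two reports that differ only in their generation time would compare as equal.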

Any solution?

madhur-dumane avatar Nov 10 '17 10:11 madhur-dumane

Unfortunately I am unable to replicate the issue. What OS do you use?

vinsguru avatar Nov 10 '17 15:11 vinsguru

Windows 7

madhur-dumane avatar Nov 13 '17 06:11 madhur-dumane

Any solution for this?

madhur-dumane avatar Nov 20 '17 09:11 madhur-dumane

I tried with different PDFs. I am finding it very difficult to replicate! That's why I could not fix this. pdf-util internally uses Apache PDFBox. The problem could be in PDFBox as well.

vinsguru avatar Nov 20 '17 19:11 vinsguru

I am checking whether I can share the PDF with you. But if the issue turns out to be in PDFBox, can it be fixed or not?

madhur-dumane avatar Nov 28 '17 08:11 madhur-dumane

Is there any other way to do PDF comparison?

madhur-dumane avatar Jan 08 '18 06:01 madhur-dumane

After more analysis, we found that when the font of the text in the PDF is [PDType1CFont SUBSET+CZGA00T1U00037], the text is not retrieved correctly. The PDF document we are trying to read has some custom fonts embedded in it. Can you check now what can be done?

madhur-dumane avatar Jan 17 '18 12:01 madhur-dumane
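
One observation on the sample text quoted earlier in this thread: its ASCII part reads as ordinary text with every character code raised by 29. That would be consistent with the embedded subset font not carrying a usable mapping from its character codes to Unicode, although nothing in the thread confirms this. The sketch below is only a diagnostic probe for that pattern, not a fix. The class name, method name, and the offset of 29 are assumptions taken from this one sample, and characters that already fall outside printable ASCII (the accented ones) are left untouched.

```java
// Hypothetical diagnostic helper, not part of pdf-util or PDFBox.
class ShiftProbe {
    // Offset observed in this thread's sample only; an assumption, not a general rule.
    static final int OFFSET = 29;

    // Shifts printable ASCII characters back by OFFSET; leaves everything else as-is.
    static String shiftBack(String junk) {
        StringBuilder out = new StringBuilder(junk.length());
        for (char c : junk.toCharArray()) {
            int shifted = c - OFFSET;
            boolean leaveAlone = Character.isWhitespace(c) // real spaces stay spaces
                    || c > 0x7E                            // non-ASCII input is not handled here
                    || shifted < 0x20;                     // result would not be printable
            out.append(leaveAlone ? c : (char) shifted);
        }
        return out.toString();
    }
}
```

For the first words of the sample, `ShiftProbe.shiftBack("jlkqeiv qobka obmloq")` yields `MONTHLY TREND REPORT`. If the same holds for other affected strings, that points at the font's encoding rather than at the comparison code.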

Can you please check on this?

madhur-dumane avatar Jan 19 '18 10:01 madhur-dumane

Hi

Can you please check my latest comment on this issue and let us know if you have any solution?

Thanks in advance,
Madhuri

madhur-dumane avatar Jan 19 '18 10:01 madhur-dumane