File parsed returns unreadable text

Open edodije opened this issue 5 years ago • 6 comments

Thank you for the awesome Pdfparser library. It really helps me a lot with my projects, and I have been using it quite often recently. However, in my latest project my PDF file was converted into unreadable text instead of plain text. A friend told me I should do something with the encoding, but I'm not really sure. I would be really glad if anybody could give me a hint about whether I am missing something or whether this is indeed a bug in the library. I've tried parsing the file as a whole and page by page, and neither worked.

Here is my code,

$PdfParser = new \Smalot\PdfParser\Parser();
$pdf = $PdfParser->parseFile($file);
// getText() is called on the parsed document; its result is stored in $text
$text = $pdf->getText();
echo $text;

And this is what it returns,

JHGSA IUYSHJG st GUH st GUH HUYGAH st JHGSA st st ttt t ss1 ss2 t 21666 !" #ssst #tt #t #ss1 #ss2 $# $ t%&t '(#tt 2$# t $$sst 2$ 2$'(#t 2$$sst t%&t ' # ttts !"#$ tt $$ss 2$$ss )tst t"+t t$ )tttst ,"1st$'(#t #s%( tt )t $t tst tss1 , )$ %' # ' # t$st HUYGAH IUYSHJG #$#$ %#$# % "t-." s )t$# )t$'(#t $s(%$" t%&t ttt 1st$%$ss tt tt%&t tt%&t ss/%$sst )$$sst )$'(#t %%'0t+#1 2sst t1 )tst )t$$ss tt%&t ttts )$# ,"t%&t ! "#$"#%!& !"#"#%!& !"#"#%!& ! ' ' ( )Y ! Z,-. JHGSA IUYSHJG GUH HUYGAH #sst t ss%tt# )ttt )t, )t )t($ '3st t % # ts%'1 )t*#t /%t$ss ttt%&t & tt(% & t-tts tt )tt 2$%$ss )tss2 )t' # t$st 2$tt%$ss 1st$$ss"tt t )t$$ss"tt t$%$ss t$$sst s$t%t )t%&t )$$ss )$%$ss )$tt%$ss $ t' ss ss/%t ' ss IUYSHJG& ' ss$ %%t1 tss4t tss5st tss5 tss5t tss51 tss52 ' # tst t' # ttts )t' st t$%" tt+t%" st%&t s&tt sss st sst st ! //$01$1% ) 2 (2) 2 () 31!#%%#% !"#"#%!& !"#"#%!& &$4$04%%%% ' ' ! &$4$04%%%% !"#$"#%!& !"#"#%!& !"#"#%!& ! ' ' ( )Y # Z,-.
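
The page-by-page attempt mentioned above can be sketched roughly as follows. This is only a minimal sketch: `textByPage` is a hypothetical helper name, not part of the library, and it assumes the parsed document exposes `getPages()` and each page exposes `getText()`, as the Document and Page classes of smalot/pdfparser do.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): returns the extracted
// text of each page, keyed by 1-based page number.
// Assumption: $pdf exposes getPages(), and each page exposes getText().
function textByPage(object $pdf): array
{
    $texts = [];
    foreach (array_values($pdf->getPages()) as $index => $page) {
        $texts[$index + 1] = $page->getText();
    }
    return $texts;
}

// Usage with the library (requires smalot/pdfparser to be loaded):
// $parser = new \Smalot\PdfParser\Parser();
// $pdf = $parser->parseFile($file);
// foreach (textByPage($pdf) as $number => $text) {
//     echo "--- page $number ---\n", $text, "\n";
// }
```

Printing each page separately makes it easier to see whether every page is garbled or only some of them.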

edodije, May 21 '19 03:05

Can you upload the PDF here so we can use it to reproduce the error? It must be free to use, because we will use it in the test suite.

k00ni, Jul 08 '20 07:07

Hello, I was able to parse PDFs using Smalot\PdfParser and would like to say it's awesome work you have done. The only problem I face is that it does not show the exact text when the language is Bengali. Is there any option to get the exact text when it's in Bengali?

taherbth, Jun 21 '23 17:06

@taherbth How is that related to this issue? If it is not, please open a new issue and provide some more information, such as example code, a sample PDF, and details about your setup like the PHP version.

k00ni, Jun 22 '23 06:06

Hi @k00ni, I have the same problem with one of my PDFs. At first I thought it was a multipage issue, but if I extract only one page, the problem persists.

My PHP version is 7.4.29. I attach an example (it is an extracted page since the original file has sensitive information).

The original is apparently generated with PDFlib ([Producer] => PDFlib+PDI 9.1.2p1 (C++ legacy/Win64)). If I edit it with macOS Preview ([Producer] => macOS Version 13.6.1 (Build 22G313) Quartz PDFContext), it parses fine, but I have no way to manually edit the system originals. For the example, I used the Fpdi library ([Producer] => FPDF 1.86), and the resulting file has the same parsing errors.
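
To compare producers across several files, something like the following might help. It is a rough sketch under assumptions: `producersByFile` is a hypothetical helper, and the metadata array is assumed to carry a `Producer` key, as the array returned by `getDetails()` in smalot/pdfparser does for these files. The reader is passed in as a callable so the helper itself does not depend on the library.

```php
<?php
// Hypothetical helper: maps each file path to its Producer metadata entry.
// $readDetails is any callable that returns the metadata array for a path;
// with smalot/pdfparser it would wrap parseFile($path)->getDetails().
function producersByFile(array $paths, callable $readDetails): array
{
    $producers = [];
    foreach ($paths as $path) {
        $details = $readDetails($path);
        // Fall back to a placeholder when the file carries no Producer entry.
        $producers[$path] = $details['Producer'] ?? '(unknown)';
    }
    return $producers;
}

// Usage with the library (requires smalot/pdfparser to be loaded; the file
// names are placeholders):
// $parser = new \Smalot\PdfParser\Parser();
// $result = producersByFile(
//     ['original.pdf', 'resaved.pdf'],
//     fn (string $p): array => $parser->parseFile($p)->getDetails()
// );
// print_r($result);
```

Listing the producer next to each file makes it easier to see which generators yield unreadable text.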

If you need any other information, I can provide it. Attachment: sample.pdf

lgArlequin, Nov 30 '23 19:11

@lgArlequin Which PDFParser version do you use? Our latest version is 2.8.0-RC1: https://github.com/smalot/pdfparser/releases/tag/v2.8.0-RC1

k00ni, Dec 01 '23 09:12

@k00ni Yes, I switched to that version just in case, and I still have the problem. I'm not using Composer; I install the library manually. I don't know whether that changes anything.
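
For reference, a manual setup without Composer could look roughly like this. It is only a sketch: as far as I know the repository ships an `alt_autoload.php-dist` file for this purpose, and the `pdfparser/` directory below is an assumed location for the unpacked library, not something taken from this thread.

```php
<?php
// Assumed location of the manually unpacked library; adjust to your layout.
$autoload = __DIR__ . '/pdfparser/alt_autoload.php-dist';

// Load the library's alternate autoloader only if it is actually there.
if (is_file($autoload)) {
    require_once $autoload;
}

// Confirm that the Parser class is available before using it.
$loaded = class_exists('\Smalot\PdfParser\Parser');
echo $loaded
    ? "pdfparser loaded\n"
    : "pdfparser not found at $autoload\n";
```

Checking which copy of the library is loaded helps rule out an older, stale copy still being included.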

lgArlequin, Dec 01 '23 12:12