pdfparser
File parsed returns unreadable text
Thank you for the awesome PdfParser library. It has helped me a lot with my projects, and I have been using it quite often recently.
In my latest project, however, I ran into a problem: my PDF file is converted into unreadable text instead of plain text.
A friend told me I should do something about the encoding, but I'm not really sure.
I would be really glad if anybody could give me a hint about whether I'm missing something or whether this is indeed a bug in the library.
I've tried parsing the file as a whole and page by page, and neither worked.
Here is my code,
$PdfParser = new \Smalot\PdfParser\Parser();
$pdf = $PdfParser->parseFile($file);
$text = $pdf->getText();
echo $text;
And this is what it returns,
JHGSA IUYSHJG st GUH st GUH HUYGAH st JHGSA st st ttt t ss1 ss2 t 21666 !" #ssst #tt #t #ss1 #ss2 $# $ t%&t '(#tt 2$# t $$sst 2$ 2$'(#t 2$$sst t%&t ' # ttts !"#$ tt $$ss 2$$ss )tst t"+t t$ )tttst ,"1st$'(#t #s%( tt )t $t tst tss1 , )$ %' # ' # t$st HUYGAH IUYSHJG #$#$ %#$# % "t-." s )t$# )t$'(#t $s(%$" t%&t ttt 1st$%$ss tt tt%&t tt%&t ss/%$sst )$$sst )$'(#t %%'0t+#1 2sst t1 )tst )t$$ss tt%&t ttts )$# ,"t%&t ! "#$"#%!& !"#"#%!& !"#"#%!& ! ' ' ( )Y ! Z,-. JHGSA IUYSHJG GUH HUYGAH #sst t ss%tt# )ttt )t, )t )t($ '3st t % # ts%'1 )t*#t /%t$ss ttt%&t & tt(% & t-tts tt )tt 2$%$ss )tss2 )t' # t$st 2$tt%$ss 1st$$ss"tt t )t$$ss"tt t$%$ss t$$sst s$t%t )t%&t )$$ss )$%$ss )$tt%$ss $ t' ss ss/%t ' ss IUYSHJG& ' ss$ %%t1 tss4t tss5st tss5 tss5t tss51 tss52 ' # tst t' # ttts )t' st t$%" tt+t%" st%&t s&tt sss st sst st ! //$01$1% ) 2 (2) 2 () 31!#%%#% !"#"#%!& !"#"#%!& &$4$04%%%% ' ' ! &$4$04%%%% !"#$"#%!& !"#"#%!& !"#"#%!& ! ' ' ( )Y # Z,-.
Can you upload the PDF here so we can use it to reproduce the error? It must be free to use, because we will use it in the test suite.
Hello, I was able to parse a PDF using Smalot\PdfParser, and I would like to say you have done awesome work. The only problem I face is that it does not return the exact text when the language is Bengali. Is there any option to get the exact text for Bengali?
@taherbth How is that related to this issue? If it is not, please open a new issue and provide more information, such as example code, a PDF, and details about your setup like the PHP version.
Hi @k00ni, I have the same problem with one of my PDFs. At first I thought it was a multipage issue, but it still happens if I extract only one page.
My PHP version is 7.4.29. I have attached an example (it is an extracted page, since the original file contains sensitive information).
The original was apparently generated with PDFlib ([Producer] => PDFlib+PDI 9.1.2p1 (C++ legacy/Win64)). If I edit it with macOS Preview ([Producer] => macOS Version 13.6.1 (Build 22G313) Quartz PDFContext), it works fine, but I have no way to manually edit the originals on the system. For the example, I used the FPDI library ([Producer] => FPDF 1.86), and the result has the same parsing errors.
If you need any other information, I can provide it. sample.pdf
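For anyone trying to reproduce this, here is a minimal sketch that prints the producer metadata together with a short text preview for each page, which makes it easier to compare the PDFlib, FPDF, and Preview variants. It assumes the library is loaded through Composer's vendor/autoload.php and that the attachment is saved as sample.pdf next to the script; both paths are assumptions, and summarizeDetails() is a hypothetical helper, not part of PdfParser.

```php
<?php
// Sketch only: gather details that are useful in a bug report about garbled text.

/**
 * Hypothetical helper: keep only the metadata fields discussed in this thread.
 * Missing fields are reported as "(not set)".
 */
function summarizeDetails(array $details): array
{
    $summary = [];
    foreach (['Producer', 'Creator', 'Pages'] as $key) {
        $summary[$key] = $details[$key] ?? '(not set)';
    }
    return $summary;
}

$autoload = __DIR__ . '/vendor/autoload.php'; // assumed install location
$file     = __DIR__ . '/sample.pdf';          // assumed file name

if (is_file($autoload) && is_file($file)) {
    require $autoload;

    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile($file);

    // Metadata such as [Producer] comes from getDetails().
    print_r(summarizeDetails($pdf->getDetails()));

    // Show the first 80 bytes of each page to see which pages come out garbled.
    foreach ($pdf->getPages() as $index => $page) {
        echo 'Page ', $index + 1, ': ', substr($page->getText(), 0, 80), PHP_EOL;
    }
}
```

With a manual install, the require line would need to point at whatever loader you use instead of Composer's.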
@lgArlequin Which PdfParser version are you using? Our latest version is 2.8.0-RC1: https://github.com/smalot/pdfparser/releases/tag/v2.8.0-RC1
@k00ni Yes, I switched to that version just in case, and I still have the problem. I'm not using Composer; I installed it manually. I don't know if that makes a difference.