pdfparser
Fatal Error when parsing some PDFs
- PHP Version: 7.4
- PDFParser Version: 2.7.0 ( + 2.8.0-RC2)
Description:
Very recently started getting the following Fatal Error when trying to parse some PDF files...
PHP Fatal error: Uncaught Exception: Invalid object reference for $obj. in ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:529
Stack trace:
#0 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(240): Smalot\PdfParser\RawData\RawDataParser->getIndirectObject('%PDF-1.4\r\n%\xF9\xFA\x9A\xE7...', Array, '4', 203, true)
#1 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(905): Smalot\PdfParser\RawData\RawDataParser->decodeXrefStream('%PDF-1.4\r\n%\xF9\xFA\x9A\xE7...', 203, Array)
#2 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(216): Smalot\PdfParser\RawData\RawDataParser->getXrefData('%PDF-1.4\r\n%\xF9\xFA\x9A\xE7...', 203, Array)
#3 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(902): Smalot\PdfParser in ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 529
PDF input
I would be willing to provide a copy of the PDF if I can do so privately.
Expected output & actual output
My code should parse the contents of the PDF into a string of text and save it to a variable. Instead, certain PDF files trigger the fatal error above, and I can't tell why.
Code
$text = ''; // default, so the return below is defined for non-PDF files
$ext = pathinfo($path, PATHINFO_EXTENSION);
if (strtolower($ext) === 'pdf') {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($path);
    $text = $pdf->getText();
}
return $text;
Please try again with our latest version 2.8.0-RC2
I've run into an issue with the latest version (2.8.0-RC2). I was using this code:
$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$config->setRetainImageContent(false);
$config->setIgnoreEncryption(true);
// Memory limit to use when de-compressing files, in bytes
$config->setDecodeMemoryLimit(10240);
$parser = new \Smalot\PdfParser\Parser([], $config);
$PDF = $parser->parseFile($PDFfile);
$metaData = $PDF->getDetails();
die(json_encode($metaData, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES));
The expected result would be similar to this:
Code: 200 - {
"CreationDate": "2019-10-31T08:27:44+01:00",
"ModDate": "2019-12-10T07:07:05+01:00",
"Producer": "iText® 5.5.10 ©2000-2015 iText Group NV (****)",
"Pages": 3364, <--- notice this works
"xmp:createdate": "2019-10-31T08:27:44+01:00",
"xmp:modifydate": "2019-12-10T07:07:05+01:00",
"xmp:metadatadate": "2019-12-10T07:07:05+01:00",
"pdf:producer": "iText® 5.5.10 ©2000-2015 iText Group NV (***)",
"xmpmm:documentid": "uuid:5c870642-b206-4312-8c05-2646e3c946a0",
"xmpmm:instanceid": "uuid:729bb9a6-a048-4bcc-996d-d44ca9a5555c",
"dc:format": "application/pdf"
}
The bug I am getting with a bigger PDF (4546 pages) gives this result with the same PHP code as above:
Code: 200 - {
"Pages": 189
}
The PDF is 387 MB (406,340,557 bytes).
Thank you for confirming.
I've got a sample PDF file for a similar issue. It might be an "index" issue, since this code only works up to the 9th page. Example code:
for ($x = 0; $x <= 16; $x++) {
$pgcontent = $PDF->getPages()[$x]->getText();
echo("PageNr:".$x."\r\n".$pgcontent);
}
die("Done");
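The loop above always reads indexes 0 through 16, whether or not the parser actually returned that many pages. A minimal sketch of a more defensive variant follows, in case it helps narrow down which page fails. `collectPageTexts` is a hypothetical helper, not part of pdfparser; it only assumes that each page object has a `getText()` method, as in the loop above.

```php
<?php
// Hypothetical helper (not part of pdfparser): read up to $maxPages pages,
// bounded by how many pages the parser actually returned.
// $pages is assumed to be the array returned by $PDF->getPages().
function collectPageTexts(array $pages, int $maxPages): array
{
    $texts = [];
    $errors = [];
    // array_values() guards against non-sequential keys; array_slice() caps the count.
    foreach (array_slice(array_values($pages), 0, $maxPages) as $i => $page) {
        try {
            $texts[$i] = $page->getText();
        } catch (\Throwable $e) {
            // Remember which page failed and continue with the next one.
            $errors[$i] = $e->getMessage();
        }
    }
    return [$texts, $errors];
}
```

Calling `[$texts, $errors] = collectPageTexts($PDF->getPages(), 17);` would then show which page indexes produced text and which ones threw, instead of stopping at the first failure.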
This gives a 500 server error, even with try and catch:
try
{
$PDFContent = $PDF->getText(16);
}
catch (\Exception $e)
{
die( "PDF Problem: " . $e->getMessage());
}
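One possible reason the catch block does not help: since PHP 7, `catch (\Exception $e)` does not catch `\Error` subclasses such as `TypeError`; those need `\Throwable`. Whether that is what happens with this file is an assumption, and a genuine fatal error such as memory exhaustion cannot be caught either way. A minimal sketch, where `runGuarded` is a hypothetical helper and not part of pdfparser:

```php
<?php
// Hypothetical helper (not part of pdfparser): run one parsing step and
// convert any catchable failure into a result array instead of a crash.
function runGuarded(callable $step): array
{
    try {
        return ['ok' => true, 'value' => $step(), 'error' => null];
    } catch (\Throwable $e) {
        // \Throwable covers both \Exception and \Error (e.g. TypeError).
        return [
            'ok' => false,
            'value' => null,
            'error' => get_class($e) . ': ' . $e->getMessage(),
        ];
    }
}
```

With the code above this could be called as `runGuarded(function () use ($PDF) { return $PDF->getText(16); })`. If the 500 error persists even then, the server's PHP error log is the place to check for an uncatchable fatal error.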
When I look inside the PDF file with Foxit Reader, it reacts like there is an index issue around pages 8-9. Is it possible to send the PDF file and keep it private? :) (Feel free to PM me and ask for the file.)