Invalid object reference for $obj.
- PHP Version: 8.0
- PDFParser Version: 2.12.0
Description: Parsing this PDF fails with the error "Invalid object reference for $obj.", raised on line 544 of RawDataParser.php.
PDF input
Expected output & actual output
The text of the file
Code
// Read the PDF
try {
    $config = new \Smalot\PdfParser\Config();
    $config->setIgnoreEncryption(true);
    // The Config object is passed to the Parser constructor; parseFile() only takes the filename
    $parser = new \Smalot\PdfParser\Parser([], $config);
    $pdf = $parser->parseFile($pdfGenCu_FicheroTemporalRecibido);

    // Extract the text and, just in case, convert it all to uppercase
    $contenidoPDF = $pdf->getText();
    $contenidoPDF = strtoupper($contenidoPDF);

    // Collapse runs of two or more whitespace characters into a single space
    $contenidoPDF = preg_replace('/\s\s+/', ' ', $contenidoPDF);
} catch (\Exception $e) {
    // getLine() and getFile() are always available on exceptions
    $alert_text = 'Error processing the PDF: ' . $e->getMessage()
        . ' on line ' . $e->getLine()
        . ' in file ' . $e->getFile();
}
Hello,
I can confirm this bug and I've managed to create a minimal, reproducible test case using fuzzing, which might be easier to debug than a full PDF file.
Test Case:
This small input file consistently triggers the "Invalid object reference for $obj." error.
- Run the following command to create a file named minimized-crash.txt:
echo "JVBERi2oqApzdGFydHhyZWYKMjYKJSVFT0Y="|base64 -d >./minimized-crash.txt
- Run this PHP script:
<?php
require 'vendor/autoload.php';

// Load the minimized input produced by the fuzzer
$bad_pdf_content = file_get_contents(__DIR__ . '/minimized-crash.txt');
$parser = new \Smalot\PdfParser\Parser();

try {
    $parser->parseContent($bad_pdf_content);
} catch (\Throwable $t) {
    echo "CRASH (Reproduced): " . $t->getMessage() . "\n";
}
This should help pinpoint the issue more quickly. Thanks!
@N0zoM1z0 Your test data just contains a single reference to an object that doesn't exist. When trying to parse that data, pdfparser throwing the exception saying it can't find the referenced object seems correct to me.
What we need to establish, I think, is whether the example PDF that @noise3 uploaded actually contains a reference to an object that doesn't exist, or whether the object does exist and pdfparser isn't finding it for some other reason.
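For anyone who wants to check what the minimized input actually contains, here is a rough sketch that decodes the base64 string from the test case and compares its startxref value with the length of the data. The helper name findLastStartXrefOffset is made up for this illustration, and the check only looks at the raw bytes; it does not reproduce how pdfparser itself resolves the cross-reference table.

```php
<?php
// Decode the minimized test case posted earlier in this thread.
$data = base64_decode('JVBERi2oqApzdGFydHhyZWYKMjYKJSVFT0Y=', true);

/**
 * Hypothetical helper: return the byte offset that follows the last
 * "startxref" keyword in the data, or null if there is none.
 */
function findLastStartXrefOffset(string $data): ?int
{
    if (preg_match_all('/startxref\s+(\d+)/', $data, $matches) > 0) {
        return (int) end($matches[1]);
    }
    return null;
}

$length = strlen($data);
$offset = findLastStartXrefOffset($data);

// An offset at or beyond the end of the data cannot point at an xref section.
$pointsPastEnd = $offset !== null && $offset >= $length;

printf("length=%d startxref=%d pointsPastEnd=%s\n", $length, $offset, $pointsPastEnd ? 'yes' : 'no');
// prints: length=26 startxref=26 pointsPastEnd=yes
```

The decoded data is 26 bytes long and its startxref value is also 26, so the offset points just past the last byte and there is nothing there to read.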
@noise3 although this isn't a direct solution to your issue, I was facing the same problem and was able to find a workaround by using qpdf to fix the corrupted PDF before parsing it with this library.
Here is the output of qpdf --check grafico.anual.collado.pdf:
WARNING: grafico.anual.collado.pdf: file is damaged
WARNING: grafico.anual.collado.pdf (offset 369507): xref not found
WARNING: grafico.anual.collado.pdf: Attempting to reconstruct cross-reference table
checking grafico.anual.collado.pdf
PDF Version: 1.5
File is not encrypted
File is not linearized
qpdf: operation succeeded with warnings
As you can see, there are issues with the xref table that prevent pdfparser from working correctly. The workaround I found was to run qpdf grafico.anual.collado.pdf fixed_file.pdf and then run pdfparser as usual on fixed_file.pdf. Hope this is helpful in the meantime!
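If you want to automate this from PHP, something along these lines might work. It is only a sketch: the function names are hypothetical, it assumes the qpdf binary is on the PATH, and it relies on my understanding of the qpdf manual that exit code 0 means success, 3 means success with warnings (as in the output above), and 2 means errors.

```php
<?php
/**
 * Hypothetical helper: build the shell command that asks qpdf to rewrite
 * $input into $output, with both paths escaped for the shell.
 */
function buildQpdfRepairCommand(string $input, string $output): string
{
    return 'qpdf ' . escapeshellarg($input) . ' ' . escapeshellarg($output);
}

/**
 * Assumption from the qpdf manual: 0 = success, 3 = success with warnings.
 * Any other exit code is treated as a failure.
 */
function qpdfExitCodeIsUsable(int $exitCode): bool
{
    return $exitCode === 0 || $exitCode === 3;
}

/**
 * Run qpdf and throw if it fails or produces no output file.
 */
function repairPdfWithQpdf(string $input, string $output): void
{
    $lines = [];
    $exitCode = 1;
    exec(buildQpdfRepairCommand($input, $output) . ' 2>&1', $lines, $exitCode);

    if (!qpdfExitCodeIsUsable($exitCode) || !is_file($output)) {
        throw new RuntimeException(
            "qpdf could not repair $input: " . implode("\n", $lines)
        );
    }
}
```

You would call repairPdfWithQpdf('grafico.anual.collado.pdf', 'fixed_file.pdf') first and then pass fixed_file.pdf to $parser->parseFile() as usual. Accepting exit code 3 matters here because a damaged file that qpdf manages to reconstruct is reported as a warning, not as a clean success.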