Invalid object reference for $obj.

Open noise3 opened this issue 7 months ago • 3 comments

  • PHP Version: 8.0
  • PDFParser Version: 2.12.0

Description: With this PDF I get this error: "Invalid object reference for $obj." on line 544 of the file RawDataParser.php

PDF input

gráfico anual collado.pdf

Expected output & actual output

The text of the file

Code

$parser = new \Smalot\PdfParser\Parser();

// Read the PDF
try {
    $config = new \Smalot\PdfParser\Config();
    $config->setIgnoreEncryption(true);
    $pdf = $parser->parseFile($pdfGenCu_FicheroTemporalRecibido, $config);

    // Extract the text and, just in case, convert it all to uppercase
    $contenidoPDF = $pdf->getText();
    $contenidoPDF = strtoupper($contenidoPDF);
    // Collapse any run of two or more whitespace characters into one space
    $contenidoPDF = preg_replace('/\s\s+/', ' ', $contenidoPDF);
} catch (Exception $e) {
    $alert_text = 'Error processing the PDF: ' . $e->getMessage();
    if (method_exists($e, 'getLine') && $e->getLine() !== null) {
        $alert_text .= ' on line ' . $e->getLine();
    }
    if (method_exists($e, 'getFile') && $e->getFile() !== null) {
        $alert_text .= ' in file ' . $e->getFile();
    }
}

noise3, May 13 '25 18:05

Hello,

I can confirm this bug and I've managed to create a minimal, reproducible test case using fuzzing, which might be easier to debug than a full PDF file.

Test Case:

This small input file consistently triggers the "Invalid object reference for $obj." error.

  1. Create a file named minimized-crash.txt by running this command:
echo "JVBERi2oqApzdGFydHhyZWYKMjYKJSVFT0Y=" | base64 -d > ./minimized-crash.txt
  2. Run this PHP script:
<?php
require 'vendor/autoload.php';
$bad_pdf_content = file_get_contents(__DIR__ . '/minimized-crash.txt');
$parser = new \Smalot\PdfParser\Parser();
try {
    $parser->parseContent($bad_pdf_content);
} catch (\Throwable $t) {
    echo "CRASH (Reproduced): " . $t->getMessage() . "\n";
}

This should help pinpoint the issue more quickly. Thanks!

N0zoM1z0, Jul 28 '25 10:07

@N0zoM1z0 Your test data just contains a single reference to an object that doesn't exist. When trying to parse that data, pdfparser throwing the exception saying it can't find the referenced object seems correct to me.
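For anyone who wants to check this against the test case, here is a rough shell sketch. It decodes the base64 string from the comment above and compares the file size with the offset that follows the startxref keyword. The variable names (size, offset, objects) are only illustrative, and the "N G obj" pattern is a crude search, not a real PDF parse.

```shell
# Recreate the fuzzed input from the base64 string posted above.
printf '%s' 'JVBERi2oqApzdGFydHhyZWYKMjYKJSVFT0Y=' | base64 -d > minimized-crash.txt

# Total size of the file in bytes.
size=$(wc -c < minimized-crash.txt | tr -d ' ')

# The line after "startxref" holds the byte offset at which the
# cross-reference section is supposed to begin.
offset=$(grep -a -A1 '^startxref$' minimized-crash.txt | tail -n 1)

# Crude count of lines containing an object definition ("N G obj").
# grep exits non-zero when nothing matches, hence the "|| true".
objects=$(grep -a -c -E '[0-9]+ [0-9]+ obj' minimized-crash.txt || true)

echo "file size: $size bytes"
echo "startxref offset: $offset"
echo "object definitions found: $objects"
```

If the offset is not smaller than the file size and no object definitions turn up, the file has nothing for the parser to resolve, which would be consistent with the exception being the expected result for this input.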

What we need to establish, I think, is if the example PDF that @noise3 uploaded actually contains a reference to an object that doesn't exist, or if the object does in fact exist, and pdfparser isn't finding it for some other reason.

rupertj, Aug 04 '25 08:08

@noise3 although this isn't a direct solution to your issue, I was facing the same problem and was able to find a workaround by using qpdf to fix the corrupted PDF before parsing it with this library.

Here is the output of qpdf --check grafico.anual.collado.pdf:

WARNING: grafico.anual.collado.pdf: file is damaged
WARNING: grafico.anual.collado.pdf (offset 369507): xref not found
WARNING: grafico.anual.collado.pdf: Attempting to reconstruct cross-reference table
checking grafico.anual.collado.pdf
PDF Version: 1.5
File is not encrypted
File is not linearized
qpdf: operation succeeded with warnings

As you can see, qpdf could not find the xref table and had to reconstruct it, which is presumably what prevents pdfparser from working correctly. The workaround I found was to run qpdf grafico.anual.collado.pdf fixed_file.pdf and then run pdfparser as usual on fixed_file.pdf. Hope this helps in the meantime!
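If the repair step has to run unattended, a small wrapper along these lines might help. It is only a sketch: repair_pdf is a hypothetical helper name, and it relies on qpdf's documented exit codes (0 for success, 3 for success with warnings, 2 for errors). A damaged file like the one above is expected to produce warnings, so exit code 3 is treated as success here.

```shell
# Hypothetical helper: repair_pdf INPUT OUTPUT
# Rewrites INPUT with qpdf so that the cross-reference table is rebuilt.
repair_pdf() {
    in=$1
    out=$2

    # Bail out early if qpdf is not available on this machine.
    if ! command -v qpdf >/dev/null 2>&1; then
        echo "qpdf is not installed" >&2
        return 127
    fi

    status=0
    qpdf "$in" "$out" || status=$?

    # 0 = clean success, 3 = success with warnings (e.g. xref rebuilt).
    # Anything else is passed through as a failure.
    case $status in
        0|3) return 0 ;;
        *)   return "$status" ;;
    esac
}
```

Calling repair_pdf grafico.anual.collado.pdf fixed_file.pdf and parsing fixed_file.pdf only when the function returns zero keeps a script running under set -e from aborting on qpdf's warning status.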

plt3, Sep 05 '25 14:09