Parser is skipping the first page

Open • datawench opened this issue 1 year ago • 1 comment

I'm parsing the PDF which can be found here: https://oag.ca.gov/system/files/Maxar%20-%20Adult%20CA%20Sample%20Ltr_Redacted.pdf

The parser appears to be skipping the first page, and only extracting text from the last two.

See link above.

I would expect the output to start with "MAXAR SPACE SYSTEMS", or perhaps "I write on behalf of." Instead, this is what I get:

"not been delayed due to any law enforcement investigation. We are also taking additional actions as required..." with interspersed tabs.

I'm using the simplest possible code:

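use Smalot\PdfParser\Parser; // assuming smalot/pdfparser, loaded via Composer's autoloader

$parser = new Parser();
$pdf = $parser->parseFile($filePath); // $filePath: local path to the downloaded PDF
$text = $pdf->getText();              // text of all pages, concatenated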

Nov 22 '24 15:11 datawench

The first page of the linked PDF is an image of text (plus the QR code, etc.), not actual text. pdfparser can't extract text from images; that would require OCR, which it does not do.
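To check which pages are affected, you could extract the text page by page and flag the ones that come back empty. This is only a rough sketch: `pagesWithoutText` is a hypothetical helper, not part of pdfparser, and it assumes an image-only page yields an empty or whitespace-only string. The commented usage assumes smalot/pdfparser installed via Composer.

```php
<?php
// Hypothetical helper (not part of pdfparser): given the text extracted from
// each page, return the 1-based numbers of the pages that yielded no text.
function pagesWithoutText(array $pageTexts): array
{
    $empty = [];
    foreach (array_values($pageTexts) as $i => $text) {
        // Treat whitespace-only output (e.g. stray tabs) as "no text".
        if (trim($text) === '') {
            $empty[] = $i + 1;
        }
    }
    return $empty;
}

// Usage with smalot/pdfparser (assumed to be installed via Composer):
// require 'vendor/autoload.php';
// $pdf   = (new \Smalot\PdfParser\Parser())->parseFile($filePath);
// $texts = array_map(fn ($page) => $page->getText(), $pdf->getPages());
// print_r(pagesWithoutText($texts));
```

Any page numbers it reports would need to go through an OCR tool before their text can be recovered.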

Jul 16 '25 08:07 rupertj