pdfparser Missing pages

When parsing this folder, the number of pages reported (25) does not match the actual page count (37).

Oct 19 '21 14:10 zimonh

In case someone wants to work on this: can we include the linked PDF in our test environment? It must be free of charge and without any obligations.

~~If yes, I will upload it to GitHub so we don't have to rely on your link.~~ PDFs are blocked by GitHub.

Oct 20 '21 06:10 k00ni

Yes, it's a digital promo folder for a Dutch supermarket. Thanks for the speedy response.

Oct 20 '21 08:10 zimonh

I took a look at this but couldn't find the cause. What I do know is that the page count drops to 25 instead of 37 once Parser::parseContent has run, specifically right after this line of that method: $document->setObjects($this->objects);. I don't understand Parser::parseContent well enough to tell how the raw data in the $data variable (after the call list($xref, $data) = $this->rawDataParser->parseData($content);) is interpreted to decide what is a page and what is not, or where that goes wrong.

I also ran an example to see whether another PHP library has the same problem. I used FPDI to get the page count, but as soon as FPDI loads the file, PHP throws the following fatal error:

PHP Fatal error:  Uncaught setasign\Fpdi\PdfParser\CrossReference\CrossReferenceException: This PDF document probably uses
 a compression technique which is not supported by the free parser shipped with FPDI. 
(See https://www.setasign.com/fpdi-pdf-parser for more details) in vendor\setasign\fpdi\src\PdfParser\CrossReference\CrossReference.php:257

I'm including this in case it helps others find the problem.

Sorry I can't help any further; I am not an expert in Parser::parseContent and related methods.

Oct 20 '21 20:10 izabala

@izabala Thank you for trying though!

Oct 21 '21 06:10 k00ni

I just looked into this issue. Firstly, with the current code in master this PDF now returns 32 pages, so only 5 pages are still missing.

My suspicion is that the problem lies in the following line in Smalot\PdfParser\Parser::parseObject:

$this->objects[$id] = $object;

The above code always overwrites the object at that key, even if an object is already stored there. Rewriting it as follows avoids the overwrite:

if (isset($this->objects[$id]) === false) {
    $this->objects[$id] = $object;
} else {
    $this->objects[$id . random_int(0, PHP_INT_MAX)] = $object;
}

With this change, all pages are present. This is not an elegant solution, though. I suspect the key for the xref is not complete, but I am not familiar with how we can rewrite the key to prevent collisions. @k00ni Do you have the knowledge to help me in the final stretch here?
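
To illustrate the collision described above, here is a minimal, self-contained sketch. It does not use pdfparser itself: the ids, the payloads, the function names and the "#n" suffix are all made up for illustration. It only shows why unconditional assignment drops an entry when an id repeats, and how a deterministic counter (instead of random_int) could keep both entries.

```php
<?php
declare(strict_types=1);

// Hypothetical parsed objects as [id, payload] pairs. The ids and payloads
// are invented; they are not taken from the PDF discussed in this issue.
$parsed = [
    ['5_0', 'page A'],
    ['6_0', 'page B'],
    ['5_0', 'page C'], // the same id seen a second time
];

/** Unconditional assignment: a later object replaces an earlier one. */
function storeOverwriting(array $parsed): array
{
    $objects = [];
    foreach ($parsed as [$id, $object]) {
        $objects[$id] = $object;
    }
    return $objects;
}

/** Keep the first object; store later duplicates under "<id>#<n>". */
function storeKeepingDuplicates(array $parsed): array
{
    $objects = [];
    $duplicates = []; // how many extra objects were seen per id
    foreach ($parsed as [$id, $object]) {
        if (!isset($objects[$id])) {
            $objects[$id] = $object;
            continue;
        }
        $duplicates[$id] = ($duplicates[$id] ?? 0) + 1;
        // Assumption: "#" never occurs in a real object id.
        $objects[$id . '#' . $duplicates[$id]] = $object;
    }
    return $objects;
}

echo count(storeOverwriting($parsed)), PHP_EOL;       // 2
echo count(storeKeepingDuplicates($parsed)), PHP_EOL; // 3
```

Like the random_int variant, this only stops objects from being lost. Anything that looks an object up by its original id would still get the first one, so whether keeping both is actually correct depends on why the id repeats in the first place.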

May 04 '22 15:05 PrinsFrank

> Do you have the knowledge to help me in the final stretch here?

Thank you for investing time in resolving this issue. My suggestion is to create a pull request, outline your thoughts and provide your code changes. There we can discuss them and try to find a solution. I am currently busy, but if I can, I would be happy to support you (and others involved) here.

May 04 '22 15:05 k00ni
