pdfparser
Missing pages
When parsing this folder, the number of pages returned (25) does not match the actual page count (37).
In case someone wants to work on this: can we include the linked PDF in our test environment? It must be free of charge and without any obligations.
~~If yes, I will upload it to GitHub so we don't have to rely on your link.~~ PDFs are blocked by GitHub.
Yes, it's a digital promo folder for a Dutch supermarket. Thanks for the speedy response.
I took a look at this but couldn't find the cause. All I know is that the count drops to 25 pages instead of 37 inside `Parser::parseContent`, right after the line `$document->setObjects($this->objects);` runs. I am not familiar enough with `Parser::parseContent` to understand how the raw data in the `$data` variable (set by `list($xref, $data) = $this->rawDataParser->parseData($content);`) is interpreted to decide what is a page and what is not, or where the problem lies.
I ran an example to see whether another PHP library has the same problem. I used FPDI to get the page count, but as soon as FPDI loads the file, PHP throws the following fatal error:
PHP Fatal error: Uncaught setasign\Fpdi\PdfParser\CrossReference\CrossReferenceException: This PDF document probably uses
a compression technique which is not supported by the free parser shipped with FPDI.
(See https://www.setasign.com/fpdi-pdf-parser for more details) in vendor\setasign\fpdi\src\PdfParser\CrossReference\CrossReference.php:257
I am including this in case it helps others find the problem.
Sorry, I can't help any further; I am not an expert in `Parser::parseContent` and related methods.
@izabala Thank you for trying though!
I just looked into this issue. Firstly, with the current code in master this PDF now returns 32 pages, so only 5 pages are still missing.
My suspicion is that the problem lies in the following line in `Smalot/PdfParser::parseObject`:
$this->objects[$id] = $object;
The above code always overwrites the object at that position, even if an object is already present at that location. If this is rewritten to:
if (isset($this->objects[$id]) === false) {
    $this->objects[$id] = $object;
} else {
    $this->objects[$id . random_int(0, PHP_INT_MAX)] = $object;
}
With this change, all pages are present. This is not an elegant solution, though. I suspect the key for the xref is not complete, but I am not familiar with how we can rewrite the key to prevent collisions. @k00ni Do you have the knowledge to help me in the final stretch here?
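To make the collision concrete, here is a minimal standalone sketch of the same idea with a deterministic suffix instead of `random_int`. It is not PdfParser code: `storeObject`, the `#n` suffix scheme, and the sample ID `12_0` are all hypothetical, chosen only to illustrate the behaviour. Whether the rest of the library tolerates suffixed keys is an open question that would need checking.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser).
 * Stores $object under $id. If $id is already taken, the existing entry
 * is kept and the new object is stored under "$id#1", "$id#2", and so on.
 * Returns the key that was actually used.
 */
function storeObject(array &$objects, string $id, $object): string
{
    // array_key_exists is used instead of isset so that a stored
    // null value still counts as "already present".
    if (!array_key_exists($id, $objects)) {
        $objects[$id] = $object;
        return $id;
    }

    // Find the first free deterministic suffix for this ID.
    $n = 1;
    while (array_key_exists($id . '#' . $n, $objects)) {
        $n++;
    }

    $key = $id . '#' . $n;
    $objects[$key] = $object;
    return $key;
}

// Three objects that all claim the same (illustrative) ID.
$objects = [];
$k1 = storeObject($objects, '12_0', 'first page object');
$k2 = storeObject($objects, '12_0', 'second page object');
$k3 = storeObject($objects, '12_0', 'third page object');

// With plain "$objects[$id] = $object" only one entry would survive;
// here all three are retained.
echo count($objects), PHP_EOL;
```

Compared with a random suffix, a counter gives the same keys on every run, which makes a failing test reproducible. It still only works around the collision rather than explaining why two objects end up with the same key.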
> Do you have the knowledge to help me in the final stretch here?
Thank you for investing time in resolving this issue. My suggestion to you is to create a pull request, outline your thoughts and provide your code changes. There we can discuss them and try to find a solution. I am currently busy, but if I can I would be happy to support you (and others involved) here.