pdfparser
Missing encoding data for "" error - fix suggestion
I had been getting this error for a while, and my workaround had been to export the PDF with embedded fonts. However, I think I have found the real issue.
'BaseEncoding' is optional in a PDF, so some PDFs have it as null, and that is what causes the exception.
A suggested fix is to default to 'StandardEncoding' when the base encoding is null. This may cause the PDF to look different than the user intended, but it will make it parseable instead of chucking the exception.
Code change to the getEncodingClass() function in Encoding.php:
    protected function getEncodingClass()
    {
        // Load reference table charset.
        $baseEncoding = preg_replace('/[^A-Z0-9]/is', '', $this->get('BaseEncoding')->getContent());

        // Fix for a null BaseEncoding in the PDF: default to StandardEncoding.
        if (!$baseEncoding) {
            $baseEncoding = 'StandardEncoding';
        }

        $className = '\\Smalot\\PdfParser\\Encoding\\'.$baseEncoding;

        if (!class_exists($className)) {
            throw new Exception('Missing encoding data for: "'.$baseEncoding.'".');
        }

        return $className;
    }
I am sorry if this has already been solved or is not an applicable fix. I just wanted to share something that worked and hopefully give some insight into an error that had me scratching my head for a while.
If this isn't merged into the official Git repository, at least the solution will be here in case anyone else runs into this issue. Thanks.
Thank you for reporting.
You wrote:
This may cause the PDF to look different than the user intended, but it will make it parseable instead of chucking the exception.
Can you explain that a bit more, please?
Uh so it will depend on what native PDF reader and version they use, but there is a chance that the preview they have may interpret BaseEncoding =
This fix above defaults us to 'StandardEncoding' but I have no idea what the default would be, let alone if it is the same for the hundreds of PDF viewers.
Either way I feel that this fix is better than the error.
Thank you for the explanation @slydev.
@j0k3r @izabala @rubenvanerk @smalot I would love to hear your opinion on this, how should PDFParser react in this case?
The only drawback I can see is that the PDF might be broken if we load StandardEncoding while in fact the encoding is something completely different (but we can't find it in the first place).
Apart from that, what's better: an exception or a badly formatted PDF because of the default encoding?
The only drawback I can see is that the PDF might be broken if we load StandardEncoding while in fact the encoding is something completely different (but we can't find it in the first place).
Apart from that, what's better: an exception or a badly formatted PDF because of the default encoding?
I agree, this is a drawback and honestly a tough question. At the very least, this thread should give future developers who hit this issue a much better explanation of it, so that they can choose either to modify their PDF to include an encoding or to add the code if they need it.
I just don't know enough about PDFs, though I have done a bit of digging: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf (older 1.4 spec)
From page 350: "If this entry is absent, the Differences entry describes differences from an implicit base encoding. For a font program that is embedded in the PDF file, the implicit base encoding is the font program's built-in encoding, as described above and further elaborated in the sections on specific font types below. Otherwise, for a nonsymbolic font, it is StandardEncoding, and for a symbolic font, it is the font's built-in encoding."
So my interpretation is that StandardEncoding is right... until it is not, depending on the font. I am not sure whether the parser handles symbolic fonts. If it already does, then I feel this is the right answer. If it doesn't, then symbolic fonts are probably a problem with any of the encodings anyway, so this may still be a good solution.
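To make the quoted rule concrete, here is a minimal sketch of the decision it describes. The function name, its two boolean parameters, and the 'BuiltIn' label are all hypothetical and exist only for this illustration. They are not part of PdfParser, and how one would detect an embedded or symbolic font in the parser is left open.

```php
<?php
// Hypothetical helper, not part of PdfParser: picks the implicit base
// encoding for a font whose Encoding dictionary has no BaseEncoding entry,
// following the passage from the PDF 1.4 reference quoted above.
function implicitBaseEncoding(bool $fontIsEmbedded, bool $fontIsSymbolic): string
{
    // Embedded font programs and symbolic fonts both use the font's own
    // built-in encoding. 'BuiltIn' is just a placeholder label here.
    if ($fontIsEmbedded || $fontIsSymbolic) {
        return 'BuiltIn';
    }

    // Only a non-embedded, nonsymbolic font falls back to StandardEncoding.
    return 'StandardEncoding';
}
```

Read this way, defaulting to StandardEncoding matches the spec only for the last case, which is why the fix can produce wrong output for embedded or symbolic fonts.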
Thank you all for your interest and time.
Could we provide that default encoding using the config @k00ni?
By default, it would throw an error if no encoding is found. In the error message, we can explain that it might work if you enable useStandardEncoding in the config. Then, if that config option is true, we use StandardEncoding as the default.
Making this behavior configurable sounds like a reasonable solution. Can one of you please suggest something as a pull request?
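For anyone picking this up, the configurable behavior discussed above could look roughly like the standalone sketch below. It deliberately does not touch the real PdfParser classes: SketchConfig, its accessor methods, and resolveEncodingName() are hypothetical names made up for illustration, and only the option name useStandardEncoding comes from the suggestion in this thread.

```php
<?php
// Hypothetical stand-in for the parser configuration. Only the option name
// "useStandardEncoding" is taken from this thread; the rest is assumed.
final class SketchConfig
{
    private bool $useStandardEncoding = false;

    public function setUseStandardEncoding(bool $enabled): void
    {
        $this->useStandardEncoding = $enabled;
    }

    public function getUseStandardEncoding(): bool
    {
        return $this->useStandardEncoding;
    }
}

// Hypothetical helper mirroring the name lookup in getEncodingClass().
function resolveEncodingName(?string $rawBaseEncoding, SketchConfig $config): string
{
    // Same clean-up as the original code: keep letters and digits only.
    $name = preg_replace('/[^A-Z0-9]/is', '', (string) $rawBaseEncoding);

    if ($name !== '') {
        return $name;
    }

    // BaseEncoding is missing: fall back only if the user opted in.
    if ($config->getUseStandardEncoding()) {
        return 'StandardEncoding';
    }

    throw new RuntimeException(
        'Missing encoding data: the PDF has no BaseEncoding. '
        .'Enabling useStandardEncoding in the config might make it parseable.'
    );
}
```

With the default config, a missing BaseEncoding still throws, but the message now points at the option. After calling setUseStandardEncoding(true), the same input resolves to 'StandardEncoding'.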