
Missing encoding data for "" error - fix suggestion

Open slydev opened this issue 4 years ago • 7 comments

I had been getting this error for a while, and my fix had been to export the PDF with embedded fonts. However, I think I have found the real issue.

Because 'BaseEncoding' is optional in a PDF, some PDFs may have it set to null, and that causes the exception.

A suggested fix would be to use 'StandardEncoding' as the default if the base encoding is null. This may cause the PDF to look different from what the user intended; however, it will make it parseable instead of throwing the exception.

Code change to the getEncodingClass() function in Encoding.php:

    protected function getEncodingClass()
    {
        // Load reference table charset.
        $baseEncoding = preg_replace('/[^A-Z0-9]/is', '', $this->get('BaseEncoding')->getContent());

        // Fix for a null BaseEncoding in the PDF: default to the standard encoding.
        if (!$baseEncoding) {
            $baseEncoding = 'StandardEncoding';
        }

        $className = '\\Smalot\\PdfParser\\Encoding\\'.$baseEncoding;

        if (!class_exists($className)) {
            throw new Exception('Missing encoding data for: "'.$baseEncoding.'".');
        }

        return $className;
    }
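
For anyone wondering where the empty quotes in the message come from, here is a minimal sketch of the unpatched path. It assumes a missing 'BaseEncoding' yields null or an empty string; `$raw` is a stand-in for that value, not a pdfparser call.

```php
<?php
// Stand-in for the content of a missing BaseEncoding entry (assumption).
$raw = null;

// Same sanitising pattern as in getEncodingClass(): keep only letters and digits.
$baseEncoding = (string) preg_replace('/[^A-Z0-9]/is', '', (string) $raw);

// With an empty name the class name ends in a namespace separator,
// so no encoding class can ever match it.
$className = '\\Smalot\\PdfParser\\Encoding\\' . $baseEncoding;

// Second argument false: skip autoloading so the sketch stays standalone.
$found = class_exists($className, false);

$message = 'Missing encoding data for: "' . $baseEncoding . '".';
if (!$found) {
    echo $message, PHP_EOL; // prints: Missing encoding data for: "".
}
```

The empty string between the quotes is the sanitised 'BaseEncoding', which is why the message names no encoding at all.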

I am sorry if this has already been solved or is not an applicable fix. I just wanted to share something that worked and hopefully give some insight into an error that had me scratching my head for a while.

If this isn't submitted to the official Git repository, at least the solution will be here in case anyone else runs into this issue. Thanks.

slydev, Oct 03 '21 23:10

Thank you for reporting.

You wrote:

> This may cause the PDF to look different from what the user intended; however, it will make it parseable instead of throwing the exception.

Can you explain that a bit more, please?

k00ni, Oct 04 '21 06:10

It will depend on which native PDF reader and version they use, but there is a chance that their viewer interprets an empty BaseEncoding as something more system-specific (e.g. MacRomanEncoding or WinAnsiEncoding).

The fix above defaults to 'StandardEncoding', but I have no idea what default a given viewer uses, let alone whether it is the same across the hundreds of PDF viewers.

Either way, I feel that this fix is better than the error.

slydev, Oct 04 '21 09:10

Thank you for the explanation @slydev.

@j0k3r @izabala @rubenvanerk @smalot I would love to hear your opinion on this: how should PDFParser react in this case?

k00ni, Oct 04 '21 12:10

The only drawback I can see is that the PDF output might be broken if we load StandardEncoding while the actual encoding is something completely different (which we can't determine in the first place).

Apart from that, what's better: an exception or a badly formatted PDF because of the default encoding?

j0k3r, Oct 04 '21 13:10

> The only drawback I can see is that the PDF output might be broken if we load StandardEncoding while the actual encoding is something completely different (which we can't determine in the first place).
>
> Apart from that, what's better: an exception or a badly formatted PDF because of the default encoding?

I agree, this is a drawback and honestly a tough question. At the very least, this thread should give future developers who hit the issue a much better explanation, so that they can choose either to modify their PDF to include an encoding or to add the code if they need it.

I don't know enough about PDFs, though I have done a bit of digging: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf (older 1.4 spec)

From page 350: "If this entry is absent, the Differences entry describes differences from an implicit base encoding. For a font program that is embedded in the PDF file, the implicit base encoding is the font program's built-in encoding, as described above and further elaborated in the sections on specific font types below. Otherwise, for a nonsymbolic font, it is StandardEncoding, and for a symbolic font, it is the font's built-in encoding."

So my interpretation is that StandardEncoding is right... until it is not, depending on the font. I am not sure whether the parser handles symbolic fonts. If it already does, then I feel this is the right answer. If it doesn't, then symbolic fonts are probably a problem with any of the encodings anyway, so this may still be a good solution.
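
To make my reading of that paragraph concrete, here is a rough sketch of the decision it describes. The function name, its two boolean inputs, and the 'built-in' return value are all made up for illustration; I have not checked how (or whether) pdfparser exposes these font properties.

```php
<?php
/**
 * Pick the implicit base encoding when 'BaseEncoding' is absent,
 * following the PDF 1.4 rule quoted above.
 * Hypothetical helper: the name, parameters and the 'built-in' marker
 * are illustrative and not part of pdfparser.
 */
function implicitBaseEncoding(bool $fontIsEmbedded, bool $fontIsSymbolic): string
{
    // Embedded font programs and symbolic fonts both fall back to the
    // font's own built-in encoding.
    if ($fontIsEmbedded || $fontIsSymbolic) {
        return 'built-in';
    }

    // Non-embedded, nonsymbolic fonts use StandardEncoding.
    return 'StandardEncoding';
}

echo implicitBaseEncoding(false, false), PHP_EOL; // StandardEncoding
echo implicitBaseEncoding(true, false), PHP_EOL;  // built-in
```

Under this reading, the proposed default is only exact for the non-embedded, nonsymbolic case. Everywhere else it is an approximation.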

Thank you all for your interest and time.

slydev, Oct 04 '21 14:10

Could we provide that default encoding via the config, @k00ni? By default, it would throw an error if no encoding is found. In the error message, we could explain that it might work if you enable useStandardEncoding in the config. Then, if that config option is true, we use StandardEncoding as the default.
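
Something along these lines, as a rough sketch only. `SketchConfig`, its accessor names and `resolveBaseEncodingName()` are hypothetical and do not exist in pdfparser. Only the useStandardEncoding flag name comes from the idea above, and the error wording is a placeholder.

```php
<?php
// Hypothetical stand-in for the parser's config object; only the flag matters here.
final class SketchConfig
{
    private bool $useStandardEncoding = false;

    public function setUseStandardEncoding(bool $value): void
    {
        $this->useStandardEncoding = $value;
    }

    public function getUseStandardEncoding(): bool
    {
        return $this->useStandardEncoding;
    }
}

// Hypothetical helper: returns the encoding name to load, or throws if none can be chosen.
function resolveBaseEncodingName(?string $rawBaseEncoding, SketchConfig $config): string
{
    // Same sanitising pattern as in getEncodingClass().
    $name = (string) preg_replace('/[^A-Z0-9]/is', '', (string) $rawBaseEncoding);

    if ('' !== $name) {
        return $name;
    }

    // Opt-in fallback: only used when the caller explicitly enabled it.
    if ($config->getUseStandardEncoding()) {
        return 'StandardEncoding';
    }

    throw new RuntimeException(
        'Missing BaseEncoding in the PDF. Enabling useStandardEncoding in the config '
        .'may allow parsing, at the risk of wrongly decoded text.'
    );
}

$config = new SketchConfig();
$config->setUseStandardEncoding(true);
echo resolveBaseEncodingName(null, $config), PHP_EOL; // StandardEncoding
```

Keeping the flag off by default preserves the current behaviour, so nobody gets silently mis-decoded text unless they ask for it.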

j0k3r, Oct 04 '21 14:10

Making this behavior configurable sounds like a reasonable solution. Can one of you please suggest something as a pull request?

k00ni, Oct 08 '21 11:10