calculateTextWidth throws an error for some fonts

Open benlongstaff opened this issue 2 years ago • 8 comments

Undefined array key "Widths" in vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:279

Not all fonts have Widths data in the font header. For example, $font->getDetails() returns:

Array
(
    [Name] => WPJPNX+E13BP
    [Type] => Type0
    [Encoding] => Identity-H
    [BaseFont] => WPJPNX+E13BP
    [DescendantFonts] => Array
        (
            [0] => Array
                (
                    [Name] => WPJPNX+E13BP
                    [Type] => CIDFontType2
                    [Encoding] => Ansi
                    [BaseFont] => WPJPNX+E13BP
                    [CIDToGIDMap] => Identity
                    [DW] => 1000
                    [Subtype] => CIDFontType2
                )

        )

    [Subtype] => Type0
    [ToUnicode] => Array
        (
            [Filter] => FlateDecode
            [Length1] => 1024
            [Length] => 303
        )

)

vs. the expected:

Array
(
    [Name] => MyriadPro-Regular
    [Type] => Type1
    [Encoding] => WinAnsiEncoding
    [BaseFont] => MyriadPro-Regular
    [FirstChar] => 1
    [FontDescriptor] => Array
        (
            [Ascent] => 750
            [CapHeight] => 674
            [Descent] => -250
            [Flags] => 32
            [FontName] => MyriadPro-Regular
            [ItalicAngle] => 0
            [StemV] => 80
            [Type] => FontDescriptor
        )

    [LastChar] => 255
    [Subtype] => Type1
    [Widths] => Array
        (
            [0] => 0
            ...
            [254] => 471
        )

)

benlongstaff avatar Jan 23 '23 02:01 benlongstaff

Can you provide us the PDF?

k00ni avatar Jan 23 '23 06:01 k00ni

Unfortunately, the files are bank statements, so I will need to find a way to remove the elements with sensitive information.

Is there other information about the font I could provide in the meantime?

benlongstaff avatar Jan 23 '23 14:01 benlongstaff

The most helpful thing would be PHP code which triggers the error. The following is a rough (untested) example. Please have a look.

/*
 * $elements must contain faulty data to trigger the error.
 * $header->getDetails() is used inside "calculateTextWidth".
 * If it doesn't return an array with the key "Widths", the error occurs.
 *
 * You can build $elements yourself, or you can place a var_dump near
 * https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L278
 * and use that.
 */
use Smalot\PdfParser\Document;
use Smalot\PdfParser\Font;
use Smalot\PdfParser\Header;

$elements = [
    'Name' => '...',
    'Type' => '...',
    'Encoding' => '...',
    // 'Widths' => '...'       <=== must be missing to trigger the error
];
$header = new Header($elements);

$font = new Font(new Document(), $header);
// $missing is passed by reference, so a literal null cannot be given; omit it instead.
$font->calculateTextWidth(''); // call this to trigger error

k00ni avatar Jan 24 '23 07:01 k00ni

@benlongstaff Ignore my last comment. I realized it is more of a basis for a unit test: first trigger the error, then, after fixing it, make sure it doesn't happen again.

We would need two things to fix it:

  1. a unit test which triggers the error (see my comment with example code above)
  2. and a check inside the function to avoid the error in case no width is given.

Would you take the time to prepare a pull request? I will lead/assist you until it's merged.

Does the PDF specification allow a font without Widths? If so, a simple check should be fine (and/or even setting a default value). If it doesn't, it's more of an anomaly. In that case, what is the best way to handle it? Stick with the check?
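
On the "default value" option: the Type0 dump at the top of this issue has no Widths key, but its descendant font does carry DW => 1000, and in a CIDFont DW is the default glyph width. Below is a minimal sketch of reading that value, assuming the array shape shown in that dump. descendantDefaultWidth() is a hypothetical helper for illustration, not part of PdfParser.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): return the DW entry of the
 * first descendant font that has one, given the array from getDetails().
 * Returns null when no usable DW entry is found.
 */
function descendantDefaultWidth(array $details): ?float
{
    $descendants = $details['DescendantFonts'] ?? [];
    if (!is_array($descendants)) {
        return null;
    }

    foreach ($descendants as $descendant) {
        // getDetails() may hand back numbers as strings, so accept both.
        if (is_array($descendant) && isset($descendant['DW']) && is_numeric($descendant['DW'])) {
            return (float) $descendant['DW'];
        }
    }

    return null;
}

// Shape taken from the Type0 dump at the top of this issue (trimmed).
$type0 = [
    'Type' => 'Type0',
    'DescendantFonts' => [
        ['Type' => 'CIDFontType2', 'DW' => 1000],
    ],
];

// A simple font with Widths and no descendant fonts (trimmed, made-up values).
$type1 = ['Type' => 'Type1', 'Widths' => [0, 471]];

var_dump(descendantDefaultWidth($type0)); // float(1000)
var_dump(descendantDefaultWidth($type1)); // NULL
```

How such a default would be combined with the text and font size is left open here; that depends on how calculateTextWidth treats the values it reads from Widths.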

k00ni avatar Jan 25 '23 07:01 k00ni

Fortunately, the PDF for issue #592 has this font-without-Width problem as well and we already have permission to use it. /samples/bugs/Issue592.pdf

The key thing is, what do we want PdfParser to do in this case? Return zero (0)? Something like (in Font.php):

    /**
     * Calculate text width with data from header 'Widths'. If width of character is not found then character is added to missing array.
     */
    public function calculateTextWidth(string $text, array &$missing = null): ?float
    {
        $index_map = array_flip($this->table);
        $details = $this->getDetails();

        // If 'Widths' is not defined for this font, return 0
        // See: https://github.com/smalot/pdfparser/issues/570
        if (!isset($details['Widths'])) return 0;

        $widths = $details['Widths'];
...

GreyWyvern avatar Aug 02 '23 17:08 GreyWyvern

> The key thing is, what do we want PdfParser to do in this case? Return zero (0)?

I suggest -1 or null because it is an invalid width which is easy to check for. Whatever is returned in this case, the function header should be extended to reflect this behavior.
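
To make the null option concrete, here is a minimal standalone sketch. textWidthOrNull() is a hypothetical stand-in for calculateTextWidth that takes the getDetails() array directly. Its character lookup is deliberately simplified (it assumes Widths is indexed by character code minus FirstChar) and may not match what Font.php actually does.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for calculateTextWidth (not PdfParser code).
 * Returns null when the font details carry no 'Widths' array.
 */
function textWidthOrNull(array $details, string $text): ?float
{
    // No 'Widths' entry: report "width unknown" instead of raising a warning.
    if (!isset($details['Widths']) || !is_array($details['Widths'])) {
        return null;
    }

    $first = (int) ($details['FirstChar'] ?? 0);
    $width = 0.0;

    // Simplified lookup: one byte per character, index = code - FirstChar.
    for ($i = 0, $n = strlen($text); $i < $n; ++$i) {
        $index = ord($text[$i]) - $first;
        $width += (float) ($details['Widths'][$index] ?? 0);
    }

    return $width;
}

// Made-up values for illustration only.
$withWidths = ['FirstChar' => 65, 'Widths' => [600, 650]];
$withoutWidths = ['Type' => 'Type0'];

var_dump(textWidthOrNull($withWidths, 'AB'));    // float(1250)
var_dump(textWidthOrNull($withoutWidths, 'AB')); // NULL

// The caller can tell "unknown" apart from a real width of zero.
$width = textWidthOrNull($withoutWidths, 'AB');
echo $width === null ? "width unknown\n" : "width: $width\n";
```

One argument for null over 0 in this sketch: an empty string legitimately has width 0, so 0 cannot double as the "no Widths" signal, while a strict `=== null` check can.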

k00ni avatar Aug 03 '23 05:08 k00ni

This function doesn't seem to be used by any other function in PdfParser after running a quick search, so I think returning null, -1 or even false would be okay.

GreyWyvern avatar Aug 03 '23 23:08 GreyWyvern

I have the same issue: a font with no Widths generates a PHP Notice, and the text width calculation fails. The test PDF is the same one provided in issue #629.

The following code triggers the PHP Notice using the PDF sample mentioned above.

<?php

require_once __DIR__.'/pdfparser/alt_autoload.php-dist';

$config = new \Smalot\PdfParser\Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new \Smalot\PdfParser\Parser(array(), $config);

$pdf = $parser->parseFile('/tmp/doc.pdf');

$pages = $pdf->getPages();
$lastpage = end($pages);
$data = $lastpage->getDataTm();

echo "Items:".PHP_EOL;
$current_text = null;
foreach($data as $item) {
    if(is_array($item)) {
        $text = $item[1];
        if ($text != $current_text) {
            echo "- '$text'".PHP_EOL;
            $font = $lastpage->getFont($item[2]);
            echo "  font: ".$font->getName()." (".$font->getType().")"." size: ".$item[3].PHP_EOL;
            $missing = array();
            echo "  text width: ".$font->calculateTextWidth($text, $missing)." (missing: ".implode(',', $missing).")".PHP_EOL;
            $current_text = $text;
        }
    }
}

PS: this code needs the fix for issue #629 in order to detect the font properly.

Is there something I can do when generating the PDF to fix this issue in the PDF itself? I have (a little) control over the PDF generation.

I am mainly interested in making the text width calculation work rather than in preventing a PHP Notice.

Thanks again for your software, and thanks to the contributors. Best regards

mbideau-atreal avatar Aug 07 '23 14:08 mbideau-atreal