pdfparser
calculateTextWidth throws an error for some fonts
Undefined array key "Widths" in vendor/smalot/pdfparser/src/Smalot/PdfParser/Font.php:279
Not all fonts have the Widths data in the font header. For example, for one of the affected fonts $font->getDetails() returns:
Array
(
[Name] => WPJPNX+E13BP
[Type] => Type0
[Encoding] => Identity-H
[BaseFont] => WPJPNX+E13BP
[DescendantFonts] => Array
(
[0] => Array
(
[Name] => WPJPNX+E13BP
[Type] => CIDFontType2
[Encoding] => Ansi
[BaseFont] => WPJPNX+E13BP
[CIDToGIDMap] => Identity
[DW] => 1000
[Subtype] => CIDFontType2
)
)
[Subtype] => Type0
[ToUnicode] => Array
(
[Filter] => FlateDecode
[Length1] => 1024
[Length] => 303
)
)
versus the expected:
Array
(
[Name] => MyriadPro-Regular
[Type] => Type1
[Encoding] => WinAnsiEncoding
[BaseFont] => MyriadPro-Regular
[FirstChar] => 1
[FontDescriptor] => Array
(
[Ascent] => 750
[CapHeight] => 674
[Descent] => -250
[Flags] => 32
[FontName] => MyriadPro-Regular
[ItalicAngle] => 0
[StemV] => 80
[Type] => FontDescriptor
)
[LastChar] => 255
[Subtype] => Type1
[Widths] => Array
(
[0] => 0
...
[254] => 471
)
)
Can you provide us the PDF?
Unfortunately, the files are bank statements, so I will need to find a way to remove the elements with sensitive information.
Is there other information about the font I could provide in the meantime?
The most helpful thing would be PHP code which triggers the error. The following is a rough (untested) example. Please have a look.
use Smalot\PdfParser\Document;
use Smalot\PdfParser\Font;
use Smalot\PdfParser\Header;

/*
 * $elements must contain faulty data to trigger the error.
 * $header->getDetails() is used inside "calculateTextWidth".
 * If it doesn't return an array with the key "Widths", the error occurs.
 *
 * You can build $elements yourself, or place a var_dump near
 * https://github.com/smalot/pdfparser/blob/master/src/Smalot/PdfParser/Font.php#L278
 * and use its output.
 */
$elements = [
    'Name' => '...',
    'Type' => '...',
    'Encoding' => '...',
    // 'Widths' => '...' <=== must be missing to trigger the error
];
$header = new Header($elements);
$font = new Font(new Document(), $header);
// Call this to trigger the error. The second parameter is taken by
// reference, so a literal null cannot be passed; it is simply omitted.
$font->calculateTextWidth('');
@benlongstaff Ignore my last comment. I realized it is more of a basis for a unit test: first trigger the error and, after fixing it, make sure it doesn't happen again.
We would need two things to fix it:
- a unit test which triggers the error (see my comment with example code above)
- and a check inside the function to avoid the error in case no width is given.
Would you take the time and prepare a pull request? I will lead/assist you until it's merged.
Does the PDF specification allow a font without Widths? If so, a simple check should be fine (and/or even setting a default value). If it doesn't, it's more of an anomaly. In that case, what is the best way to handle it? Stick with the check?
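For context on the question above: in a composite (Type0) font such as the one in this issue, the top-level font dictionary has no Widths array. Glyph widths live in the descendant CIDFont instead, as a default width DW (1000 when absent) plus an optional W array of per-CID overrides. The following is a minimal sketch of a fallback built on that, operating on an array shaped like the output of getDetails(). It is not PdfParser code: the function name estimateDefaultWidth is hypothetical, and it assumes every character has the default width, ignoring any W array.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): estimate a text width from the
 * descendant font's default width (DW) when the font has no 'Widths' entry.
 * Assumption: every character uses DW; per-glyph 'W' overrides are ignored.
 *
 * @param array  $details Array shaped like the output of $font->getDetails()
 * @param string $text    UTF-8 text to measure
 *
 * @return float|null Width in glyph-space units, or null if no DW is available
 */
function estimateDefaultWidth(array $details, string $text): ?float
{
    // The null coalescing operator avoids a notice when any key is missing.
    $dw = $details['DescendantFonts'][0]['DW'] ?? null;
    if (!is_numeric($dw)) {
        return null;
    }

    // Split into UTF-8 characters; preg_split returns false on invalid input.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if (false === $chars) {
        return null;
    }

    return (float) $dw * count($chars);
}
```

Called with the Type0 details shown at the top of this issue (DW of 1000) and a three-character string, this returns 3000.0. For a font without DescendantFonts it returns null, so the caller can tell "unknown" apart from a real width.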
Fortunately, the PDF for issue #592 has this font-without-Width problem as well and we already have permission to use it. /samples/bugs/Issue592.pdf
The key thing is, what do we want PdfParser to do in this case? Return zero (0)? Something like (in Font.php):
/**
* Calculate text width with data from header 'Widths'. If width of character is not found then character is added to missing array.
*/
public function calculateTextWidth(string $text, array &$missing = null): ?float
{
$index_map = array_flip($this->table);
$details = $this->getDetails();
// If 'Widths' is not defined for this font, return 0
// See: https://github.com/smalot/pdfparser/issues/570
if (!isset($details['Widths'])) return 0;
$widths = $details['Widths'];
...
> The key thing is, what do we want PdfParser to do in this case? Return zero (0)?
I suggest -1 or null because it is an invalid width which is easy to check for. Whatever is returned in this case, the function header should be extended to reflect this behavior.
This function doesn't seem to be used by any other function in PdfParser after running a quick search, so I think returning null, -1 or even false would be okay.
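To illustrate why null is easy to check for, here is a small self-contained sketch. It does not use PdfParser itself: textWidthOrNull is a hypothetical stand-in for calculateTextWidth, and it assumes a simplified Widths array keyed directly by character, which is not how the library indexes widths internally.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for Font::calculateTextWidth(), working on plain arrays.
 * Assumption for this sketch: 'Widths' maps a character to its width.
 */
function textWidthOrNull(array $details, string $text): ?float
{
    // No usable 'Widths' entry: signal "unknown" with null instead of a notice.
    if (!isset($details['Widths']) || !is_array($details['Widths'])) {
        return null;
    }

    $total = 0.0;
    // Fall back to an empty list if the text is not valid UTF-8.
    foreach (preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $char) {
        // Characters without a width entry contribute 0 in this sketch.
        $total += (float) ($details['Widths'][$char] ?? 0);
    }

    return $total;
}

// Caller side: a strict null check separates "unknown" from a width of 0.
$width = textWidthOrNull(['Type' => 'Type0'], 'abc');
if (null === $width) {
    echo 'Text width is unavailable for this font.', PHP_EOL;
}
```

With a strict comparison, an empty string measured against a font that does have Widths (result 0.0) is not confused with a font that has none (result null), which is the ambiguity that returning 0 would introduce.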
I have the same issue: a font with no Widths generates a PHP Notice, and the text width calculation fails. The test PDF is the same as the one provided in issue #629.
The following code triggers the PHP Notice using the mentioned PDF sample.
<?php
require_once __DIR__.'/pdfparser/alt_autoload.php-dist';
$config = new \Smalot\PdfParser\Config();
$config->setDataTmFontInfoHasToBeIncluded(true);
$parser = new \Smalot\PdfParser\Parser(array(), $config);
$pdf = $parser->parseFile('/tmp/doc.pdf');
$pages = $pdf->getPages();
$lastpage = end($pages);
$data = $lastpage->getDataTm();
echo "Items:".PHP_EOL;
$current_text = null;
foreach($data as $item) {
if(is_array($item)) {
$text = $item[1];
if ($text != $current_text) {
echo "- '$text'".PHP_EOL;
$font = $lastpage->getFont($item[2]);
echo " font: ".$font->getName()." (".$font->getType().")"." size: ".$item[3].PHP_EOL;
$missing = array();
echo " text width: ".$font->calculateTextWidth($text, $missing)." (missing: ".implode(',', $missing).")".PHP_EOL;
$current_text = $text;
}
}
}
PS: this code needs the fix for issue #629 in order to detect the font properly.
Is there something I can do when generating the PDF to fix this issue in the PDF? I have (a little) control over the PDF generation.
I am mainly interested in making the text width calculation work rather than in preventing a PHP Notice.
Thanks again for your software, and thanks to the contributors. Best regards