
Text being cut off

Open maxresnikoff opened this issue 2 years ago • 0 comments

  • PHP Version: 7.4.33
  • PDFParser Version: 2.3.0

Description:

The only line that spans 2 columns is being cut off halfway through. I have tried adjusting the config settings for spacing etc., with no luck.

The line is an address. The parser grabs the first few words, cuts off, emits a \t, and then moves on to the next text in the table.
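To see exactly where the \t appears, the control characters in the extracted text can be made visible. The following is a minimal sketch in plain PHP; `showControlChars` is a hypothetical helper, not part of PDFParser, and the sample string is made up to mimic the behaviour described here.

```php
<?php
// Hypothetical helper (not part of PDFParser): replaces control characters
// with visible escape sequences so that tabs and newlines stand out.
function showControlChars(string $text): string
{
    $result = preg_replace_callback(
        '/[\x00-\x1F\x7F]/',
        static function (array $m): string {
            // Common control characters get their familiar escape names;
            // anything else is shown as a hex escape.
            $map = ["\t" => '\t', "\n" => '\n', "\r" => '\r'];
            return $map[$m[0]] ?? sprintf('\x%02X', ord($m[0]));
        },
        $text
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $text;
}

// Made-up sample mimicking the reported output, not real parser output.
echo showControlChars("57 FARM ROAD\tNEXT CELL\n"), PHP_EOL;
```

Passing the result of `$pdf->getText()` through such a helper would show whether the address is followed by a tab, a newline, or something else.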

PDF input

The PDF contains sensitive data, so unfortunately I cannot upload it. A screenshot is included here to show the structure. The one and only line that is being cut off is highlighted in orange in the image and, coincidentally, is the only one that spans 2 'columns'.

Screenshot 2023-01-26 130050

Expected output & actual output

Expected: 57 FARM ROAD NORWICH NORFOLK IP209UI

Output: 57 FARM ROAD

Code

// Parse PDF file and build necessary objects.
$config = new \Smalot\PdfParser\Config();
$config->setHorizontalOffset(''); // fixes the presentation of extra spaces issue
$config->setFontSpaceLimit(-60);

$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('filename.pdf');

$entire_pdf = $pdf->getText();
// $entire_pdf = $pdf->getPages()[0]->getDataTm(); //Testing as an array output still cuts off the address
// echo $entire_pdf;
echo json_encode($entire_pdf);
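One way to check whether the rest of the address is present at all in the `getDataTm()` array is to group the fragments by vertical position. The sketch below is a hedged illustration: `groupFragmentsByLine` is a hypothetical helper, not part of PDFParser, and it assumes each entry has the shape `[[a, b, c, d, x, y], 'text']`. The coordinates in the sample are invented.

```php
<?php
// Hypothetical helper (not part of PDFParser): groups text fragments by their
// rounded y coordinate so fragments on the same visual line end up together.
// Assumption: each entry looks like [[a, b, c, d, x, y], 'text'].
function groupFragmentsByLine(array $dataTm): array
{
    $lines = [];
    foreach ($dataTm as $entry) {
        [$matrix, $text] = $entry;
        $x = (float) $matrix[4];
        $y = (int) round((float) $matrix[5]);
        $lines[$y][] = ['x' => $x, 'text' => $text];
    }

    // Sort the fragments of each line from left to right.
    foreach ($lines as &$fragments) {
        usort($fragments, static fn (array $a, array $b): int => $a['x'] <=> $b['x']);
    }
    unset($fragments);

    return $lines;
}

// Made-up sample data with invented coordinates, not real parser output.
$sample = [
    [['1', '0', '0', '1', '300.5', '700.2'], 'NORWICH NORFOLK IP209UI'],
    [['1', '0', '0', '1', '72', '700'], '57 FARM ROAD'],
    [['1', '0', '0', '1', '72', '680'], 'NEXT ROW'],
];

foreach (groupFragmentsByLine($sample) as $y => $fragments) {
    echo $y, ': ', implode(' | ', array_column($fragments, 'text')), PHP_EOL;
}
```

With the real file, the array from the commented-out `$pdf->getPages()[0]->getDataTm()` call would be passed in place of `$sample`. If the second half of the address never shows up in any group, the text is being lost before positioning, not merely misplaced.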

A snippet of the JSON Encoded version showing the hidden characters: Screenshot 2023-01-26 130517

maxresnikoff, Jan 26 '23 13:01