pdfparser
Text being cut off
- PHP Version: 7.4.33
- PDFParser Version: 2.3.0
Description:
The only line that spans 2 columns is being cut off halfway through. I have played with the config settings for spacing etc., with no luck.
The line is an address. The parser grabs the first few words, then cuts off, emits a \t, and moves on to the next text in the table.
PDF input
The PDF contains sensitive data, so unfortunately I cannot upload it. A screenshot is here to show the structure. The one and only line that is being cut off is in orange in the image and, coincidentally, is the only one that spans 2 'columns'.
[Image: Screenshot 2023-01-26 130050]
Expected output & actual output
Expected: 57 FARM ROAD NORWICH NORFOLK IP209UI
Output: 57 FARM ROAD
Code
// Parse PDF file and build necessary objects.
$config = new \Smalot\PdfParser\Config();
$config->setHorizontalOffset(''); // fixes the presentation of extra spaces issue
$config->setFontSpaceLimit(-60);
$parser = new \Smalot\PdfParser\Parser([], $config);
$pdf = $parser->parseFile('filename.pdf');
$entire_pdf = $pdf->getText();
// $entire_pdf = $pdf->getPages()[0]->getDataTm(); // Testing the array output: the address is still cut off
// echo $entire_pdf;
echo json_encode($entire_pdf);
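To check whether the rest of the address is emitted as a separate fragment elsewhere in the output, the getDataTm() result can be grouped by vertical position. The following is a minimal sketch, not part of PdfParser: `groupFragmentsByRow` is a hypothetical helper, and it assumes each entry has the shape `[[a, b, c, d, x, y], 'text']` as returned by getDataTm(). The sample coordinates are made up for illustration and do not come from the real PDF.

```php
<?php
// Hypothetical helper (not a PdfParser API): groups text fragments into rows
// by rounding their Y coordinate, then sorts each row left to right by X.
// Assumes entries shaped like getDataTm() output: [[a, b, c, d, x, y], 'text'].
function groupFragmentsByRow(array $dataTm): array
{
    $rows = [];
    foreach ($dataTm as [$tm, $text]) {
        $x = (float) $tm[4];
        $y = (int) round((float) $tm[5]); // rounding merges fragments on nearly the same baseline
        $rows[$y][] = ['x' => $x, 'text' => $text];
    }

    // PDF Y coordinates grow upward, so a larger Y is higher on the page.
    krsort($rows);

    foreach ($rows as &$fragments) {
        usort($fragments, fn ($a, $b) => $a['x'] <=> $b['x']);
    }
    unset($fragments);

    return $rows;
}

// Made-up sample data; with the real file this would be
// $pdf->getPages()[0]->getDataTm() instead.
$sample = [
    [['1', '0', '0', '1', '300.0', '700.2'], 'NORWICH NORFOLK IP209UI'],
    [['1', '0', '0', '1', '50.0', '700.0'], '57 FARM ROAD'],
    [['1', '0', '0', '1', '50.0', '680.0'], 'Next cell'],
];

$rows = groupFragmentsByRow($sample);

foreach ($rows as $y => $fragments) {
    echo $y . ': ' . implode(' | ', array_column($fragments, 'text')) . PHP_EOL;
}
```

If the missing part of the address shows up as its own fragment on the same row, the text is being extracted but separated during assembly. If it does not appear at all, the fragment is lost earlier in parsing.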
A snippet of the JSON-encoded version showing the hidden characters: