getDataTm() missing content
- PHP Version: 8.2.7
- PDFParser Version: 2.11
Description:
getText() can export all words from a PDF, but getDataTm() returns only part of them.
PDF input
Expected output & actual output
getText output: 备 注 购方开户银行:-; 银行账号:-; 销售方地址:上海市9200号; 电话:999999; 销方开户银行:999999-工行市分行营业部; 银行账号:99999999999; 账户号:99999999,账单月:999999。发票金额不包含赠费等,账单明细可通过网上营业厅查询;
getDataTm output: 备 注 购方开户银行:-; 银行账号:-;
Code
use Smalot\PdfParser\Parser;

// Variant 1: getText() prints the complete text of each page.
$parser = new Parser();
if (file_exists($pdfFile)) {
    $pdf = $parser->parseFile($pdfFile);
    $pages = $pdf->getPages();
    foreach ($pages as $page) {
        $cnt = $page->getText();
        print($cnt);
    }
}

// Variant 2: getDataTm() prints only part of the same text.
$parser = new Parser();
if (file_exists($pdfFile)) {
    $pdf = $parser->parseFile($pdfFile);
    $pages = $pdf->getPages();
    foreach ($pages as $page) {
        $data = $page->getDataTm();
        foreach ($data as $block) {
            // Index 1 of each entry holds the text fragment.
            $cnt = $block[1];
            print($cnt . "\n");
        }
    }
}
> PHP Version: 3.10
Please check again: PHP 3.10 has been end-of-life for more than 20 years.
> getText can export all words from a PDF, but getDataTm returns part of them
Also, please be more specific here.
Can you provide a test PDF?
Sorry, the PHP version is 8.2.7.
I've seen what could be the same issue that @shshken is describing. For me, getDataTm() is consistently missing the last text element from a page.
Using this PDF: https://northeastcommunityforest.org.uk/sites/default/files/10261667/2023-08/20230831_NECF_Annual-report-2023.pdf
On the 6th page (with the text "Watch the Town Moor project video"), getDataTm() doesn't return the text "Moor project video".
If I step through getDataTm(), the string "Moor project video" is last in the $extractedTexts array, but it is never added to the $extractedData array. When the loop over $dataCommands gets to that text, the remaining commands are ET and Q, and then the function returns.
If you run this code with the PDF I linked to, you can see the missing text.
<?php
include './vendor/autoload.php';

$parser = new Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('20230831_NECF_Annual-report-2023.pdf');

foreach ($pdf->getPages() as $i => $page) {
    // Parenthesize the addition so the result does not depend on operator precedence.
    echo "\n\n#####\nPage " . ($i + 1) . "\n#####\n\n";
    // Print every non-empty text fragment returned by getDataTm().
    foreach ($page->getDataTm() as $data) {
        if (isset($data[1]) && ($data[1] !== '')) {
            echo $data[1] . "\n";
        }
    }
}
E.g.:
- Page 1 is missing "northeastcommunityforest.org.uk".
- Page 2 has an empty string as its last element, so nothing appears missing.
- Page 3 is missing "wellbeing and contributing to efforts to tackle climate change."
- Page 4 is missing "Instagram".
- Page 5 is missing "achievements 2022/23".
- Page 6 is missing "Moor project video".
- etc.
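To spot such gaps without reading every page by hand, one option is to compare the words from getText() against the fragments from getDataTm(). Below is a minimal sketch of that idea. Note that findMissingWords() is a hypothetical helper written for this thread, not part of PDFParser, and it assumes each getDataTm() entry carries its text at index 1, as in the scripts above.

```php
<?php
// Hypothetical helper (not part of PDFParser): returns the whitespace-separated
// words of $fullText that do not occur in any text fragment of $dataTm.
// Assumption: each $dataTm entry holds its text at index 1.
function findMissingWords(string $fullText, array $dataTm): array
{
    // Concatenate all fragments into one searchable string.
    $seen = '';
    foreach ($dataTm as $entry) {
        if (isset($entry[1]) && is_string($entry[1])) {
            $seen .= ' ' . $entry[1];
        }
    }

    // Collect every word of the full text that is absent from the fragments.
    $missing = [];
    foreach (preg_split('/\s+/u', $fullText, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (strpos($seen, $word) === false) {
            $missing[] = $word;
        }
    }

    return $missing;
}
```

For a real page it could be called as findMissingWords($page->getText(), $page->getDataTm()). This is only a heuristic: getText() and getDataTm() may split or space the text differently, so a reported word is a hint to look closer rather than proof that something was dropped.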
I think this issue is related to #733, if not exactly the same thing.
I've just tried parpalak's change from the other issue (commenting out the contents of the 'Do' case in PdfObject::getTextArray()), and my test script above now returns all the expected text.
@shshken does #775 help you in this regard?