pdfparser getDataTm() missing content

PHP Version: 8.2.7
PDFParser Version: 2.11

Description:

getText() exports all the text from the PDF, but getDataTm() returns only part of it.

PDF input

Expected output & actual output

getText output: 备注购方开户银行：-; 银行账号：-; 销售方地址：上海市9200号; 电话：999999; 销方开户银行：999999-工行市分行营业部; 银行账号：99999999999; 账户号：99999999，账单月：999999。发票金额不包含赠费等，账单明细可通过网上营业厅查询;

getDataTm output: 备注购方开户银行：-; 银行账号：-;

Code

use Smalot\PdfParser\Parser;

// Print the full text of every page via getText().
$parser = new Parser();
if (file_exists($pdfFile)) {
    $pdf = $parser->parseFile($pdfFile);
    $pages = $pdf->getPages();
    foreach ($pages as $page) {
        $cnt = $page->getText();
        print($cnt);
    }
}


// Print only the text part of every getDataTm() entry.
$parser = new Parser();
if (file_exists($pdfFile)) {
    $pdf = $parser->parseFile($pdfFile);
    $pages = $pdf->getPages();
    foreach ($pages as $page) {
        $data = $page->getDataTm();
        foreach ($data as $block) {
            $cnt = $block[1];
            print($cnt."\n");
        }
    }
}

Mar 17 '25 07:03 shshken

PHP Version: 3.10

Please check again: PHP 3.10 has been end-of-life for more than 20 years.

getText can export all words from a PDF, but getDataTm returns part of them

Also, be more specific here.

Can you provide a test PDF?

Mar 17 '25 08:03 k00ni

Sorry, the PHP version is 8.2.7.

Mar 18 '25 04:03 shshken

sorry, the PHP version is 8.2.7.

Mar 18 '25 04:03 shshken

I've seen what could be the same issue that @shshken is describing. For me, getDataTm() is consistently missing the last text element from a page.

Using this PDF: https://northeastcommunityforest.org.uk/sites/default/files/10261667/2023-08/20230831_NECF_Annual-report-2023.pdf

On the 6th page (with the text "Watch the Town Moor project video"), getDataTm() doesn't return the text "Moor project video".

If I step through getDataTm(), the string "Moor project video" is last in the $extractedTexts array but is never added to the $extractedData array. When the loop over $dataCommands gets to that text, the commands are ET and Q, and then the function returns.

If you run this code with the PDF I linked to, you can see the missing text.

<?php

include './vendor/autoload.php';

$parser = new Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('20230831_NECF_Annual-report-2023.pdf');
foreach ($pdf->getPages() as $i => $page) {
  echo "\n\n#####\nPage " . ($i + 1) . "\n#####\n\n";
  foreach ($page->getDataTm() as $data) {
    if (isset($data[1]) && ($data[1] !== '')) {
      echo $data[1] . "\n";
    }
  }
}

E.g.:
Page 1 is missing "northeastcommunityforest.org.uk".
Page 2 has an empty string as its last element, so nothing appears missing.
Page 3 is missing "wellbeing and contributing to efforts to tackle climate change."
Page 4 is missing "Instagram".
Page 5 is missing "achievements 2022/23".
Page 6 is missing "Moor project video".
etc...
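
In case it helps narrow this down, here is a rough sketch of a helper for listing what getDataTm() drops relative to getText() on a page. missingFromDataTm() is a hypothetical name, not part of smalot/pdfparser. It assumes each getDataTm() entry carries its text at index 1 (as in the script above) and that both methods yield the same characters apart from whitespace, which may not hold for every PDF.

```php
<?php

// Hypothetical helper (not part of smalot/pdfparser): returns the part of
// $fullText that no getDataTm() fragment accounts for, with whitespace removed.
function missingFromDataTm(string $fullText, array $dataTm): string
{
    // Drop all whitespace so layout differences between the two methods are ignored.
    $remaining = preg_replace('/\s+/u', '', $fullText) ?? '';

    foreach ($dataTm as $entry) {
        // Assumption: the text of each entry sits at index 1.
        $fragment = preg_replace('/\s+/u', '', (string) ($entry[1] ?? '')) ?? '';
        if ($fragment === '') {
            continue;
        }

        $pos = strpos($remaining, $fragment);
        if ($pos !== false) {
            // Remove only the first occurrence, so repeated text is consumed once per fragment.
            $remaining = substr_replace($remaining, '', $pos, strlen($fragment));
        }
    }

    return $remaining;
}
```

Inside the page loop it could be called as missingFromDataTm($page->getText(), $page->getDataTm()); a non-empty result is the text that getDataTm() did not return for that page.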

Jul 15 '25 11:07 rupertj

I think this issue is related to #733, if not exactly the same thing.

I've just tried parpalak's change in the other issue of commenting out the contents of the 'Do' case in PdfObject::getTextArray() and my test script above now successfully returns all the expected text.

Jul 16 '25 11:07 rupertj

@shshken does #775 help you in this regard?

Jul 29 '25 14:07 k00ni