pdfparser

getDataTm() missing content

Open shshken opened this issue 9 months ago • 3 comments

  • PHP Version: 8.2.7
  • PDFParser Version: 2.11

Description:

getText() exports all of the text from the PDF, but getDataTm() returns only part of it.

PDF input

Expected output & actual output

getText output: 备 注 购方开户银行:-; 银行账号:-; 销售方地址:上海市9200号; 电话:999999; 销方开户银行:999999-工行市分行营业部; 银行账号:99999999999; 账户号:99999999,账单月:999999。发票金额不包含赠费等,账单明细可通过网上营业厅查询;

getDataTm output: 备 注 购方开户银行:-; 银行账号:-;

Code

use Smalot\PdfParser\Parser;

// Print the full text of every page via getText().
$parser = new Parser();
if (file_exists($pdfFile)) {
    $pdf = $parser->parseFile($pdfFile);
    $pages = $pdf->getPages();
    foreach ($pages as $page) {
        $cnt = $page->getText();
        print($cnt);
    }
}


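// Print the text of every block returned by getDataTm().
// Each entry is [text matrix (Tm), text], so index 1 holds the string.
$parser = new Parser();
if (file_exists($pdfFile)) {
    $pdf = $parser->parseFile($pdfFile);
    $pages = $pdf->getPages();
    foreach ($pages as $page) {
        $data = $page->getDataTm();
        foreach ($data as $block) {
            $cnt = $block[1];
            print($cnt."\n");
        }
    }
}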

shshken avatar Mar 17 '25 07:03 shshken

PHP Version: 3.10

Please check again; PHP 3.10 has been end-of-life for 20 years or more.

getText can export all words from a PDF, but getDataTm returns part of them

Also, be more specific here.

Can you provide a test PDF?

k00ni avatar Mar 17 '25 08:03 k00ni

Sorry, the PHP version is 8.2.7.

shshken avatar Mar 18 '25 04:03 shshken

sorry, the PHP version is 8.2.7.

shshken avatar Mar 18 '25 04:03 shshken

I've seen what could be the same issue that @shshken is describing. For me, getDataTm() is consistently missing the last text element from a page.

Using this PDF: https://northeastcommunityforest.org.uk/sites/default/files/10261667/2023-08/20230831_NECF_Annual-report-2023.pdf

On the 6th page (with the text "Watch the Town Moor project video"), getDataTm() doesn't return the text "Moor project video".

If I step through getDataTm(), the string "Moor project video" is last in the $extractedTexts array, but it is never added to the $extractedData array. When the loop over $dataCommands gets to that text, the commands it sees are ET and Q, and then the function returns.

If you run this code with the PDF I linked to, you can see the missing text.

<?php

include './vendor/autoload.php';

$parser = new Smalot\PdfParser\Parser();
$pdf = $parser->parseFile('20230831_NECF_Annual-report-2023.pdf');
foreach ($pdf->getPages() as $i => $page) {
  echo "\n\n#####\nPage " . ($i + 1) . "\n#####\n\n";
  foreach ($page->getDataTm() as $data) {
    if (isset($data[1]) && ($data[1] !== '')) {
      echo $data[1] . "\n";
    }
  }
}

E.g.:

  • Page 1 is missing "northeastcommunityforest.org.uk"
  • Page 2 has an empty string as its last element, so nothing appears missing.
  • Page 3 is missing "wellbeing and contributing to efforts to tackle climate change."
  • Page 4 is missing "Instagram"
  • Page 5 is missing "achievements 2022/23"
  • Page 6 is missing "Moor project video"
  • etc.
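
For anyone who wants to find such gaps without reading the output by eye, here is a minimal sketch of a comparison helper. It is not part of pdfparser: findMissingTokens() is a hypothetical name, and it assumes only that getText() returns a string and that each getDataTm() entry keeps its text at index 1, as in the scripts above. It splits the getText() output on whitespace and reports every token that does not appear anywhere in the concatenated getDataTm() text, so a word that is dropped in one place but present elsewhere on the page will go unnoticed.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not a pdfparser API): lists the whitespace-separated
 * tokens of $fullText that occur nowhere in the text entries of $dataTm.
 *
 * @param string $fullText Page text, e.g. the result of $page->getText()
 * @param array  $dataTm   Entries shaped like $page->getDataTm() output,
 *                         i.e. each entry holds its text at index 1
 *
 * @return string[] Tokens from $fullText that are absent from $dataTm
 */
function findMissingTokens(string $fullText, array $dataTm): array
{
    // Concatenate every text entry into one search string.
    $haystack = '';
    foreach ($dataTm as $entry) {
        if (isset($entry[1]) && is_string($entry[1])) {
            $haystack .= $entry[1];
        }
    }

    // Strip whitespace so that differences in spacing between the two
    // extraction methods do not cause false reports.
    $haystack = preg_replace('/\s+/u', '', $haystack) ?? '';

    $tokens = preg_split('/\s+/u', $fullText, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $missing = [];
    foreach ($tokens as $token) {
        if (!str_contains($haystack, $token)) {
            $missing[] = $token;
        }
    }

    return $missing;
}
```

Inside the page loop of the test script this could be called as findMissingTokens($page->getText(), $page->getDataTm()); an empty result would suggest that both methods saw the same words on that page. It needs PHP 8.0 or later because of str_contains().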

rupertj avatar Jul 15 '25 11:07 rupertj

I think this issue is related to #733, if not exactly the same thing.

I've just tried parpalak's change from the other issue, which comments out the contents of the 'Do' case in PDFObject::getTextArray(). With that change, my test script above returns all the expected text.

rupertj avatar Jul 16 '25 11:07 rupertj

@shshken does #775 help you in this regard?

k00ni avatar Jul 29 '25 14:07 k00ni