getDataTm is not returning all the text

Open ridgey-dev opened this issue 10 months ago • 5 comments

  • PHP Version: 8.2.27
  • PDFParser Version: 2.11.0

Description:

PDF input

pdf-with-text.pdf

Expected output & actual output

The PDF contains the text {{signer1}}, but getDataTm does not return it for the second page.

[
  [
    [
      "1",
      "0",
      "0",
      "1",
      "36.266",
      "754.031"
    ],
    "Test"
  ],
  [
    [
      "1",
      "0",
      "0",
      "1",
      "34.016",
      "701.653"
    ],
    ""
  ]
]

Note that getTextArray does return {{signer1}}. The problem seems to be related to getTextArray returning an empty string for the second page (probably because of the image?).

[
  "Test",
  "",
  "{{signer1}}"
]
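To make the mismatch explicit, the two outputs can be compared directly. The following is only a rough sketch: `missingFromDataTm` is a hypothetical helper (not part of pdfparser), and it assumes each getDataTm entry has the shape `[[a, b, c, d, x, y], text]` shown above.

```php
<?php
// Hypothetical diagnostic helper, not part of pdfparser.
// $dataTm:    entries shaped like [[a, b, c, d, x, y], text]
// $textArray: flat list of strings, shaped like the getTextArray output
function missingFromDataTm(array $dataTm, array $textArray): array
{
    // Collect the text part (index 1) of every positioned entry.
    $positioned = array_column($dataTm, 1);

    // Ignore empty strings; they carry no text to look for.
    $nonEmpty = array_filter($textArray, static fn (string $t): bool => $t !== '');

    // Strings that getTextArray reports but getDataTm has no entry for.
    return array_values(array_diff($nonEmpty, $positioned));
}

// Data copied from the outputs quoted in this issue.
$dataTm = [
    [['1', '0', '0', '1', '36.266', '754.031'], 'Test'],
    [['1', '0', '0', '1', '34.016', '701.653'], ''],
];
$textArray = ['Test', '', '{{signer1}}'];

$missing = missingFromDataTm($dataTm, $textArray);
print_r($missing);
```

With the data from this issue, the only string left over is {{signer1}}.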

Code

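use Smalot\PdfParser\Parser;

// $content holds the raw bytes of the attached PDF
$parser = new Parser();
$document = $parser->parseContent($content);

// Dump every [matrix, text] entry of every page
foreach ($document->getPages() as $page) {
    foreach ($page->getDataTm() as $value) {
        var_dump($value);
    }
}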

ridgey-dev, Feb 06 '25 11:02

// Extract PDF (text-based content)
$parser = new Parser();
$pdf = $parser->parseFile($filePath);
$resumeContent = $pdf->getText();

This works for me

MaheKarim, Feb 12 '25 09:02

Yup. That is because the getText function contains some trim logic:

https://github.com/smalot/pdfparser/blob/0ddcc54910411a5189f109dc9739318308cd2f86/src/Smalot/PdfParser/Document.php#L439

But sadly, I need the position of the text too. So I can't just use the getText function.

Maybe a solution could be to move that check to getTextArray? I'm not very familiar with that code, so it would be helpful if someone could take a look at this.
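For context, this is roughly what the position lookup would look like once getDataTm returns the text. It is a minimal sketch: `findTextPositions` is a hypothetical helper (not part of pdfparser), and the second sample entry is the expected result, not what 2.11.0 returns.

```php
<?php
// Hypothetical helper, not part of pdfparser.
// Assumes each entry is shaped like [[a, b, c, d, x, y], text],
// so the x/y translation sits at matrix indexes 4 and 5.
function findTextPositions(array $dataTm, string $needle): array
{
    $positions = [];
    foreach ($dataTm as [$matrix, $text]) {
        if (strpos($text, $needle) !== false) {
            $positions[] = [
                'x' => (float) $matrix[4],
                'y' => (float) $matrix[5],
            ];
        }
    }
    return $positions;
}

// First entry is from the actual output; the second is the EXPECTED entry
// (2.11.0 returns an empty string at this position instead).
$expectedDataTm = [
    [['1', '0', '0', '1', '36.266', '754.031'], 'Test'],
    [['1', '0', '0', '1', '34.016', '701.653'], '{{signer1}}'],
];

$found = findTextPositions($expectedDataTm, '{{signer1}}');
print_r($found);
```

getText cannot support this, because it returns a single string without any coordinates.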

ridgey-dev, Feb 12 '25 10:02

I created a PR to fix this: https://github.com/smalot/pdfparser/pull/762. Can someone take a look at it?

ridgey-dev, Feb 12 '25 13:02

I think changing getTextArray() is the wrong approach. This will break functionality for people who expect elements with empty strings.

Instead, can you figure out why Page::getDataTm() is not returning multiple elements for each page? Admittedly, I'm not really sure how this method works.
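Callers who do not want the empty strings can already drop them on their side, without any change to the library. A minimal sketch, where `dropEmptyText` is a hypothetical helper name and the input is the getTextArray output quoted in this issue:

```php
<?php
// Hypothetical caller-side helper, not part of pdfparser:
// remove empty or whitespace-only entries from a getTextArray-shaped list.
function dropEmptyText(array $texts): array
{
    return array_values(array_filter(
        $texts,
        static fn (string $t): bool => trim($t) !== ''
    ));
}

$texts = ['Test', '', '{{signer1}}']; // output quoted in the issue
$filtered = dropEmptyText($texts);
print_r($filtered);
```

This keeps getTextArray() unchanged for everyone else, but it does not address the missing text in getDataTm().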

unixnut, Feb 27 '25 04:02

I see that the PR I created is now closed due to the breaking change.

I just want this PDF to return the correct text position. Is there someone who can look into this?

@j0k3r what do you suggest we do here?

ridgey-dev, Mar 12 '25 12:03

I've just tried the test case from this issue with pdfparser 2.12.1 and get this result:

array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(4) "56.8"
    [5]=>
    string(5) "758.1"
  }
  [1]=>
  string(4) "Dumm"
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(5) "106.9"
    [5]=>
    string(5) "758.1"
  }
  [1]=>
  string(1) "y"
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(5) "115.9"
    [5]=>
    string(5) "758.1"
  }
  [1]=>
  string(1) " "
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(5) "120.3"
    [5]=>
    string(5) "758.1"
  }
  [1]=>
  string(3) "PDF"
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(5) "152.5"
    [5]=>
    string(5) "758.1"
  }
  [1]=>
  string(3) " fi"
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(5) "166.8"
    [5]=>
    string(5) "758.1"
  }
  [1]=>
  string(2) "le"
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(6) "36.266"
    [5]=>
    string(7) "754.031"
  }
  [1]=>
  string(4) "Test"
}
array(2) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "1"
    [1]=>
    string(1) "0"
    [2]=>
    string(1) "0"
    [3]=>
    string(1) "1"
    [4]=>
    string(6) "34.016"
    [5]=>
    string(7) "701.653"
  }
  [1]=>
  string(11) "{{signer1}}"
}

So it looks like #775 did fix this issue as @daniser suggested, and it can be closed now.
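One thing to be aware of in the dump above: a single line can come back as several fragments ("Dumm", "y", " ", "PDF", " fi", "le"). If whole lines are needed, the fragments can be merged by baseline. This is only a sketch: `joinFragmentsByLine` is a hypothetical helper (not part of pdfparser), and it assumes fragments on the same line report exactly the same y value, as they do here.

```php
<?php
// Hypothetical helper, not part of pdfparser.
// Assumes each entry is shaped like [[a, b, c, d, x, y], text] and that
// fragments of one line share an identical y value.
function joinFragmentsByLine(array $dataTm): array
{
    // Group fragments by their y coordinate (kept as a string key).
    $lines = [];
    foreach ($dataTm as [$matrix, $text]) {
        $lines[(string) $matrix[5]][] = ['x' => (float) $matrix[4], 'text' => $text];
    }

    // Within each line, order fragments left to right and concatenate them.
    $result = [];
    foreach ($lines as $y => $fragments) {
        usort($fragments, static fn (array $a, array $b): int => $a['x'] <=> $b['x']);
        $result[] = [
            'y'    => (float) $y,
            'text' => implode('', array_column($fragments, 'text')),
        ];
    }
    return $result;
}

// Entries copied from the 2.12.1 dump above.
$fragments = [
    [['1', '0', '0', '1', '56.8', '758.1'], 'Dumm'],
    [['1', '0', '0', '1', '106.9', '758.1'], 'y'],
    [['1', '0', '0', '1', '115.9', '758.1'], ' '],
    [['1', '0', '0', '1', '120.3', '758.1'], 'PDF'],
    [['1', '0', '0', '1', '152.5', '758.1'], ' fi'],
    [['1', '0', '0', '1', '166.8', '758.1'], 'le'],
    [['1', '0', '0', '1', '34.016', '701.653'], '{{signer1}}'],
];

$joined = joinFragmentsByLine($fragments);
print_r($joined); // first line's text: "Dummy PDF file"
```

Grouping on the exact y string is the simplest choice; PDFs whose fragments differ slightly in y would need a tolerance instead.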

rupertj, Nov 03 '25 09:11