getDataTm is not returning all the text
- PHP Version: 8.2.27
- PDFParser Version: 2.11.0
Description:
PDF input
Expected output & actual output
The PDF contains the word {{signer1}}, but getDataTm() does not return this text for the second page.
[
    [
        ["1", "0", "0", "1", "36.266", "754.031"],
        "Test"
    ],
    [
        ["1", "0", "0", "1", "34.016", "701.653"],
        ""
    ]
]
Note that getTextArray() does return {{signer1}}. The problem seems to have something to do with getTextArray() returning an empty string for the second page (probably because of the image?).
[
    "Test",
    "",
    "{{signer1}}"
]
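To make the mismatch between the two outputs above easier to spot, here is a minimal sketch that lists the strings present in a getTextArray() result but absent from a getDataTm() result. The helper name findTextsMissingFromDataTm is hypothetical and not part of pdfparser; it assumes each getDataTm() entry has the [matrix, text] shape shown above, and the sample data is copied from this report.

```php
<?php
// Hypothetical helper (not part of pdfparser): returns the non-empty strings
// found in getTextArray()-style output that have no matching text in
// getDataTm()-style output.
function findTextsMissingFromDataTm(array $textArray, array $dataTm): array
{
    // Each getDataTm() entry is assumed to be [matrix, text]; keep the text part.
    $positioned = array_map(
        static fn (array $entry): string => (string) $entry[1],
        $dataTm
    );

    $missing = [];
    foreach ($textArray as $text) {
        // Ignore empty or whitespace-only elements; they carry no text to locate.
        if (trim($text) !== '' && !in_array($text, $positioned, true)) {
            $missing[] = $text;
        }
    }

    return $missing;
}

// Sample data copied from the outputs in this report.
$textArray = ['Test', '', '{{signer1}}'];
$dataTm = [
    [['1', '0', '0', '1', '36.266', '754.031'], 'Test'],
    [['1', '0', '0', '1', '34.016', '701.653'], ''],
];

$missing = findTextsMissingFromDataTm($textArray, $dataTm);
var_dump($missing);
```

With real documents, the two arguments would presumably come from $page->getTextArray() and $page->getDataTm() on the same page.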
Code
use Smalot\PdfParser\Parser;

// $content holds the raw contents of the PDF file.
$parser = new Parser();
$document = $parser->parseContent($content);
foreach ($document->getPages() as $page) {
    foreach ($page->getDataTm() as $value) {
        var_dump($value);
    }
}
// Extract PDF (Text Based Content)
$parser = new Parser();
$pdf = $parser->parseFile($filePath);
$resumeContent = $pdf->getText();
This works for me
Yup. That is because the getText function contains some trim logic:
https://github.com/smalot/pdfparser/blob/0ddcc54910411a5189f109dc9739318308cd2f86/src/Smalot/PdfParser/Document.php#L439
But sadly, I need the position of the text too. So I can't just use the getText function.
Maybe a solution could be to move that check to getTextArray()? I'm not very familiar with that code, so it would be helpful if someone could think about this.
I created a PR to fix this: https://github.com/smalot/pdfparser/pull/762. Can someone take a look at it?
I think changing getTextArray() is the wrong approach. This will break functionality for people who expect elements with empty strings.
Instead, can you figure out why Page::getDataTm() is not returning multiple elements for each page? Admittedly, I'm not really sure how this method works.
I see that the PR I created is now closed due to the breaking change.
I just want this PDF to return the correct text position. Is there someone who can look into this?
@j0k3r what do you suggest we do here?
I've just tried the test case from this issue with pdfparser 2.12.1 and get this result:
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(4) "56.8"
[5]=>
string(5) "758.1"
}
[1]=>
string(4) "Dumm"
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(5) "106.9"
[5]=>
string(5) "758.1"
}
[1]=>
string(1) "y"
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(5) "115.9"
[5]=>
string(5) "758.1"
}
[1]=>
string(1) " "
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(5) "120.3"
[5]=>
string(5) "758.1"
}
[1]=>
string(3) "PDF"
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(5) "152.5"
[5]=>
string(5) "758.1"
}
[1]=>
string(3) " fi"
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(5) "166.8"
[5]=>
string(5) "758.1"
}
[1]=>
string(2) "le"
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(6) "36.266"
[5]=>
string(7) "754.031"
}
[1]=>
string(4) "Test"
}
array(2) {
[0]=>
array(6) {
[0]=>
string(1) "1"
[1]=>
string(1) "0"
[2]=>
string(1) "0"
[3]=>
string(1) "1"
[4]=>
string(6) "34.016"
[5]=>
string(7) "701.653"
}
[1]=>
string(11) "{{signer1}}"
}
So it looks like #775 did fix this issue as @daniser suggested, and it can be closed now.
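For anyone who needs the position of a placeholder such as {{signer1}}, here is a minimal sketch of looking it up in getDataTm()-style output. The helper name findTextPosition is hypothetical and not part of pdfparser. It assumes each entry is [matrix, text] and that elements 4 and 5 of the six-element text matrix are the x and y translation, as in the dump above. The sample data is the last two entries of that dump.

```php
<?php
// Hypothetical helper (not part of pdfparser): returns the x/y translation of
// the first getDataTm()-style entry whose text equals $needle, or null if no
// entry matches.
function findTextPosition(array $dataTm, string $needle): ?array
{
    foreach ($dataTm as [$matrix, $text]) {
        if ($text === $needle) {
            // Indexes 4 and 5 of the text matrix [a, b, c, d, e, f]
            // hold the horizontal and vertical translation.
            return ['x' => (float) $matrix[4], 'y' => (float) $matrix[5]];
        }
    }

    return null;
}

// Last two entries from the var_dump output above.
$dataTm = [
    [['1', '0', '0', '1', '36.266', '754.031'], 'Test'],
    [['1', '0', '0', '1', '34.016', '701.653'], '{{signer1}}'],
];

$position = findTextPosition($dataTm, '{{signer1}}');
var_dump($position);
```

Because this matches whole strings only, a placeholder that the parser splits across several entries (like "Dumm" and "y" above) would not be found without joining adjacent entries first.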