pdfparser

getDataTm() positions wrong?

Open dartheditous opened this issue 1 year ago • 4 comments

  • PHP Version: 7.3.25
  • PDFParser Version: latest

Description:

I'm giving PDFParser a try, but already I can see that the positions reported when using getDataTm() seem to have various issues.

In one PDF, an invoice, all elements but one were reported as having the same Y coordinate (10.0049). Most of the X coordinates look wrong as well, like an element on the left has a higher X coordinate than an element on the right, and the numbers themselves don't seem right either. E.g. an element roughly in the middle of the page has X = 3.66, but then an element to the left has X = 37.1.

I tried another PDF, another invoice, and looked specifically at one of the invoice lines. My own parser gave me this:

[553.19] => Array // <- Y coordinate
(
    [46.84] => 1
    [66.83] => GSC12708
    [156.84] => Jujutsu Kaisen 0 Nendoroid Action Figure
    [361.72] => 5
    [396.08] => 27.36
    [429.30] => 1,50 %
    [481.08] => 26.95
    [526.52] => 134.75
)
[543.19] => Array
(
    [156.84] => Suguru Geto: Jujutsu Kaisen 0 Ver. 10 cm
)

This shows one line of several elements with the same Y coordinate (553.19), and one element at a different Y coordinate (the description of the item went onto two lines).
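
For anyone who wants to reproduce that row/column view from getDataTm() output, here is a minimal sketch. It assumes each entry has the shape [[a, b, c, d, x, y], 'text'], which is how I understand the getDataTm() return value. The helper name groupByLine() is made up for illustration, and the sample entries reuse a few values from the listing above.

```php
<?php
// Hypothetical helper: group [matrix, text] entries into rows keyed by Y.
// Assumes each entry has the shape [[a, b, c, d, x, y], 'text'].
function groupByLine(array $items): array
{
    $rows = [];
    foreach ($items as [$tm, $text]) {
        // Use formatted strings as keys: PHP truncates float array keys to integers.
        $x = sprintf('%.2f', $tm[4]);
        $y = sprintf('%.2f', $tm[5]);
        $rows[$y][$x] = $text;
    }
    // Larger Y first (higher on the page in default PDF user space), then left to right.
    krsort($rows, SORT_NUMERIC);
    foreach ($rows as &$row) {
        ksort($row, SORT_NUMERIC);
    }
    unset($row);
    return $rows;
}

// Sample entries, deliberately out of order.
$items = [
    [[1, 0, 0, 1, 156.84, 543.19], 'Suguru Geto: Jujutsu Kaisen 0 Ver. 10 cm'],
    [[1, 0, 0, 1, 66.83, 553.19], 'GSC12708'],
    [[1, 0, 0, 1, 46.84, 553.19], '1'],
];
$rows = groupByLine($items);
print_r($rows);
```

The keys are strings on purpose. With raw floats as keys, 553.19 and 553.87 would both collapse to the integer key 553.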

In PDFParser, the Y coordinates of those two description elements are swapped! The one beginning "Jujutsu" says 543.19, the one beginning "Suguru" says 553.19.

Am I missing something here, or is it just broken?

dartheditous, Feb 01 '24 20:02

On the first issue, it looks like pdfparser doesn't account for changes to the current transformation matrix (CTM), which affect the final text positions.
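
To illustrate what I mean: in the PDF model, the text matrix set by `Tm` is concatenated with the current transformation matrix set by `cm`, so the raw `Tm` values are not the final position whenever the CTM is something other than the identity. A rough sketch of that concatenation follows. The function name concatMatrix() and the numbers are made up for illustration and are not taken from pdfparser or from my PDFs.

```php
<?php
// Multiply two PDF matrices given as [a, b, c, d, e, f].
// $m is applied first, then $n (row-vector convention used by the PDF spec).
function concatMatrix(array $m, array $n): array
{
    return [
        $m[0] * $n[0] + $m[1] * $n[2],
        $m[0] * $n[1] + $m[1] * $n[3],
        $m[2] * $n[0] + $m[3] * $n[2],
        $m[2] * $n[1] + $m[3] * $n[3],
        $m[4] * $n[0] + $m[5] * $n[2] + $n[4],
        $m[4] * $n[1] + $m[5] * $n[3] + $n[5],
    ];
}

// Illustrative values only.
$ctm = [0.75, 0, 0, -0.75, 0, 792]; // from a `cm` operator: scale and flip the Y axis
$tm  = [1, 0, 0, 1, 100, 200];      // from a `Tm` operator

$final = concatMatrix($tm, $ctm);
printf("x = %.2f, y = %.2f\n", $final[4], $final[5]); // prints "x = 75.00, y = 642.00"
```

With a CTM like this one, reading (100, 200) straight from `Tm` would be wrong in both scale and direction, which would fit coordinates that look out of order.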

dartheditous, Feb 02 '24 12:02

I'm facing the same issue. For example, I use FPDI to get the width and height of a page, but pdfparser gives a different result: pdfparser reports {width: 612, height: 792}, while FPDI reports {width: 215.9, height: 279.4}. The FPDI result is the true size.
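
One thing worth checking here is the unit. PDF user space defaults to points (1/72 inch), and 612 x 792 points is US Letter. Assuming FPDI reports millimetres (the FPDF default unit, as far as I know), the two results may describe the same page. A quick conversion sketch, with pointsToMm() as a made-up helper name:

```php
<?php
// Convert PDF points (1/72 inch) to millimetres (25.4 mm per inch).
function pointsToMm(float $points): float
{
    return $points / 72 * 25.4;
}

$width  = round(pointsToMm(612), 1);
$height = round(pointsToMm(792), 1);
printf("%.1f x %.1f mm\n", $width, $height); // prints "215.9 x 279.4 mm"
```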

MuhammadAnsoriNasution, Feb 06 '24 14:02

Found this issue after running into a problem with Page::getDataTm() myself. All elements on one line have the same Y coordinate except the last one. The last one's Y coordinate is different (as if the item were one line lower), and instead of having a high X coordinate it is supposedly at the beginning of the line. Simply put, in my case the contents and the coordinates seem to be shifted by one relative to each other (PHP 8.3, pdfparser 2.10.0).

lee-van-oetz, Jun 11 '24 10:06

I believe lee-van-oetz is correct: it's an off-by-one error.

When I run:

$data = $page->getDataTm();
foreach ($data as $k => $td) {
    var_dump($td); // dump each [matrix, text] entry rather than the whole array
}

The result is rather long, but it looks to me like the first element in the array includes an empty string, but the coordinates of the first text item. Everything following that has the text strings, and the coordinates are for the item that follows.
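
If that reading is right, a stopgap could be to re-pair each matrix with the text of the following entry. This is only a sketch under that assumption. The realign() helper is made up for illustration and is not part of pdfparser, and the sample data is invented to mimic the shift described above.

```php
<?php
// Hypothetical workaround: assumes entry i carries the matrix that belongs
// to the text of entry i + 1, as described above.
function realign(array $data): array
{
    $fixed = [];
    for ($i = 0; $i + 1 < count($data); $i++) {
        // Pair this entry's matrix with the next entry's text.
        $fixed[] = [$data[$i][0], $data[$i + 1][1]];
    }
    return $fixed;
}

// Invented sample with the shift: an empty first string, texts one slot late.
$shifted = [
    [[1, 0, 0, 1, 46.84, 553.19], ''],
    [[1, 0, 0, 1, 66.83, 553.19], '1'],
    [[1, 0, 0, 1, 156.84, 553.19], 'GSC12708'],
];
$fixed = realign($shifted);
print_r($fixed);
```

The matrix of the last entry is left unused, since under this assumption there is no text that belongs to it.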

jamesmarchment, Jun 25 '24 00:06